Transformers are SSMs: Generalized models and efficient algorithms through structured state space duality

T Dao, A Gu - arXiv preprint arXiv:2405.21060, 2024 - arxiv.org
While Transformers have been the main architecture behind deep learning's success in
language modeling, state-space models (SSMs) such as Mamba have recently been shown …

Autoregressive model beats diffusion: Llama for scalable image generation

P Sun, Y Jiang, S Chen, S Zhang, B Peng… - arXiv preprint arXiv …, 2024 - arxiv.org
We introduce LlamaGen, a new family of image generation models that apply the original "next-token prediction" paradigm of large language models to the visual generation domain. It is an …

PaliGemma: A versatile 3B VLM for transfer

L Beyer, A Steiner, AS Pinto, A Kolesnikov… - arXiv preprint arXiv …, 2024 - arxiv.org
PaliGemma is an open Vision-Language Model (VLM) that is based on the SigLIP-So400m
vision encoder and the Gemma-2B language model. It is trained to be a versatile and …

VisionLLM v2: An end-to-end generalist multimodal large language model for hundreds of vision-language tasks

J Wu, M Zhong, S Xing, Z Lai, Z Liu… - Advances in …, 2025 - proceedings.neurips.cc
We present VisionLLM v2, an end-to-end generalist multimodal large language model (MLLM) that unifies visual perception, understanding, and generation within a single framework. Unlike …

Show-o: One single transformer to unify multimodal understanding and generation

J Xie, W Mao, Z Bai, DJ Zhang, W Wang, KQ Lin… - arXiv preprint arXiv …, 2024 - arxiv.org
We present a unified transformer, i.e., Show-o, that unifies multimodal understanding and generation. Unlike fully autoregressive models, Show-o unifies autoregressive and …

Emu3: Next-token prediction is all you need

X Wang, X Zhang, Z Luo, Q Sun, Y Cui, J Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
While next-token prediction is considered a promising path towards artificial general
intelligence, it has struggled to excel in multimodal tasks, which are still dominated by …

A survey of robot intelligence with large language models

H Jeong, H Lee, C Kim, S Shin - Applied Sciences, 2024 - mdpi.com
Since the emergence of ChatGPT, research on large language models (LLMs) has actively
progressed across various fields. LLMs, pre-trained on vast text datasets, have exhibited …

LongVILA: Scaling long-context visual language models for long videos

F Xue, Y Chen, D Li, Q Hu, L Zhu, X Li, Y Fang… - arXiv preprint arXiv …, 2024 - arxiv.org
Long-context capability is critical for multi-modal foundation models, especially for long
video understanding. We introduce LongVILA, a full-stack solution for long-context visual …

Vision language models are blind

P Rahmanzadehgervi, L Bolton… - Proceedings of the …, 2024 - openaccess.thecvf.com
Large language models (LLMs) with vision capabilities (e.g., GPT-4o, Gemini 1.5, and Claude 3) are powering countless image-text processing applications, enabling unprecedented …

Towards semantic equivalence of tokenization in multimodal LLM

S Wu, H Fei, X Li, J Ji, H Zhang, TS Chua… - arXiv preprint arXiv …, 2024 - arxiv.org
Multimodal Large Language Models (MLLMs) have demonstrated exceptional capabilities in processing vision-language tasks. One crux of MLLMs lies in vision tokenization …