PaliGemma: A versatile 3B VLM for transfer

L Beyer, A Steiner, AS Pinto, A Kolesnikov… - arXiv preprint arXiv …, 2024 - arxiv.org
PaliGemma is an open Vision-Language Model (VLM) that is based on the SigLIP-So400m
vision encoder and the Gemma-2B language model. It is trained to be a versatile and …

Emu3: Next-token prediction is all you need

X Wang, X Zhang, Z Luo, Q Sun, Y Cui, J Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
While next-token prediction is considered a promising path towards artificial general
intelligence, it has struggled to excel in multimodal tasks, which are still dominated by …

Show-o: One single transformer to unify multimodal understanding and generation

J Xie, W Mao, Z Bai, DJ Zhang, W Wang, KQ Lin… - arXiv preprint arXiv …, 2024 - arxiv.org
We present a unified transformer, i.e., Show-o, that unifies multimodal understanding and
generation. Unlike fully autoregressive models, Show-o unifies autoregressive and …

A survey of robot intelligence with large language models

H Jeong, H Lee, C Kim, S Shin - Applied Sciences, 2024 - mdpi.com
Since the emergence of ChatGPT, research on large language models (LLMs) has actively
progressed across various fields. LLMs, pre-trained on vast text datasets, have exhibited …

Vision language models are blind

P Rahmanzadehgervi, L Bolton… - Proceedings of the …, 2024 - openaccess.thecvf.com
Large language models (LLMs) with vision capabilities (e.g., GPT-4o, Gemini 1.5, and Claude
3) are powering countless image-text processing applications, enabling unprecedented …

LongVILA: Scaling long-context visual language models for long videos

F Xue, Y Chen, D Li, Q Hu, L Zhu, X Li, Y Fang… - arXiv preprint arXiv …, 2024 - arxiv.org
Long-context capability is critical for multi-modal foundation models, especially for long
video understanding. We introduce LongVILA, a full-stack solution for long-context visual …

Lumina-mGPT: Illuminate flexible photorealistic text-to-image generation with multimodal generative pretraining

D Liu, S Zhao, L Zhuo, W Lin, Y Qiao, H Li… - arXiv preprint arXiv …, 2024 - arxiv.org
We present Lumina-mGPT, a family of multimodal autoregressive models capable of various
vision and language tasks, particularly excelling in generating flexible photorealistic images …

OmniGen: Unified image generation

S Xiao, Y Wang, J Zhou, H Yuan, X Xing, R Yan… - arXiv preprint arXiv …, 2024 - arxiv.org
In this work, we introduce OmniGen, a new diffusion model for unified image generation.
Unlike popular diffusion models (e.g., Stable Diffusion), OmniGen no longer requires …

Janus: Decoupling visual encoding for unified multimodal understanding and generation

C Wu, X Chen, Z Wu, Y Ma, X Liu, Z Pan, W Liu… - arXiv preprint arXiv …, 2024 - arxiv.org
In this paper, we introduce Janus, an autoregressive framework that unifies multimodal
understanding and generation. Prior research often relies on a single visual encoder for …

Open-MAGVIT2: An open-source project toward democratizing auto-regressive visual generation

Z Luo, F Shi, Y Ge, Y Yang, L Wang, Y Shan - arXiv preprint arXiv …, 2024 - arxiv.org
We present Open-MAGVIT2, a family of auto-regressive image generation models ranging
from 300M to 1.5B. The Open-MAGVIT2 project produces an open-source replication of …