Emu3: Next-token prediction is all you need

X Wang, X Zhang, Z Luo, Q Sun, Y Cui, J Wang… - arxiv preprint arxiv …, 2024 - arxiv.org
While next-token prediction is considered a promising path towards artificial general
intelligence, it has struggled to excel in multimodal tasks, which are still dominated by …

VILA-U: A unified foundation model integrating visual understanding and generation

Y Wu, Z Zhang, J Chen, H Tang, D Li, Y Fang… - arxiv preprint arxiv …, 2024 - arxiv.org
VILA-U is a Unified foundation model that integrates Video, Image, Language understanding
and generation. Traditional visual language models (VLMs) use separate modules for …

OmniGen: Unified image generation

S Xiao, Y Wang, J Zhou, H Yuan, X Xing, R Yan… - arxiv preprint arxiv …, 2024 - arxiv.org
In this work, we introduce OmniGen, a new diffusion model for unified image generation.
Unlike popular diffusion models (e.g., Stable Diffusion), OmniGen no longer requires …

Janus: Decoupling visual encoding for unified multimodal understanding and generation

C Wu, X Chen, Z Wu, Y Ma, X Liu, Z Pan, W Liu… - arxiv preprint arxiv …, 2024 - arxiv.org
In this paper, we introduce Janus, an autoregressive framework that unifies multimodal
understanding and generation. Prior research often relies on a single visual encoder for …

Open-MAGVIT2: An open-source project toward democratizing auto-regressive visual generation

Z Luo, F Shi, Y Ge, Y Yang, L Wang, Y Shan - arxiv preprint arxiv …, 2024 - arxiv.org
We present Open-MAGVIT2, a family of auto-regressive image generation models ranging
from 300M to 1.5B. The Open-MAGVIT2 project produces an open-source replication of …

HART: Efficient visual generation with hybrid autoregressive transformer

H Tang, Y Wu, S Yang, E Xie, J Chen, J Chen… - arxiv preprint arxiv …, 2024 - arxiv.org
We introduce Hybrid Autoregressive Transformer (HART), an autoregressive (AR) visual
generation model capable of directly generating 1024x1024 images, rivaling diffusion …

Next Token Prediction Towards Multimodal Intelligence: A Comprehensive Survey

L Chen, Z Wang, S Ren, L Li, H Zhao, Y Li… - arxiv preprint arxiv …, 2024 - arxiv.org
Building on the foundations of language modeling in natural language processing, Next
Token Prediction (NTP) has evolved into a versatile training objective for machine learning …

DART: Denoising autoregressive transformer for scalable text-to-image generation

J Gu, Y Wang, Y Zhang, Q Zhang, D Zhang… - arxiv preprint arxiv …, 2024 - arxiv.org
Diffusion models have become the dominant approach for visual generation. They are
trained by denoising a Markovian process that gradually adds noise to the input. We …

ImageFolder: Autoregressive image generation with folded tokens

X Li, K Qiu, H Chen, J Kuen, J Gu, B Raj… - arxiv preprint arxiv …, 2024 - arxiv.org
Image tokenizers are crucial for visual generative models, e.g., diffusion models (DMs) and
autoregressive (AR) models, as they construct the latent representation for modeling …

WavChat: A survey of spoken dialogue models

S Ji, Y Chen, M Fang, J Zuo, J Lu, H Wang… - arxiv preprint arxiv …, 2024 - arxiv.org
Recent advancements in spoken dialogue models, exemplified by systems like GPT-4o,
have captured significant attention in the speech domain. Compared to traditional three-tier …