Google Scholar

GIVT: Generative infinite-vocabulary transformers

M Tschannen, C Eastwood, F Mentzer - European Conference on …, 2024 - Springer

We introduce Generative Infinite-Vocabulary Transformers (GIVT) which generate
vector sequences with real-valued entries, instead of discrete tokens from a finite …

Cited by 32 · All 2 versions
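The GIVT snippet contrasts discrete tokens drawn from a finite vocabulary with real-valued vectors. As a toy illustration only (not GIVT's implementation), the sketch below performs one sampling step each way: a token index from a softmax over logits, and a real vector from a Gaussian mixture, the kind of continuous output head GIVT uses. The function names and the diagonal-covariance mixture parameterization are hypothetical assumptions made for this example.

```python
import math
import random
from typing import List, Sequence


def sample_discrete_token(logits: Sequence[float], rng: random.Random) -> int:
    """Sample an index from a finite vocabulary via a softmax over logits."""
    m = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(x - m) for x in logits]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


def sample_real_vector(
    mix_logits: Sequence[float],
    means: Sequence[Sequence[float]],
    stds: Sequence[Sequence[float]],
    rng: random.Random,
) -> List[float]:
    """Sample a real-valued vector from a diagonal-Gaussian mixture.

    Hypothetical output head: mix_logits has one entry per component;
    means[k] and stds[k] are the d-dimensional parameters of component k.
    """
    k = sample_discrete_token(mix_logits, rng)  # choose a mixture component
    return [rng.gauss(mu, sigma) for mu, sigma in zip(means[k], stds[k])]


if __name__ == "__main__":
    rng = random.Random(0)
    token = sample_discrete_token([2.0, 0.5, -1.0], rng)
    vec = sample_real_vector(
        mix_logits=[0.0, 0.0],
        means=[[0.0, 0.0, 0.0], [5.0, 5.0, 5.0]],
        stds=[[0.1, 0.1, 0.1], [0.1, 0.1, 0.1]],
        rng=rng,
    )
    print("discrete token:", token, "real vector:", vec)
```

The discrete path can only ever return one of the vocabulary indices, while the continuous path returns arbitrary floats, which is the distinction the title's "infinite vocabulary" refers to.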

[PDF] arxiv.org

Lumina-mGPT: Illuminate flexible photorealistic text-to-image generation with multimodal generative pretraining

D Liu, S Zhao, L Zhuo, W Lin, Y Qiao, H Li… - arXiv preprint arXiv …, 2024 - arxiv.org

We present Lumina-mGPT, a family of multimodal autoregressive models capable of various
vision and language tasks, particularly excelling in generating flexible photorealistic images …

Cited by 29 · All 2 versions

[PDF] arxiv.org

EMOVA: Empowering language models to see, hear and speak with vivid emotions

K Chen, Y Gou, R Huang, Z Liu, D Tan, J Xu… - arXiv preprint arXiv …, 2024 - arxiv.org

GPT-4o, an omni-modal model that enables vocal conversations with diverse emotions and
tones, marks a milestone for omni-modal foundation models. However, empowering Large …

Cited by 14 · All 3 versions

[PDF] neurips.cc

QueST: Self-supervised skill abstractions for learning continuous control

A Mete, H Xue, A Wilcox, Y Chen… - Advances in Neural …, 2025 - proceedings.neurips.cc

Generalization capabilities, or rather a lack thereof, is one of the most important unsolved
problems in the field of robot learning, and while several large scale efforts have set out to …

Cited by 5 · All 6 versions

[PDF] arxiv.org

MaskBit: Embedding-free image generation via bit tokens

M Weber, L Yu, Q Yu, X Deng, X Shen… - arXiv preprint arXiv …, 2024 - arxiv.org

Masked transformer models for class-conditional image generation have become a
compelling alternative to diffusion models. Typically comprising two stages: an initial VQGAN …

Cited by 16 · All 4 versions

[PDF] arxiv.org

MaskGCT: Zero-shot text-to-speech with masked generative codec transformer

Y Wang, H Zhan, L Liu, R Zeng, H Guo, J Zheng… - arXiv preprint arXiv …, 2024 - arxiv.org

The recent large-scale text-to-speech (TTS) systems are usually grouped as autoregressive
and non-autoregressive systems. The autoregressive systems implicitly model duration but …

Cited by 15 · All 3 versions

[PDF] neurips.cc

Image understanding makes for a good tokenizer for image generation

L Wang, Y Zhao, Z Zhang, J Feng… - Advances in Neural …, 2025 - proceedings.neurips.cc

Modern image generation (IG) models have been shown to capture rich semantics valuable
for image understanding (IU) tasks. However, the potential of IU models to improve IG …

Cited by 2 · All 4 versions

[PDF] arxiv.org

Visual autoregressive modeling: Scalable image generation via next-scale prediction

K Tian, Y Jiang, Z Yuan, B Peng, L Wang - arXiv preprint arXiv:2404.02905, 2024 - arxiv.org

We present Visual AutoRegressive modeling (VAR), a new generation paradigm that
redefines the autoregressive learning on images as coarse-to-fine "next-scale prediction" or "…

Cited by 146 · All 3 versions
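The VAR snippet describes coarse-to-fine next-scale prediction: rather than emitting one token per step in raster-scan order, each autoregressive step emits an entire token map at the next resolution, conditioned on all coarser maps. A minimal sketch of that generation order only, under stated assumptions: the scale schedule, the `toy_predictor` stand-in, and all function names are hypothetical, and VAR's tokenizer and transformer are not shown.

```python
from typing import Callable, List

# A token map is a square grid of integer token ids.
TokenMap = List[List[int]]
# Hypothetical model interface: (coarser maps so far, next side length) -> new map.
Predictor = Callable[[List[TokenMap], int], TokenMap]


def raster_scan_steps(side: int) -> int:
    """Autoregressive steps for next-token prediction over a side x side map."""
    return side * side


def next_scale_generate(scales: List[int], predict_map: Predictor) -> List[TokenMap]:
    """Generate token maps coarse-to-fine, one whole map per autoregressive step.

    Each call to predict_map sees every previously generated (coarser) map,
    and all tokens of the new map are produced in that single step.
    """
    maps: List[TokenMap] = []
    for side in scales:
        new_map = predict_map(maps, side)
        if len(new_map) != side or any(len(row) != side for row in new_map):
            raise ValueError(f"predictor returned wrong shape for side {side}")
        maps.append(new_map)
    return maps


def toy_predictor(prev_maps: List[TokenMap], side: int) -> TokenMap:
    """Hypothetical stand-in for a transformer: fills the map with the
    number of coarser maps it was conditioned on."""
    return [[len(prev_maps)] * side for _ in range(side)]


if __name__ == "__main__":
    scales = [1, 2, 4, 8, 16]
    maps = next_scale_generate(scales, toy_predictor)
    print(len(maps), "next-scale steps vs", raster_scan_steps(scales[-1]), "raster steps")
```

With the assumed scales 1, 2, 4, 8, 16, reaching a 16×16 map takes 5 autoregressive steps, versus 256 steps for raster-scan next-token prediction over the same map.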

[PDF] arxiv.org

WavChat: A survey of spoken dialogue models

S Ji, Y Chen, M Fang, J Zuo, J Lu, H Wang… - arXiv preprint arXiv …, 2024 - arxiv.org

Recent advancements in spoken dialogue models, exemplified by systems like GPT-4o,
have captured significant attention in the speech domain. Compared to traditional three-tier …

Cited by 7 · All 2 versions

[PDF] arxiv.org

AdaNAT: Exploring adaptive policy for token-based image generation

Z Ni, Y Wang, R Zhou, R Lu, J Guo, J Hu, Z Liu… - … on Computer Vision, 2024 - Springer

Recent studies have demonstrated the effectiveness of token-based methods for visual
content generation. As a representative work, non-autoregressive Transformers (NATs) are …

Cited by 3 · All 6 versions

Finite scalar quantization: VQ-VAE made simple
