Open-Sora: Democratizing efficient video production for all

Z Zheng, X Peng, T Yang, C Shen, S Li, H Liu… - arXiv preprint arXiv …, 2024 - arxiv.org
Vision and language are the two foundational senses for humans, and they build up our
cognitive ability and intelligence. While significant breakthroughs have been made in AI …

Emu3: Next-token prediction is all you need

X Wang, X Zhang, Z Luo, Q Sun, Y Cui, J Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
While next-token prediction is considered a promising path towards artificial general
intelligence, it has struggled to excel in multimodal tasks, which are still dominated by …

DreamLIP: Language-image pre-training with long captions

K Zheng, Y Zhang, W Wu, F Lu, S Ma, X Jin… - … on Computer Vision, 2024 - Springer
Language-image pre-training largely relies on how precisely and thoroughly a text
describes its paired image. In practice, however, the contents of an image can be so rich that …

MiraData: A large-scale video dataset with long durations and structured captions

X Ju, Y Gao, Z Zhang, Z Yuan, X Wang, A Zeng… - arXiv preprint arXiv …, 2024 - arxiv.org
Sora's high-motion intensity and long consistent videos have significantly impacted the field
of video generation, attracting unprecedented attention. However, existing publicly available …

Lumina-mGPT: Illuminate flexible photorealistic text-to-image generation with multimodal generative pretraining

D Liu, S Zhao, L Zhuo, W Lin, Y Qiao, H Li… - arXiv preprint arXiv …, 2024 - arxiv.org
We present Lumina-mGPT, a family of multimodal autoregressive models capable of various
vision and language tasks, particularly excelling in generating flexible photorealistic images …

Representation alignment for generation: Training diffusion transformers is easier than you think

S Yu, S Kwak, H Jang, J Jeong, J Huang, J Shin… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent studies have shown that the denoising process in (generative) diffusion models can
induce meaningful (discriminative) representations inside the model, though the quality of …

Efficient diffusion transformer with step-wise dynamic attention mediators

Y Pu, Z Xia, J Guo, D Han, Q Li, D Li, Y Yuan… - … on Computer Vision, 2024 - Springer
This paper identifies significant redundancy in the query-key interactions within self-attention
mechanisms of diffusion transformer models, particularly during the early stages of …

Deep compression autoencoder for efficient high-resolution diffusion models

J Chen, H Cai, J Chen, E Xie, S Yang, H Tang… - arXiv preprint arXiv …, 2024 - arxiv.org
We present Deep Compression Autoencoder (DC-AE), a new family of autoencoder models
for accelerating high-resolution diffusion models. Existing autoencoder models have …

SVDQuant: Absorbing outliers by low-rank components for 4-bit diffusion models

M Li, Y Lin, Z Zhang, T Cai, X Li, J Guo, E Xie… - arXiv preprint arXiv …, 2024 - arxiv.org
Diffusion models have been proven highly effective at generating high-quality images.
However, as these models grow larger, they require significantly more memory and suffer …

Open-Sora Plan: Open-source large video generation model

B Lin, Y Ge, X Cheng, Z Li, B Zhu, S Wang, X He… - arXiv preprint arXiv …, 2024 - arxiv.org
We introduce Open-Sora Plan, an open-source project that aims to contribute a large
generation model for generating desired high-resolution videos with long durations based …