LaVie: High-quality video generation with cascaded latent diffusion models

Y Wang, X Chen, X Ma, S Zhou, Z Huang… - International Journal of …, 2024 - Springer
This work aims to learn a high-quality text-to-video (T2V) generative model by leveraging a
pre-trained text-to-image (T2I) model as a basis. It is a highly desirable yet challenging task …

PixArt-Σ: Weak-to-Strong Training of Diffusion Transformer for 4K Text-to-Image Generation

J Chen, C Ge, E Xie, Y Wu, L Yao, X Ren… - … on Computer Vision, 2024 - Springer
In this paper, we introduce PixArt-Σ, a Diffusion Transformer model (DiT) capable of directly
generating images at 4K resolution. PixArt-Σ represents a significant advancement over its …

Photorealistic video generation with diffusion models

A Gupta, L Yu, K Sohn, X Gu, M Hahn, FF Li… - … on Computer Vision, 2024 - Springer
We present WALT, a diffusion transformer for photorealistic video generation from text
prompts. Our approach has two key design decisions. First, we use a causal encoder to …

PhotoMaker: Customizing realistic human photos via stacked ID embedding

Z Li, M Cao, X Wang, Z Qi… - Proceedings of the …, 2024 - openaccess.thecvf.com
Recent advances in text-to-image generation have made remarkable progress in
synthesizing realistic human photos conditioned on given text prompts. However, existing …

Emu3: Next-token prediction is all you need

X Wang, X Zhang, Z Luo, Q Sun, Y Cui, J Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
While next-token prediction is considered a promising path towards artificial general
intelligence, it has struggled to excel in multimodal tasks, which are still dominated by …

Show-o: One single transformer to unify multimodal understanding and generation

J Xie, W Mao, Z Bai, DJ Zhang, W Wang, KQ Lin… - arXiv preprint arXiv …, 2024 - arxiv.org
We present a unified transformer, i.e., Show-o, that unifies multimodal understanding and
generation. Unlike fully autoregressive models, Show-o unifies autoregressive and …

Images are Achilles' heel of alignment: Exploiting visual vulnerabilities for jailbreaking multimodal large language models

Y Li, H Guo, K Zhou, WX Zhao, JR Wen - European Conference on …, 2024 - Springer
In this paper, we study the harmlessness alignment problem of multimodal large language
models (MLLMs). We conduct a systematic empirical analysis of the harmlessness …

Textdiffuser-2: Unleashing the power of language models for text rendering

J Chen, Y Huang, T Lv, L Cui, Q Chen, F Wei - European Conference on …, 2024 - Springer
The diffusion model has proven to be a powerful generative model in recent years, yet
generating visual text remains a challenge. Although existing work has endeavored to …

GenAI Arena: An open evaluation platform for generative models

D Jiang, M Ku, T Li, Y Ni, S Sun… - Advances in Neural …, 2025 - proceedings.neurips.cc
Generative AI has made remarkable strides to revolutionize fields such as image and video
generation. These advancements are driven by innovative algorithms, architecture, and …

VILA-U: A unified foundation model integrating visual understanding and generation

Y Wu, Z Zhang, J Chen, H Tang, D Li, Y Fang… - arXiv preprint arXiv …, 2024 - arxiv.org
VILA-U is a Unified foundation model that integrates Video, Image, Language understanding
and generation. Traditional visual language models (VLMs) use separate modules for …