LaVie: High-quality video generation with cascaded latent diffusion models
This work aims to learn a high-quality text-to-video (T2V) generative model by leveraging a
pre-trained text-to-image (T2I) model as a basis. It is a highly desirable yet challenging task …
PixArt-Σ: Weak-to-strong training of diffusion transformer for 4K text-to-image generation
In this paper, we introduce PixArt-Σ, a Diffusion Transformer model (DiT) capable of directly
generating images at 4K resolution. PixArt-Σ represents a significant advancement over its …
Photorealistic video generation with diffusion models
We present WALT, a diffusion transformer for photorealistic video generation from text
prompts. Our approach has two key design decisions. First, we use a causal encoder to …
PhotoMaker: Customizing realistic human photos via stacked ID embedding
Recent advances in text-to-image generation have made remarkable progress in
synthesizing realistic human photos conditioned on given text prompts. However, existing …
Emu3: Next-token prediction is all you need
While next-token prediction is considered a promising path towards artificial general
intelligence, it has struggled to excel in multimodal tasks, which are still dominated by …
Show-o: One single transformer to unify multimodal understanding and generation
We present a unified transformer, ie, Show-o, that unifies multimodal understanding and
We present a unified transformer, i.e., Show-o, that unifies multimodal understanding and
Images are Achilles' heel of alignment: Exploiting visual vulnerabilities for jailbreaking multimodal large language models
In this paper, we study the harmlessness alignment problem of multimodal large language
models (MLLMs). We conduct a systematic empirical analysis of the harmlessness …
TextDiffuser-2: Unleashing the power of language models for text rendering
The diffusion model has been proven a powerful generative model in recent years, yet it
remains a challenge in generating visual text. Although existing work has endeavored to …
GenAI Arena: An open evaluation platform for generative models
Generative AI has made remarkable strides to revolutionize fields such as image and video
generation. These advancements are driven by innovative algorithms, architecture, and …
VILA-U: A unified foundation model integrating visual understanding and generation
VILA-U is a Unified foundation model that integrates Video, Image, Language understanding
and generation. Traditional visual language models (VLMs) use separate modules for …