LaVie: High-quality video generation with cascaded latent diffusion models

Y Wang, X Chen, X Ma, S Zhou, Z Huang… - International Journal of …, 2024 - Springer
This work aims to learn a high-quality text-to-video (T2V) generative model by leveraging a
pre-trained text-to-image (T2I) model as a basis. It is a highly desirable yet challenging task …

PixArt-Σ: Weak-to-Strong Training of Diffusion Transformer for 4K Text-to-Image Generation

J Chen, C Ge, E Xie, Y Wu, L Yao, X Ren… - … on Computer Vision, 2024 - Springer
In this paper, we introduce PixArt-Σ, a Diffusion Transformer model (DiT) capable of directly
generating images at 4K resolution. PixArt-Σ represents a significant advancement over its …

Photorealistic video generation with diffusion models

A Gupta, L Yu, K Sohn, X Gu, M Hahn, FF Li… - … on Computer Vision, 2024 - Springer
We present WALT, a diffusion transformer for photorealistic video generation from text
prompts. Our approach has two key design decisions. First, we use a causal encoder to …

PhotoMaker: Customizing realistic human photos via stacked ID embedding

Z Li, M Cao, X Wang, Z Qi… - Proceedings of the …, 2024 - openaccess.thecvf.com
Recent advances in text-to-image generation have made remarkable progress in
synthesizing realistic human photos conditioned on given text prompts. However, existing …

Emu3: Next-token prediction is all you need

X Wang, X Zhang, Z Luo, Q Sun, Y Cui, J Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
While next-token prediction is considered a promising path towards artificial general
intelligence, it has struggled to excel in multimodal tasks, which are still dominated by …

Show-o: One single transformer to unify multimodal understanding and generation

J Xie, W Mao, Z Bai, DJ Zhang, W Wang, KQ Lin… - arXiv preprint arXiv …, 2024 - arxiv.org
We present a unified transformer, i.e., Show-o, that unifies multimodal understanding and
generation. Unlike fully autoregressive models, Show-o unifies autoregressive and …

Images are Achilles' heel of alignment: Exploiting visual vulnerabilities for jailbreaking multimodal large language models

Y Li, H Guo, K Zhou, WX Zhao, JR Wen - European Conference on …, 2024 - Springer
In this paper, we study the harmlessness alignment problem of multimodal large language
models (MLLMs). We conduct a systematic empirical analysis of the harmlessness …

Textdiffuser-2: Unleashing the power of language models for text rendering

J Chen, Y Huang, T Lv, L Cui, Q Chen, F Wei - European Conference on …, 2024 - Springer
The diffusion model has proven to be a powerful generative model in recent years, yet
generating visual text remains a challenge. Although existing work has endeavored to …

GenAI Arena: An open evaluation platform for generative models

D Jiang, M Ku, T Li, Y Ni, S Sun… - Advances in Neural …, 2025 - proceedings.neurips.cc
Generative AI has made remarkable strides to revolutionize fields such as image and video
generation. These advancements are driven by innovative algorithms, architecture, and …

VILA-U: A unified foundation model integrating visual understanding and generation

Y Wu, Z Zhang, J Chen, H Tang, D Li, Y Fang… - arXiv preprint arXiv …, 2024 - arxiv.org
VILA-U is a Unified foundation model that integrates Video, Image, Language understanding
and generation. Traditional visual language models (VLMs) use separate modules for …