PixArt-α: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis
The most advanced text-to-image (T2I) models require significant training costs (e.g., millions
of GPU hours), seriously hindering the fundamental innovation for the AIGC community …
Evaluating text-to-visual generation with image-to-text generation
Despite significant progress in generative AI, comprehensive evaluation remains
challenging because of the lack of effective metrics and standardized benchmarks. For …
VBench: Comprehensive benchmark suite for video generative models
Video generation has witnessed significant advancements yet evaluating these models
remains a challenge. A comprehensive evaluation benchmark for video generation is …
Emu3: Next-token prediction is all you need
While next-token prediction is considered a promising path towards artificial general
intelligence, it has struggled to excel in multimodal tasks, which are still dominated by …
Mastering text-to-image diffusion: Recaptioning, planning, and generating with multimodal LLMs
Diffusion models have exhibited exceptional performance in text-to-image generation and
editing. However, existing methods often face challenges when handling complex text …
Revision: Rendering tools enable spatial fidelity in vision-language models
Text-to-Image (T2I) and multimodal large language models (MLLMs) have been
adopted in solutions for several computer vision and multimodal learning tasks. However, it …
Deepfake: definitions, performance metrics and standards, datasets, and a meta-review
Recent advancements in AI, especially deep learning, have contributed to a significant
increase in the creation of new realistic-looking synthetic media (video, image, and audio) …
Unified hallucination detection for multimodal large language models
Despite significant strides in multimodal tasks, Multimodal Large Language Models (MLLMs)
are plagued by the critical issue of hallucination. The reliable detection of such …
Ranni: Taming text-to-image diffusion for accurate instruction following
Existing text-to-image (T2I) diffusion models usually struggle to interpret complex prompts,
especially those with quantity, object-attribute binding, and multi-subject descriptions. In this …
T2V-CompBench: A comprehensive benchmark for compositional text-to-video generation
Text-to-video (T2V) generation models have advanced significantly, yet their ability to
compose different objects, attributes, actions, and motions into a video remains unexplored …