PixArt-α: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis

J Chen, J Yu, C Ge, L Yao, E Xie, Y Wu, Z Wang… - arXiv preprint arXiv …, 2023 - arxiv.org
The most advanced text-to-image (T2I) models require significant training costs (e.g., millions
of GPU hours), seriously hindering the fundamental innovation for the AIGC community …

Evaluating text-to-visual generation with image-to-text generation

Z Lin, D Pathak, B Li, J Li, X Xia, G Neubig… - … on Computer Vision, 2024 - Springer
Despite significant progress in generative AI, comprehensive evaluation remains
challenging because of the lack of effective metrics and standardized benchmarks. For …

VBench: Comprehensive benchmark suite for video generative models

Z Huang, Y He, J Yu, F Zhang, C Si… - Proceedings of the …, 2024 - openaccess.thecvf.com
Video generation has witnessed significant advancements, yet evaluating these models
remains a challenge. A comprehensive evaluation benchmark for video generation is …

Emu3: Next-token prediction is all you need

X Wang, X Zhang, Z Luo, Q Sun, Y Cui, J Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
While next-token prediction is considered a promising path towards artificial general
intelligence, it has struggled to excel in multimodal tasks, which are still dominated by …

Mastering text-to-image diffusion: Recaptioning, planning, and generating with multimodal LLMs

L Yang, Z Yu, C Meng, M Xu, S Ermon… - Forty-first International …, 2024 - openreview.net
Diffusion models have exhibited exceptional performance in text-to-image generation and
editing. However, existing methods often face challenges when handling complex text …

Revision: Rendering tools enable spatial fidelity in vision-language models

A Chatterjee, Y Luo, T Gokhale, Y Yang… - European Conference on …, 2024 - Springer
Text-to-Image (T2I) and multimodal large language models (MLLMs) have been
adopted in solutions for several computer vision and multimodal learning tasks. However, it …

Deepfake: definitions, performance metrics and standards, datasets, and a meta-review

E Altuncu, VNL Franqueira, S Li - Frontiers in Big Data, 2024 - frontiersin.org
Recent advancements in AI, especially deep learning, have contributed to a significant
increase in the creation of new realistic-looking synthetic media (video, image, and audio) …

Unified hallucination detection for multimodal large language models

X Chen, C Wang, Y Xue, N Zhang, X Yang, Q Li… - arXiv preprint arXiv …, 2024 - arxiv.org
Despite significant strides in multimodal tasks, Multimodal Large Language Models (MLLMs)
are plagued by the critical issue of hallucination. The reliable detection of such …

Ranni: Taming text-to-image diffusion for accurate instruction following

Y Feng, B Gong, D Chen, Y Shen… - Proceedings of the …, 2024 - openaccess.thecvf.com
Existing text-to-image (T2I) diffusion models usually struggle to interpret complex prompts,
especially those with quantity, object-attribute binding, and multi-subject descriptions. In this …

T2V-CompBench: A comprehensive benchmark for compositional text-to-video generation

K Sun, K Huang, X Liu, Y Wu, Z Xu, Z Li… - arXiv preprint arXiv …, 2024 - arxiv.org
Text-to-video (T2V) generation models have advanced significantly, yet their ability to
compose different objects, attributes, actions, and motions into a video remains unexplored …