T2V-CompBench: A comprehensive benchmark for compositional text-to-video generation

K Sun, K Huang, X Liu, Y Wu, Z Xu, Z Li… - arXiv preprint arXiv …, 2024 - arxiv.org
Text-to-video (T2V) generation models have advanced significantly, yet their ability to
compose different objects, attributes, actions, and motions into a video remains unexplored …

Self-correcting llm-controlled diffusion models

TH Wu, L Lian, JE Gonzalez, B Li… - Proceedings of the …, 2024 - openaccess.thecvf.com
Text-to-image generation has witnessed significant progress with the advent of diffusion
models. Despite the ability to generate photorealistic images, current text-to-image diffusion …

SynthCLIP: Are We Ready for a Fully Synthetic CLIP Training?

HAAK Hammoud, H Itani, F Pizzati, P Torr… - arXiv preprint arXiv …, 2024 - arxiv.org
We present SynthCLIP, a novel framework for training CLIP models with entirely synthetic
text-image pairs, significantly departing from previous methods relying on real data …

The Fabrication of Reality and Fantasy: Scene Generation with LLM-Assisted Prompt Interpretation

Y Yao, CF Hsu, JH Lin, H Xie, T Lin, YN Huang… - … on Computer Vision, 2024 - Springer
In spite of recent advancements in text-to-image generation, limitations persist in handling
complex and imaginative prompts due to the restricted diversity and complexity of training …

Auto cherry-picker: Learning from high-quality generative data driven by language

Y Chen, X Li, Y Li, Y Zeng, J Wu, X Zhao… - arXiv preprint arXiv …, 2024 - arxiv.org
Diffusion-based models have shown great potential in generating high-quality images with
various layouts, which can benefit downstream perception tasks. However, a fully automatic …

Local conditional controlling for text-to-image diffusion models

Y Zhao, L Peng, Y Yang, Z Luo, H Li, Y Chen… - arXiv preprint arXiv …, 2023 - arxiv.org
Diffusion models have exhibited impressive prowess in the text-to-image task. Recent
methods add image-level structure controls, e.g., edge and depth maps, to manipulate the …

LLMs Meet Multimodal Generation and Editing: A Survey

Y He, Z Liu, J Chen, Z Tian, H Liu, X Chi, R Liu… - arXiv preprint arXiv …, 2024 - arxiv.org
With the recent advancement in large language models (LLMs), there is a growing interest in
combining LLMs with multimodal learning. Previous surveys of multimodal large language …

Build-a-Scene: Interactive 3D layout control for diffusion-based image generation

A Eldesokey, P Wonka - arXiv preprint arXiv:2408.14819, 2024 - arxiv.org
We propose a diffusion-based approach for Text-to-Image (T2I) generation with interactive
3D layout control. Layout control has been widely studied to alleviate the shortcomings of …

T2I-CompBench++: An Enhanced and Comprehensive Benchmark for Compositional Text-to-Image Generation

K Huang, C Duan, K Sun, E Xie, Z Li… - IEEE Transactions on …, 2025 - ieeexplore.ieee.org
Despite the impressive advances in text-to-image models, they often struggle to effectively
compose complex scenes with multiple objects, displaying various attributes and …

Rare-to-Frequent: Unlocking Compositional Generation Power of Diffusion Models on Rare Concepts with LLM Guidance

D Park, S Kim, T Moon, M Kim, K Lee, J Cho - arXiv preprint arXiv …, 2024 - arxiv.org
State-of-the-art text-to-image (T2I) diffusion models often struggle to generate rare
compositions of concepts, e.g., objects with unusual attributes. In this paper, we show that the …