T2V-CompBench: A Comprehensive Benchmark for Compositional Text-to-Video Generation
Text-to-video (T2V) generation models have advanced significantly, yet their ability to
compose different objects, attributes, actions, and motions into a video remains unexplored …
Self-Correcting LLM-Controlled Diffusion Models
Text-to-image generation has witnessed significant progress with the advent of diffusion
models. Despite the ability to generate photorealistic images, current text-to-image diffusion …
SynthCLIP: Are We Ready for a Fully Synthetic CLIP Training?
We present SynthCLIP, a novel framework for training CLIP models with entirely synthetic
text-image pairs, significantly departing from previous methods relying on real data …
The Fabrication of Reality and Fantasy: Scene Generation with LLM-Assisted Prompt Interpretation
In spite of recent advancements in text-to-image generation, limitations persist in handling
complex and imaginative prompts due to the restricted diversity and complexity of training …
Auto Cherry-Picker: Learning from High-Quality Generative Data Driven by Language
Diffusion-based models have shown great potential in generating high-quality images with
various layouts, which can benefit downstream perception tasks. However, a fully automatic …
Local Conditional Controlling for Text-to-Image Diffusion Models
Diffusion models have exhibited impressive prowess in the text-to-image task. Recent
methods add image-level structure controls, e.g., edge and depth maps, to manipulate the …
LLMs Meet Multimodal Generation and Editing: A Survey
With the recent advancement in large language models (LLMs), there is a growing interest in
combining LLMs with multimodal learning. Previous surveys of multimodal large language …
Build-A-Scene: Interactive 3D Layout Control for Diffusion-Based Image Generation
We propose a diffusion-based approach for Text-to-Image (T2I) generation with interactive
3D layout control. Layout control has been widely studied to alleviate the shortcomings of …
T2I-CompBench++: An Enhanced and Comprehensive Benchmark for Compositional Text-to-Image Generation
Despite the impressive advances in text-to-image models, they often struggle to effectively
compose complex scenes with multiple objects, displaying various attributes and …
Rare-to-Frequent: Unlocking Compositional Generation Power of Diffusion Models on Rare Concepts with LLM Guidance
State-of-the-art text-to-image (T2I) diffusion models often struggle to generate rare
compositions of concepts, e.g., objects with unusual attributes. In this paper, we show that the …