InstanceDiffusion: Instance-level control for image generation

X Wang, T Darrell, SS Rambhatla… - Proceedings of the …, 2024 - openaccess.thecvf.com
Text-to-image diffusion models produce high quality images but do not offer control over
individual instances in the image. We introduce InstanceDiffusion that adds precise instance …

Grounded text-to-image synthesis with attention refocusing

Q Phung, S Ge, JB Huang - … of the IEEE/CVF Conference on …, 2024 - openaccess.thecvf.com
Driven by scalable diffusion models trained on large-scale datasets, text-to-image
synthesis methods have shown compelling results. However, these models still fail to …

Direct-a-Video: Customized video generation with user-directed camera movement and object motion

S Yang, L Hou, H Huang, C Ma, P Wan… - ACM SIGGRAPH 2024 …, 2024 - dl.acm.org
Recent text-to-video diffusion models have achieved impressive progress. In practice, users
often desire the ability to control object motion and camera movement independently for …

CoMat: Aligning text-to-image diffusion model with image-to-text concept matching

D Jiang, G Song, X Wu, R Zhang… - Advances in …, 2025 - proceedings.neurips.cc
Diffusion models have demonstrated great success in the field of text-to-image generation.
However, alleviating the misalignment between the text prompts and images is still …

Be yourself: Bounded attention for multi-subject text-to-image generation

O Dahary, O Patashnik, K Aberman… - European Conference on …, 2024 - Springer
Text-to-image diffusion models have an unprecedented ability to generate diverse and high-
quality images. However, they often struggle to faithfully capture the intended semantics of …

ControlMLLM: Training-free visual prompt learning for multimodal large language models

M Wu, X Cai, J Ji, J Li, O Huang… - Advances in …, 2025 - proceedings.neurips.cc
In this work, we propose a training-free method to inject visual prompts into Multimodal
Large Language Models (MLLMs) through learnable latent variable optimization. We …

T2V-CompBench: A comprehensive benchmark for compositional text-to-video generation

K Sun, K Huang, X Liu, Y Wu, Z Xu, Z Li… - arXiv preprint arXiv …, 2024 - arxiv.org
Text-to-video (T2V) generation models have advanced significantly, yet their ability to
compose different objects, attributes, actions, and motions into a video remains unexplored …

PLACE: Adaptive layout-semantic fusion for semantic image synthesis

Z Lv, Y Wei, W Zuo, KYK Wong - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com
Recent advancements in large-scale pre-trained text-to-image models have led to
remarkable progress in semantic image synthesis. Nevertheless, synthesizing high-quality …

Neural Assets: 3D-aware multi-object scene synthesis with image diffusion models

Z Wu, Y Rubanova, R Kabra… - Advances in …, 2025 - proceedings.neurips.cc
We address the problem of multi-object 3D pose control in image diffusion models. Instead
of conditioning on a sequence of text tokens, we propose to use a set of per-object …

Controllable generation with text-to-image diffusion models: A survey

P Cao, F Zhou, Q Song, L Yang - arXiv preprint arXiv:2403.04279, 2024 - arxiv.org
In the rapidly advancing realm of visual generation, diffusion models have revolutionized the
landscape, marking a significant shift in capabilities with their impressive text-guided …