CtrlSynth: Controllable Image Text Synthesis for Data-Efficient Multimodal Learning

Q Cao, M Najibi, S Mehta - arXiv preprint arXiv:2410.11963, 2024 - arxiv.org
Pretraining robust vision or multimodal foundation models (e.g., CLIP) relies on large-scale
datasets that may be noisy, potentially misaligned, and have long-tail distributions. Previous …

Economics of Sourcing Human Data

S Santy, P Bhattacharya, MH Ribeiro, K Allen… - arXiv preprint arXiv …, 2025 - arxiv.org
Progress in AI has relied on human-generated data, from annotator marketplaces to the
wider Internet. However, the widespread use of large language models now threatens the …

Evaluating Text-to-Image Diffusion Models for Texturing Synthetic Data

T Lips - arXiv preprint arXiv:2411.10164, 2024 - arxiv.org
Building generic robotic manipulation systems often requires large amounts of real-world
data, which can be difficult to collect. Synthetic data generation offers a promising alternative …