CtrlSynth: Controllable Image Text Synthesis for Data-Efficient Multimodal Learning
Pretraining robust vision or multimodal foundation models (e.g., CLIP) relies on large-scale
datasets that may be noisy, potentially misaligned, and have long-tail distributions. Previous …
Economics of Sourcing Human Data
Progress in AI has relied on human-generated data, from annotator marketplaces to the
wider Internet. However, the widespread use of large language models now threatens the …
Evaluating Text-to-Image Diffusion Models for Texturing Synthetic Data
Building generic robotic manipulation systems often requires large amounts of real-world
data, which can be difficult to collect. Synthetic data generation offers a promising alternative …