Task-adaptive pretrained language models via clustered-importance sampling

D Grangier, S Fan, S Seto, P Ablin - arXiv preprint arXiv:2410.03735, 2024 - arxiv.org
Specialist language models (LMs) focus on a specific task or domain on which they often
outperform generalist LMs of the same size. However, the specialist data needed to pretrain …

Data Mixture Inference: What do BPE Tokenizers Reveal about their Training Data?

J Hayase, A Liu, Y Choi, S Oh, NA Smith - arXiv preprint arXiv:2407.16607, 2024 - arxiv.org
The pretraining data of today's strongest language models is opaque; in particular, little is
known about the proportions of various domains or languages represented. In this work, we …

Collapse or Thrive? Perils and Promises of Synthetic Data in a Self-Generating World

J Kazdan, R Schaeffer, A Dey, M Gerstgrasser… - arXiv preprint arXiv …, 2024 - arxiv.org
The increasing presence of AI-generated content on the internet raises a critical question:
What happens when generative machine learning models are pretrained on web-scale …

Data Mixture Inference Attack: BPE Tokenizers Reveal Training Data Compositions

J Hayase, A Liu, Y Choi, S Oh… - The Thirty-eighth Annual …, 2024 - openreview.net
The pretraining data of today's strongest language models remains opaque, even when their
parameters are open-sourced. In particular, little is known about the proportions of different …

Optimizing Pretraining Data Mixtures with LLM-Estimated Utility

W Held, B Paranjape, PS Koura, M Lewis… - arXiv preprint arXiv …, 2025 - arxiv.org
Large Language Models improve with increasing amounts of high-quality training data.
However, leveraging larger datasets requires balancing quality, quantity, and diversity …

Surveying the Effects of Quality, Diversity, and Complexity in Synthetic Data From Large Language Models

A Havrilla, A Dai, L O'Mahony, K Oostermeijer… - arXiv preprint arXiv …, 2024 - arxiv.org
Synthetic data generation with Large Language Models is a promising paradigm for
augmenting natural data over a nearly infinite range of tasks. Given this variety, direct …

Accumulating Data Avoids Model Collapse

J Kazdan, A Dey, R Schaeffer, M Gerstgrasser… - NeurIPS 2024 Workshop … - openreview.net
The increasing prevalence of AI-generated content on the internet raises a critical and timely
question: What happens when generative machine learning models are pretrained on web …