Task-adaptive pretrained language models via clustered-importance sampling
Specialist language models (LMs) focus on a specific task or domain on which they often
outperform generalist LMs of the same size. However, the specialist data needed to pretrain …
Data Mixture Inference: What do BPE Tokenizers Reveal about their Training Data?
The pretraining data of today's strongest language models is opaque; in particular, little is
known about the proportions of various domains or languages represented. In this work, we …
Collapse or Thrive? Perils and Promises of Synthetic Data in a Self-Generating World
The increasing presence of AI-generated content on the internet raises a critical question:
What happens when generative machine learning models are pretrained on web-scale …
Data Mixture Inference Attack: BPE Tokenizers Reveal Training Data Compositions
The pretraining data of today's strongest language models remains opaque, even when their
parameters are open-sourced. In particular, little is known about the proportions of different …
Optimizing Pretraining Data Mixtures with LLM-Estimated Utility
Large Language Models improve with increasing amounts of high-quality training data.
However, leveraging larger datasets requires balancing quality, quantity, and diversity …
Surveying the Effects of Quality, Diversity, and Complexity in Synthetic Data From Large Language Models
Synthetic data generation with Large Language Models is a promising paradigm for
augmenting natural data over a nearly infinite range of tasks. Given this variety, direct …
Accumulating Data Avoids Model Collapse
The increasing prevalence of AI-generated content on the internet raises a critical and timely
question: What happens when generative machine learning models are pretrained on web …