A survey on data selection for language models

A Albalak, Y Elazar, SM Xie, S Longpre… - arXiv preprint arXiv …, 2024 - arxiv.org
A major factor in the recent success of large language models is the use of enormous and
ever-growing text datasets for unsupervised pre-training. However, naively training a model …

A survey of large language models

WX Zhao, K Zhou, J Li, T Tang… - arXiv preprint arXiv …, 2023 - paper-notes.zhjwpku.com
Ever since the Turing Test was proposed in the 1950s, humans have explored the mastery
of language intelligence by machines. Language is essentially a complex, intricate system of …

Less: Selecting influential data for targeted instruction tuning

M Xia, S Malladi, S Gururangan, S Arora… - arXiv preprint arXiv …, 2024 - arxiv.org
Instruction tuning has unlocked powerful capabilities in large language models (LLMs),
effectively using combined datasets to develop general-purpose chatbots. However, real …

Foundational challenges in assuring alignment and safety of large language models

U Anwar, A Saparov, J Rando, D Paleka… - arXiv preprint arXiv …, 2024 - arxiv.org
This work identifies 18 foundational challenges in assuring the alignment and safety of large
language models (LLMs). These challenges are organized into three different categories …

The quantization model of neural scaling

E Michaud, Z Liu, U Girit… - Advances in Neural …, 2023 - proceedings.neurips.cc
We propose the Quantization Model of neural scaling laws, explaining both the
observed power law dropoff of loss with model and data size, and also the sudden …
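The power-law drop-off mentioned here is usually summarized with a Chinchilla-style fit; the form below is a generic illustration in LaTeX, with E, A, B, alpha, and beta as placeholder fit constants rather than the paper's own parameterization:

L(N, D) \approx E + A\,N^{-\alpha} + B\,D^{-\beta}

Here L is the pre-training loss, N the parameter count, D the number of training tokens, E an irreducible loss term, and alpha and beta fitted exponents.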

Not all tokens are what you need for pretraining

Z Lin, Z Gou, Y Gong, X Liu, R Xu… - Advances in …, 2025 - proceedings.neurips.cc
Previous language model pre-training methods have uniformly applied a next-token
prediction loss to all training tokens. Challenging this norm, we posit that "Not all tokens in a …

Rho-1: Not all tokens are what you need

Z Lin, Z Gou, Y Gong, X Liu, Y Shen, R Xu, C Lin… - arXiv preprint arXiv …, 2024 - arxiv.org
Previous language model pre-training methods have uniformly applied a next-token
prediction loss to all training tokens. Challenging this norm, we posit that "Not all tokens in a corpus are equally important for language model training". Our …
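Both entries above argue against applying the next-token loss uniformly to every training token. The sketch below illustrates one way to implement that idea in PyTorch, keeping only tokens with high "excess loss" relative to a reference model; the function name selective_lm_loss and the keep_fraction hyperparameter are assumptions for illustration, not the authors' exact Rho-1 procedure.

# Minimal sketch of a token-selective next-token loss (an illustration of the
# general idea in these entries, not the authors' exact Rho-1 procedure).
import torch
import torch.nn.functional as F

def selective_lm_loss(logits, ref_logits, labels, keep_fraction=0.6):
    """Average next-token loss over only the highest-excess-loss tokens.

    logits, ref_logits: (batch, seq_len, vocab) outputs of the trained and
    reference models; labels: (batch, seq_len) token ids.
    keep_fraction is a hypothetical hyperparameter for this sketch.
    """
    vocab = logits.size(-1)
    # Shift so that position t predicts token t+1, then flatten.
    shift_logits = logits[:, :-1].reshape(-1, vocab)
    shift_ref = ref_logits[:, :-1].reshape(-1, vocab)
    shift_labels = labels[:, 1:].reshape(-1)

    # Per-token cross-entropy under both models.
    loss_train = F.cross_entropy(shift_logits, shift_labels, reduction="none")
    loss_ref = F.cross_entropy(shift_ref, shift_labels, reduction="none")

    # "Excess loss": tokens the current model finds hard relative to the
    # reference model. Selection is detached so it only picks indices.
    excess = (loss_train - loss_ref).detach()
    k = max(1, int(keep_fraction * excess.numel()))
    _, keep_idx = torch.topk(excess, k)
    return loss_train[keep_idx].mean()

# Toy usage with random tensors standing in for model outputs.
B, T, V = 2, 16, 100
logits = torch.randn(B, T, V, requires_grad=True)
ref_logits = torch.randn(B, T, V)
labels = torch.randint(0, V, (B, T))
print(selective_lm_loss(logits, ref_logits, labels))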

DataComp-LM: In search of the next generation of training sets for language models

J Li, A Fang, G Smyrnis, M Ivgi, M Jordan… - arXiv preprint arXiv …, 2024 - arxiv.org
We introduce DataComp for Language Models (DCLM), a testbed for controlled dataset
experiments with the goal of improving language models. As part of DCLM, we provide a …

A tale of tails: Model collapse as a change of scaling laws

E Dohmatob, Y Feng, P Yang, F Charton… - arXiv preprint arXiv …, 2024 - arxiv.org
As AI model size grows, neural scaling laws have become a crucial tool to predict the
improvements of large models when increasing capacity and the size of original (human or …

DsDm: Model-aware dataset selection with datamodels

L Engstrom, A Feldmann, A Madry - arXiv preprint arXiv:2401.12926, 2024 - arxiv.org
When selecting data for training large-scale models, standard practice is to filter for
examples that match human notions of data quality. Such filtering yields qualitatively clean …
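The contrast drawn here is between heuristic quality filtering and model-aware selection. A minimal sketch of the model-aware pattern, keeping the top-k candidates by an estimated effect on a target loss, follows; estimated_effect and select_top_k are hypothetical names standing in for a datamodel-style estimator, not DsDm's actual implementation.

# Minimal sketch of model-aware data selection: rank candidate examples by an
# estimated effect on target loss and keep the best k (illustrative only).
import numpy as np

def select_top_k(estimated_effect, k):
    """Return indices of the k candidates predicted to reduce target loss most.

    estimated_effect[i] is a hypothetical datamodel-style estimate of how much
    including example i changes the target loss; lower (more negative) is better.
    """
    return np.argsort(estimated_effect)[:k]

# Toy usage: 10 candidate examples, keep the 3 with the best estimated effect.
rng = np.random.default_rng(0)
scores = rng.normal(size=10)
print(select_top_k(scores, k=3))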