A survey on data selection for language models
A major factor in the recent success of large language models is the use of enormous and
ever-growing text datasets for unsupervised pre-training. However, naively training a model …
On-device language models: A comprehensive review
The advent of large language models (LLMs) revolutionized natural language processing
applications, and running LLMs on edge devices has become increasingly attractive for …
Molmo and PixMo: Open weights and open data for state-of-the-art multimodal models
Today's most advanced multimodal models remain proprietary. The strongest open-weight
models rely heavily on synthetic data from proprietary VLMs to achieve good performance …
Language models scale reliably with over-training and on downstream tasks
Scaling laws are useful guides for derisking expensive training runs, as they predict
performance of large models using cheaper, small-scale experiments. However, there …
Consent in crisis: The rapid decline of the AI data commons
General-purpose artificial intelligence (AI) systems are built on massive swathes of public
web data, assembled into corpora such as C4, RefinedWeb, and Dolma. To our knowledge …
Leave no context behind: Efficient infinite context transformers with infini-attention
This work introduces an efficient method to scale Transformer-based Large Language
Models (LLMs) to infinitely long inputs with bounded memory and computation. A key …
Generative language models exhibit social identity biases
Social identity biases, particularly the tendency to favor one's own group (ingroup solidarity)
and derogate other groups (outgroup hostility), are deeply rooted in human psychology and …
Generalization vs Memorization: Tracing Language Models' Capabilities Back to Pretraining Data
The impressive capabilities of large language models (LLMs) have sparked debate over
whether these models genuinely generalize to unseen tasks or predominantly rely on …
Position: Key claims in LLM research have a long tail of footnotes
Much of the recent discourse within the ML community has been centered around Large
Language Models (LLMs), their functionality and potential--yet not only do we not have a …
RedPajama: An open dataset for training large language models
Large language models are increasingly becoming a cornerstone technology in artificial
intelligence, the sciences, and society as a whole, yet the optimal strategies for dataset …