Datasets for large language models: A comprehensive survey

Y Liu, J Cao, C Liu, K Ding, L Jin - arXiv preprint arXiv:2402.18041, 2024 - arxiv.org
This paper embarks on an exploration into Large Language Model (LLM) datasets,
which play a crucial role in the remarkable advancements of LLMs. The datasets serve as …

A survey on data selection for language models

A Albalak, Y Elazar, SM Xie, S Longpre… - arXiv preprint arXiv …, 2024 - arxiv.org
A major factor in the recent success of large language models is the use of enormous and
ever-growing text datasets for unsupervised pre-training. However, naively training a model …
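
The snippet stops mid-sentence, but the core idea of data selection can be made concrete with a small sketch: a rule-based quality filter that decides which raw documents enter a pre-training corpus. The heuristics and thresholds below are invented for illustration and are not taken from the survey.

```python
# A minimal, hypothetical sketch of rule-based quality filtering, one common step in
# data selection for pre-training corpora. Heuristics and thresholds are illustrative
# assumptions, not the survey's recommendations.

def keep_document(text: str, min_words: int = 50, max_symbol_ratio: float = 0.1) -> bool:
    """Return True if a raw document passes simple quality heuristics."""
    words = text.split()
    if len(words) < min_words:                       # drop very short documents
        return False
    symbols = sum(ch in "#{}<>|" for ch in text)     # crude markup/noise indicator
    return symbols / max(len(text), 1) <= max_symbol_ratio

corpus = [
    "A long natural-language paragraph about language models. " * 20,
    "<html>{{#}}<div>|",                             # markup-heavy noise
]
selected = [doc for doc in corpus if keep_document(doc)]
print(f"kept {len(selected)} of {len(corpus)} documents")
```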

A survey of large language models

WX Zhao, K Zhou, J Li, T Tang… - arXiv preprint arXiv …, 2023 - paper-notes.zhjwpku.com
Ever since the Turing Test was proposed in the 1950s, humans have explored the mastery
of language intelligence by machines. Language is essentially a complex, intricate system of …

xLSTM: Extended long short-term memory

M Beck, K Pöppel, M Spanring, A Auer… - Advances in …, 2025 - proceedings.neurips.cc
In the 1990s, the constant error carousel and gating were introduced as the central ideas of
the Long Short-Term Memory (LSTM). Since then, LSTMs have stood the test of time and …
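
The two ideas the snippet names can be shown in a few lines: gating, and the constant error carousel, i.e. the additive cell-state update of the original LSTM. The sketch below is the classic formulation with arbitrary random weights, not the extended xLSTM variants the paper introduces.

```python
# A minimal sketch of one classic LSTM cell step, illustrating gating and the constant
# error carousel (the additive cell-state update c_t = f_t * c_{t-1} + i_t * g_t).
# Shapes and random weights are illustrative only; this is not the xLSTM formulation.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, b):
    """One step of an LSTM cell. W: (4*hidden, input_dim+hidden), b: (4*hidden,)."""
    z = W @ np.concatenate([x, h_prev]) + b
    i, f, o, g = np.split(z, 4)
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)   # input, forget, output gates
    g = np.tanh(g)                                 # candidate cell update
    c = f * c_prev + i * g                         # constant error carousel
    h = o * np.tanh(c)                             # gated hidden state
    return h, c

rng = np.random.default_rng(0)
input_dim, hidden = 3, 4
W = rng.normal(scale=0.1, size=(4 * hidden, input_dim + hidden))
b = np.zeros(4 * hidden)
h, c = np.zeros(hidden), np.zeros(hidden)
for x in rng.normal(size=(5, input_dim)):          # run five time steps
    h, c = lstm_step(x, h, c, W, b)
print(h)
```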

A pretrainer's guide to training data: Measuring the effects of data age, domain coverage, quality, & toxicity

S Longpre, G Yauney, E Reif, K Lee… - Proceedings of the …, 2024 - aclanthology.org
Pretraining data design is critically under-documented and often guided by empirically
unsupported intuitions. We pretrain models on data curated (1) at different collection …

RedPajama: an open dataset for training large language models

M Weber, D Fu, Q Anthony, Y Oren… - Advances in …, 2025 - proceedings.neurips.cc
Large language models are increasingly becoming a cornerstone technology in artificial
intelligence, the sciences, and society as a whole, yet the optimal strategies for dataset …

OpenMoE: An early effort on open mixture-of-experts language models

F Xue, Z Zheng, Y Fu, J Ni, Z Zheng, W Zhou… - arXiv preprint arXiv …, 2024 - arxiv.org
To help the open-source community have a better understanding of Mixture-of-Experts
(MoE) based large language models (LLMs), we train and release OpenMoE, a series of …
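
For readers unfamiliar with the MoE idea the paper builds on, a minimal sketch of top-k expert routing follows. The expert count, the value of k, and the linear experts are illustrative assumptions, not the OpenMoE configuration.

```python
# A minimal sketch of top-k expert routing, the core idea behind Mixture-of-Experts
# layers in MoE-based LLMs. All sizes and the linear experts are illustrative
# assumptions, not OpenMoE's actual architecture.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def moe_layer(x, router_W, expert_Ws, top_k=2):
    """Route a token embedding x to its top_k experts and mix their outputs."""
    logits = router_W @ x                      # router scores, one per expert
    top = np.argsort(logits)[-top_k:]          # indices of the top_k experts
    weights = softmax(logits[top])             # renormalize over selected experts
    return sum(w * (expert_Ws[e] @ x) for w, e in zip(weights, top))

rng = np.random.default_rng(0)
d_model, n_experts = 8, 4
x = rng.normal(size=d_model)
router_W = rng.normal(size=(n_experts, d_model))
expert_Ws = [rng.normal(size=(d_model, d_model)) for _ in range(n_experts)]
print(moe_layer(x, router_W, expert_Ws))
```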

D-CPT law: Domain-specific continual pre-training scaling law for large language models

H Que, J Liu, G Zhang, C Zhang, X Qu… - Advances in …, 2025 - proceedings.neurips.cc
Continual Pre-Training (CPT) on Large Language Models (LLMs) has been widely
used to expand the model's fundamental understanding of specific downstream domains …
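
A scaling law in this context is a fitted curve that predicts loss from a resource budget. The sketch below fits a generic Chinchilla-style power law L(D) = E + B/D^β to synthetic (tokens, loss) pairs; the functional form, constants, and data are assumptions for illustration, not the D-CPT law derived in the paper.

```python
# A hedged sketch of fitting a power-law scaling curve L(D) = E + B / D**beta to
# (token budget, validation loss) pairs. The functional form and the synthetic data
# are generic Chinchilla-style assumptions, not the paper's D-CPT law.
import numpy as np
from scipy.optimize import curve_fit

def scaling_law(D, E, B, beta):
    """Loss as a function of continual pre-training tokens D (in billions)."""
    return E + B / D**beta

# Synthetic observations: loss measured at several token budgets (billions of tokens).
tokens = np.array([0.1, 0.3, 1.0, 3.0, 10.0])
rng = np.random.default_rng(0)
loss = scaling_law(tokens, E=1.8, B=1.0, beta=0.3) + rng.normal(0.0, 0.01, tokens.size)

(E, B, beta), _ = curve_fit(scaling_law, tokens, loss, p0=(1.0, 1.0, 0.5))
print(f"fitted: irreducible loss E={E:.2f}, exponent beta={beta:.2f}")
print(f"extrapolated loss at 100B tokens: {scaling_law(100.0, E, B, beta):.3f}")
```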

A survey of multimodal large language model from a data-centric perspective

T Bai, H Liang, B Wan, Y Xu, X Li, S Li, L Yang… - arXiv preprint arXiv …, 2024 - arxiv.org
Multimodal large language models (MLLMs) enhance the capabilities of standard large
language models by integrating and processing data from multiple modalities, including text …

OpenELM: An efficient language model family with open training and inference framework

S Mehta, MH Sekhavat, Q Cao, M Horton… - Workshop on Efficient …, 2024 - openreview.net
The reproducibility and transparency of large language models are crucial for advancing
open research, ensuring the trustworthiness of results, and enabling investigations into data …