A survey of large language models

WX Zhao, K Zhou, J Li, T Tang… - arXiv preprint arXiv …, 2023 - paper-notes.zhjwpku.com
Ever since the Turing Test was proposed in the 1950s, humans have explored the mastery
of language intelligence by machines. Language is essentially a complex, intricate system of …

Scaling Laws for Data Filtering--Data Curation cannot be Compute Agnostic

S Goyal, P Maini, ZC Lipton… - Proceedings of the …, 2024 - openaccess.thecvf.com
Vision-language models (VLMs) are trained for thousands of GPU hours on carefully
selected subsets of massive web scrapes. For instance, the LAION public dataset retained …
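
As a minimal, hypothetical sketch of the filtering setup in Python: a quality scorer ranks the candidate pool and a retention fraction decides how much survives, the paper's point being that the best fraction should depend on the training compute budget. The scorer and the two budgets below are illustrative assumptions, not the authors' pipeline.

    def filter_pool(examples, quality_score, retained_fraction):
        """Keep the top `retained_fraction` of examples by quality score."""
        ranked = sorted(examples, key=quality_score, reverse=True)
        keep = max(1, int(len(ranked) * retained_fraction))
        return ranked[:keep]

    # Hypothetical scorer: document length as a stand-in for a real quality model
    # (e.g., a CLIP-score or perplexity filter in practice).
    pool = ["short doc", "a somewhat longer document", "a long, detailed, high-quality document"]
    low_compute_subset = filter_pool(pool, quality_score=len, retained_fraction=0.3)   # filter aggressively
    high_compute_subset = filter_pool(pool, quality_score=len, retained_fraction=0.8)  # keep more to limit repetition
    print(low_compute_subset)
    print(high_compute_subset)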

DataComp-LM: In search of the next generation of training sets for language models

J Li, A Fang, G Smyrnis, M Ivgi, M Jordan… - arXiv preprint arXiv …, 2024 - arxiv.org
We introduce DataComp for Language Models (DCLM), a testbed for controlled dataset
experiments with the goal of improving language models. As part of DCLM, we provide a …

Scaling synthetic data creation with 1,000,000,000 personas

T Ge, X Chan, X Wang, D Yu, H Mi, D Yu - arXiv preprint arXiv:2406.20094, 2024 - arxiv.org
We propose a novel persona-driven data synthesis methodology that leverages various
perspectives within a large language model (LLM) to create diverse synthetic data. To fully …
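
As a minimal sketch of persona-conditioned synthesis, assuming a generic LLM call: the same task prompt is paired with different personas so that the completions vary in perspective. The persona strings, the prompt template, and the generate stub are illustrative assumptions, not Persona Hub's actual pipeline.

    personas = [
        "a pediatric nurse explaining ideas to worried parents",
        "a competitive programmer who optimizes for runtime",
        "a medieval historian specializing in trade routes",
    ]

    def build_prompt(persona, task):
        return f"You are {persona}. {task}"

    def generate(prompt):
        # Placeholder for a call to whatever LLM API is available.
        return f"<completion for: {prompt}>"

    task = "Write a challenging math word problem."
    synthetic_examples = [generate(build_prompt(p, task)) for p in personas]
    for example in synthetic_examples:
        print(example)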

CinePile: A long video question answering dataset and benchmark

R Rawal, K Saifullah, M Farré, R Basri… - arXiv preprint arXiv …, 2024 - arxiv.org
Current datasets for long-form video understanding often fall short of providing genuine long-
form comprehension challenges, as many tasks derived from these datasets can be …

Zamba: A compact 7B SSM hybrid model

P Glorioso, Q Anthony, Y Tokpanov… - arXiv preprint arXiv …, 2024 - arxiv.org
In this technical report, we present Zamba, a novel 7B SSM-transformer hybrid model which
achieves competitive performance against leading open-weight models at a comparable …

Reverse training to nurse the reversal curse

O Golovneva, Z Allen-Zhu, J Weston… - arXiv preprint arXiv …, 2024 - arxiv.org
Large language models (LLMs) have a surprising failure: when trained on "A has a feature
B", they do not generalize to "B is a feature of A", which is termed the Reversal Curse. Even …

Instruction pre-training: Language models are supervised multitask learners

D Cheng, Y Gu, S Huang, J Bi, M Huang… - arXiv preprint arXiv …, 2024 - arxiv.org
Unsupervised multitask pre-training has been the critical method behind the recent success
of language models (LMs). However, supervised multitask learning still holds significant …
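
A minimal sketch of instruction-augmented pre-training data construction, with a stub standing in for the paper's learned instruction synthesizer: each raw document is extended with instruction-response pairs grounded in it, and the result is used as ordinary pretraining text.

    def synthesize_pairs(document):
        # Placeholder: in the paper this role is played by a fine-tuned synthesizer
        # model that reads the document and emits grounded instruction-response pairs.
        return [("Summarize the passage in one sentence.", document[:60] + " ...")]

    def augment(document):
        pairs = synthesize_pairs(document)
        qa_block = "\n".join(f"Instruction: {q}\nResponse: {a}" for q, a in pairs)
        return f"{document}\n{qa_block}"

    raw_docs = ["Photosynthesis converts light energy into chemical energy stored as glucose."]
    pretraining_corpus = [augment(d) for d in raw_docs]
    print(pretraining_corpus[0])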

A survey on data synthesis and augmentation for large language models

K Wang, J Zhu, M Ren, Z Liu, S Li, Z Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
The success of Large Language Models (LLMs) is inherently linked to the availability of vast,
diverse, and high-quality data for training and evaluation. However, the growth rate of high …

MATES: Model-aware data selection for efficient pretraining with data influence models

Z Yu, S Das, C Xiong - Advances in Neural Information …, 2025 - proceedings.neurips.cc
Pretraining data selection has the potential to improve language model pretraining efficiency
by utilizing higher-quality data from massive web data corpora. Current data selection …
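
A minimal sketch of influence-model-based selection, with a random stub standing in for the learned data influence model: candidates are scored for their estimated benefit to the current main model and the top-scoring ones are chosen for the next pretraining stage. MATES additionally refreshes the influence model as the main model evolves, which is omitted here.

    import random

    def influence_score(example):
        # Placeholder for a learned data influence model's prediction of how much
        # this example would improve the current main model.
        return random.random()

    def select_batch(pool, k):
        """Pick the k candidates with the highest predicted influence."""
        return sorted(pool, key=influence_score, reverse=True)[:k]

    pool = [f"web document {i}" for i in range(100)]
    next_stage_data = select_batch(pool, k=10)
    print(next_stage_data[:3])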