DoReMi: Optimizing data mixtures speeds up language model pretraining

SM Xie, H Pham, X Dong, N Du, H Liu… - Advances in …, 2023 - proceedings.neurips.cc
The mixture proportions of pretraining data domains (e.g., Wikipedia, books, web text) greatly
affect language model (LM) performance. In this paper, we propose Domain Reweighting …
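
A minimal sketch of the Group-DRO-style reweighting step that DoReMi's abstract alludes to: domains where a small proxy model underperforms a reference model get upweighted, and the weights averaged over training become the new mixture. The function name, update-rule details, and hyperparameters are illustrative assumptions, not the paper's released code.

```python
import numpy as np

def doremi_weight_update(domain_weights, proxy_losses, ref_losses,
                         step_size=1.0, smoothing=1e-3):
    """One exponentiated-gradient update of the domain mixture weights.

    Domains with high "excess loss" (proxy loss above the reference
    model's loss) are upweighted, as in DoReMi's inner loop.
    All arrays have shape (num_domains,).
    """
    excess = np.maximum(proxy_losses - ref_losses, 0.0)
    logits = np.log(domain_weights) + step_size * excess
    w = np.exp(logits - logits.max())
    w /= w.sum()
    # Mix with the uniform distribution for stability, then renormalize.
    u = np.ones_like(w) / len(w)
    w = (1 - smoothing) * w + smoothing * u
    return w / w.sum()

# Hypothetical three-domain example: web text, books, Wikipedia.
weights = np.ones(3) / 3
weights = doremi_weight_update(weights,
                               proxy_losses=np.array([3.2, 2.9, 2.5]),
                               ref_losses=np.array([2.8, 2.9, 2.6]))
print(weights)  # web text, with the largest excess loss, is upweighted
```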

Data selection for language models via importance resampling

SM Xie, S Santurkar, T Ma… - Advances in Neural …, 2023 - proceedings.neurips.cc
Selecting a suitable pretraining dataset is crucial for both general-domain (e.g., GPT-3) and
domain-specific (e.g., Codex) language models (LMs). We formalize this problem as selecting …
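
A minimal sketch of importance resampling for data selection, assuming per-example log-likelihoods under a target-domain model and a raw-pool model are already available (the paper estimates these with hashed n-gram features; that part is omitted here). Sampling without replacement uses the standard Gumbel-top-k trick.

```python
import numpy as np

def importance_resample(log_p_target, log_p_raw, k, rng=None):
    """Pick k examples from a raw pool, biased toward the target domain.

    The log importance weight of each example is
    log p_target(x) - log p_raw(x); perturbing with Gumbel noise and
    taking the top k samples without replacement in proportion
    to the importance weights.
    """
    rng = rng or np.random.default_rng(0)
    log_w = log_p_target - log_p_raw
    gumbel = rng.gumbel(size=log_w.shape)
    return np.argsort(log_w + gumbel)[-k:]

# Hypothetical pool of 10 documents; keep the 3 most target-like.
idx = importance_resample(np.random.randn(10), np.random.randn(10), k=3)
print(idx)
```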

LESS: Selecting influential data for targeted instruction tuning

M Xia, S Malladi, S Gururangan, S Arora… - arXiv preprint arXiv …, 2024 - arxiv.org
Instruction tuning has unlocked powerful capabilities in large language models (LLMs),
effectively using combined datasets to develop general-purpose chatbots. However, real …
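
LESS scores candidates by gradient similarity to a handful of target-task examples. A minimal sketch, assuming the per-example gradient features (in the paper, low-dimensional projections of LoRA/Adam gradients) are already computed; the tensor shapes and top-k fraction are illustrative.

```python
import torch

def less_style_scores(train_grads, target_grads):
    """Max cosine similarity of each training gradient to any target gradient.

    train_grads: (N, d) gradient features for candidate training examples.
    target_grads: (M, d) gradient features for target-task validation examples.
    The highest-scoring candidates are selected for instruction tuning.
    """
    train = torch.nn.functional.normalize(train_grads, dim=1)
    target = torch.nn.functional.normalize(target_grads, dim=1)
    return (train @ target.T).max(dim=1).values

# Hypothetical features: 1000 candidates, 8 target examples, 128 dims.
scores = less_style_scores(torch.randn(1000, 128), torch.randn(8, 128))
top_k = scores.topk(100).indices  # keep the ~10% most influential examples
```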

Data-efficient Fine-tuning for LLM-based Recommendation

X Lin, W Wang, Y Li, S Yang, F Feng, Y Wei… - Proceedings of the 47th …, 2024 - dl.acm.org
Leveraging Large Language Models (LLMs) for recommendation has recently garnered
considerable attention, with fine-tuning playing a key role in adapting LLMs. However, the …
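
The snippet truncates before describing the paper's selection criteria, so the following is only a generic illustration of data-efficient fine-tuning: rank samples by an influence-style term (alignment with the pool's mean gradient) plus an effort-style term (gradient norm) and keep the top few hundred. All names and the weighting are assumptions, not the paper's algorithm.

```python
import torch

def select_finetune_subset(per_example_grads, k, alpha=0.5):
    """Rank fine-tuning samples by a combined influence/effort-style score."""
    mean_grad = per_example_grads.mean(dim=0, keepdim=True)
    influence = (per_example_grads * mean_grad).sum(dim=1)  # alignment term
    effort = per_example_grads.norm(dim=1)                  # hard-sample term
    score = alpha * influence + (1 - alpha) * effort
    return score.topk(k).indices

# Hypothetical pool: 5000 interaction samples, 64-dim gradient features.
idx = select_finetune_subset(torch.randn(5000, 64), k=512)
```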

Dataset quantization

D Zhou, K Wang, J Gu, X Peng, D Lian… - Proceedings of the …, 2023 - openaccess.thecvf.com
State-of-the-art deep neural networks are trained with large amounts of data (millions or
even billions of samples). The expensive computation and memory costs make it difficult to train them …
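
A rough sketch of the dataset-quantization recipe: carve the pool into non-overlapping bins, then sample uniformly from every bin so the kept subset covers the whole distribution. A simple farthest-point heuristic stands in for the paper's submodular bin construction, and all sizes and features are illustrative.

```python
import numpy as np

def quantize_dataset(features, num_bins, per_bin):
    """Carve a pool into non-overlapping bins, then sample evenly from each."""
    rng = np.random.default_rng(0)
    remaining = list(range(len(features)))
    kept = []
    bin_size = len(features) // num_bins
    for _ in range(num_bins):
        # Farthest-point traversal over the remaining pool builds one bin.
        pool = np.array(remaining)
        start = pool[0]
        bin_idx = [start]
        dists = np.linalg.norm(features[pool] - features[start], axis=1)
        for _ in range(bin_size - 1):
            nxt = pool[int(dists.argmax())]
            bin_idx.append(nxt)
            dists = np.minimum(
                dists, np.linalg.norm(features[pool] - features[nxt], axis=1))
        # Uniform sampling per bin keeps coverage of the whole distribution.
        kept.extend(rng.choice(bin_idx, size=per_bin, replace=False))
        remaining = [i for i in remaining if i not in set(bin_idx)]
    return np.array(kept)

idx = quantize_dataset(np.random.randn(1000, 16), num_bins=10, per_bin=10)
```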

DeepCore: A comprehensive library for coreset selection in deep learning

C Guo, B Zhao, Y Bai - International Conference on Database and Expert …, 2022 - Springer
Coreset selection, which aims to select a subset of the most informative training samples, is
a long-standing learning problem that can benefit many downstream tasks such as data …
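
As one concrete example of the classic methods such a library collects, here is k-Center Greedy coreset selection: repeatedly add the point farthest from the current selection, so the coreset covers the feature space. The features and budget are illustrative; this is a sketch of the technique, not DeepCore's API.

```python
import numpy as np

def k_center_greedy(features, budget, seed=0):
    """Select a coreset by greedily covering the feature space."""
    rng = np.random.default_rng(seed)
    n = len(features)
    first = int(rng.integers(n))
    selected = [first]
    # dists[i] = distance from point i to its nearest selected point.
    dists = np.linalg.norm(features - features[first], axis=1)
    while len(selected) < budget:
        nxt = int(dists.argmax())  # farthest point from the selection
        selected.append(nxt)
        dists = np.minimum(dists,
                           np.linalg.norm(features - features[nxt], axis=1))
    return np.array(selected)

coreset = k_center_greedy(np.random.randn(2000, 32), budget=100)
```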

Quality not quantity: On the interaction between dataset design and robustness of CLIP

T Nguyen, G Ilharco, M Wortsman… - Advances in Neural …, 2022 - proceedings.neurips.cc
Web-crawled datasets have enabled remarkable generalization capabilities in recent image-
text models such as CLIP (Contrastive Language-Image pre-training) or Flamingo, but little …

Data distillation: A survey

N Sachdeva, J McAuley - arXiv preprint arXiv:2301.04272, 2023 - arxiv.org
The popularity of deep learning has led to the curation of a vast number of massive and
multifarious datasets. Despite having close-to-human performance on individual tasks …

GCR: Gradient coreset based replay buffer selection for continual learning

R Tiwari, K Killamsetty, R Iyer… - Proceedings of the …, 2022 - openaccess.thecvf.com
Continual learning (CL) aims to develop techniques by which a single model adapts to an
increasing number of tasks encountered sequentially, thereby potentially leveraging …
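
A minimal sketch of the gradient-coreset idea behind GCR: choose a replay buffer whose gradients approximate the gradient of all data seen so far. The paper solves a weighted selection objective; this sketch substitutes a plain matching-pursuit greedy loop over precomputed per-example gradients, so it approximates the formulation rather than reproducing the authors' algorithm.

```python
import torch

def gradient_coreset_buffer(grads, budget):
    """Greedy (matching-pursuit) selection of a gradient-matching buffer.

    grads: (N, d) per-example gradients. The buffer grows one example at
    a time, always taking the gradient most aligned with the part of the
    mean gradient that the buffer does not yet explain.
    """
    target = grads.mean(dim=0)      # gradient of the full data, up to scale
    residual = target.clone()
    mask = torch.zeros(len(grads), dtype=torch.bool)
    chosen = []
    for _ in range(budget):
        sims = grads @ residual
        sims[mask] = -float("inf")  # never pick the same example twice
        i = int(sims.argmax())
        chosen.append(i)
        mask[i] = True
        g = grads[i]
        # Remove the component of the residual explained by the new pick.
        residual = residual - (residual @ g) / (g @ g) * g
    return chosen

buffer = gradient_coreset_buffer(torch.randn(500, 64), budget=32)
```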

Condensing graphs via one-step gradient matching

W Jin, X Tang, H Jiang, Z Li, D Zhang, J Tang… - Proceedings of the 28th …, 2022 - dl.acm.org
As training deep learning models on large datasets takes a lot of time and resources, it is
desirable to construct a small synthetic dataset with which we can train deep learning models …
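
A toy sketch of one-step gradient matching: optimize a small synthetic set so that, at a freshly initialized model, its gradient matches the gradient produced by the real data. The paper applies this to graphs with a GNN; a linear classifier is substituted here so the example runs standalone, and all sizes are illustrative.

```python
import torch

torch.manual_seed(0)
# Real data: 512 points with a simple linearly separable label.
x_real = torch.randn(512, 20)
y_real = (x_real[:, 0] > 0).long()

# Synthetic set: 16 learnable points with fixed random labels.
x_syn = torch.randn(16, 20, requires_grad=True)
y_syn = torch.randint(0, 2, (16,))
opt = torch.optim.Adam([x_syn], lr=0.01)

for step in range(200):
    # Fresh model initialization each step ("one-step" matching).
    w = torch.randn(20, 2, requires_grad=True)
    loss_real = torch.nn.functional.cross_entropy(x_real @ w, y_real)
    g_real = torch.autograd.grad(loss_real, w)[0].detach()
    loss_syn = torch.nn.functional.cross_entropy(x_syn @ w, y_syn)
    g_syn = torch.autograd.grad(loss_syn, w, create_graph=True)[0]
    # Minimize the cosine distance between synthetic and real gradients.
    match = 1 - torch.nn.functional.cosine_similarity(
        g_syn.flatten(), g_real.flatten(), dim=0)
    opt.zero_grad()
    match.backward()
    opt.step()
```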