Large language models for software engineering: A systematic literature review

X Hou, Y Zhao, Y Liu, Z Yang, K Wang, L Li… - ACM Transactions on …, 2024 - dl.acm.org
Large Language Models (LLMs) have significantly impacted numerous domains, including
Software Engineering (SE). Many recent publications have explored LLMs applied to …

A survey on data selection for language models

A Albalak, Y Elazar, SM Xie, S Longpre… - arXiv preprint arXiv …, 2024 - arxiv.org
A major factor in the recent success of large language models is the use of enormous and
ever-growing text datasets for unsupervised pre-training. However, naively training a model …

Doremi: Optimizing data mixtures speeds up language model pretraining

SM Xie, H Pham, X Dong, N Du, H Liu… - Advances in …, 2023 - proceedings.neurips.cc
The mixture proportions of pretraining data domains (e.g., Wikipedia, books, web text) greatly
affect language model (LM) performance. In this paper, we propose Domain Reweighting …
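
"Domain mixture proportions" here are simply the sampling weights over data sources during pretraining. The sketch below (Python, with invented domain names and weights) only illustrates drawing a batch according to such weights; it is not DoReMi's method, which learns the weights with a small proxy model and group distributionally robust optimization.

```python
import random

# Hypothetical domain corpora and mixture weights; DoReMi *learns* weights
# like these, which is not shown here.
domains = {
    "wikipedia": ["wiki doc 1", "wiki doc 2"],
    "books":     ["book excerpt 1", "book excerpt 2"],
    "web":       ["web page 1", "web page 2"],
}
mixture = {"wikipedia": 0.2, "books": 0.3, "web": 0.5}  # sums to 1

def sample_batch(batch_size: int, rng: random.Random) -> list[str]:
    """Draw a pretraining batch whose domain composition follows `mixture`."""
    names = list(mixture)
    weights = [mixture[d] for d in names]
    batch = []
    for _ in range(batch_size):
        domain = rng.choices(names, weights=weights, k=1)[0]
        batch.append(rng.choice(domains[domain]))
    return batch

print(sample_batch(4, random.Random(0)))
```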

Data selection for language models via importance resampling

SM Xie, S Santurkar, T Ma… - Advances in Neural …, 2023 - proceedings.neurips.cc
Selecting a suitable pretraining dataset is crucial for both general-domain (e.g., GPT-3) and
domain-specific (e.g., Codex) language models (LMs). We formalize this problem as selecting …
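
As a rough illustration of the importance-resampling idea, the toy sketch below scores each candidate document with a crude stand-in for the importance weight p_target(x) / p_raw(x) and then draws a weighted sample without replacement via the Gumbel-top-k trick. The keyword-based weight and the example documents are invented for illustration; the paper estimates weights over hashed n-gram features of a raw pool and a target corpus.

```python
import math
import random

# Toy raw pool and target profile (both invented for illustration).
raw_pool = [
    "generic web text about sports",
    "a python tutorial with code examples",
    "celebrity gossip article",
    "numpy broadcasting explained with examples",
]
target_keywords = {"python", "numpy", "code", "examples"}  # stand-in for a target distribution

def importance_weight(doc: str) -> float:
    """Crude stand-in for p_target(x) / p_raw(x): favour target-like tokens."""
    hits = sum(tok in target_keywords for tok in doc.lower().split())
    return math.exp(hits)

def resample(pool: list[str], k: int, seed: int = 0) -> list[str]:
    """Weighted sampling without replacement via the Gumbel-top-k trick:
    rank documents by log(weight) + Gumbel noise and keep the top k."""
    rng = random.Random(seed)
    keyed = []
    for doc in pool:
        u = max(rng.random(), 1e-12)          # avoid log(0)
        gumbel = -math.log(-math.log(u))
        keyed.append((math.log(importance_weight(doc)) + gumbel, doc))
    keyed.sort(reverse=True)
    return [doc for _, doc in keyed[:k]]

print(resample(raw_pool, k=2))
```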

On the dangers of stochastic parrots: Can language models be too big?🦜

EM Bender, T Gebru, A McMillan-Major… - Proceedings of the 2021 …, 2021 - dl.acm.org
The past 3 years of work in NLP have been characterized by the development and
deployment of ever larger language models, especially for English. BERT, its variants, GPT …

A survey on curriculum learning

X Wang, Y Chen, W Zhu - IEEE Transactions on Pattern Analysis …, 2021 - ieeexplore.ieee.org
Curriculum learning (CL) is a training strategy that trains a machine learning model from
easier data to harder data, which imitates the meaningful learning order in human curricula …
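
The easy-to-hard training order described in the snippet can be made concrete with a short sketch. The difficulty proxy (sentence length) and the staged schedule below are assumptions chosen for illustration, not the survey's prescription; real curricula use task-specific difficulty measures and pacing functions.

```python
def difficulty(example: str) -> float:
    """Toy difficulty proxy: longer sentences are treated as harder."""
    return len(example.split())

def curriculum_stages(dataset: list[str], num_stages: int = 3) -> list[list[str]]:
    """Stage k contains the easiest k/num_stages fraction of the data,
    so the model sees progressively harder examples as training proceeds."""
    ordered = sorted(dataset, key=difficulty)
    stages = []
    for k in range(1, num_stages + 1):
        cutoff = round(len(ordered) * k / num_stages)
        stages.append(ordered[:cutoff])
    return stages

data = [
    "short text",
    "a somewhat longer training sentence",
    "an even longer and presumably harder example sentence for the model",
    "tiny",
]
for i, stage in enumerate(curriculum_stages(data), start=1):
    print(f"stage {i}: {len(stage)} examples")
```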

Don't stop pretraining: Adapt language models to domains and tasks

S Gururangan, A Marasović, S Swayamdipta… - arXiv preprint arXiv …, 2020 - arxiv.org
Language models pretrained on text from a wide variety of sources form the foundation of
today's NLP. In light of the success of these broad-coverage models, we investigate whether …

Skill-it! a data-driven skills framework for understanding and training language models

M Chen, N Roberts, K Bhatia, J Wang… - Advances in …, 2023 - proceedings.neurips.cc
The quality of training data impacts the performance of pre-trained large language models
(LMs). Given a fixed budget of tokens, we study how to best select data that leads to good …

Survey of low-resource machine translation

B Haddow, R Bawden, AVM Barone, J Helcl… - Computational …, 2022 - direct.mit.edu
We present a survey covering the state of the art in low-resource machine translation (MT)
research. There are currently around 7,000 languages spoken in the world and almost all …

Neural unsupervised domain adaptation in NLP---a survey

A Ramponi, B Plank - arXiv preprint arXiv:2006.00632, 2020 - arxiv.org
Deep neural networks excel at learning from labeled data and achieve state-of-the-art
results on a wide array of Natural Language Processing tasks. In contrast, learning from …