Large language models for software engineering: A systematic literature review

X Hou, Y Zhao, Y Liu, Z Yang, K Wang, L Li… - ACM Transactions on …, 2024 - dl.acm.org
Large Language Models (LLMs) have significantly impacted numerous domains, including
Software Engineering (SE). Many recent publications have explored LLMs applied to …

A systematic literature review on source code similarity measurement and clone detection: Techniques, applications, and challenges

M Zakeri-Nasrabadi, S Parsa, M Ramezani… - Journal of Systems and …, 2023 - Elsevier
Measuring and evaluating source code similarity is a fundamental software engineering
activity that embraces a broad range of applications, including but not limited to code …

The Stack: 3 TB of permissively licensed source code

D Kocetkov, R Li, LB Allal, J Li, C Mou… - arXiv preprint arXiv …, 2022 - arxiv.org
Large Language Models (LLMs) play an ever-increasing role in the field of Artificial
Intelligence (AI)--not only for natural language processing but also for code understanding …

CodeT5: Identifier-aware unified pre-trained encoder-decoder models for code understanding and generation

Y Wang, W Wang, S Joty, SCH Hoi - arXiv preprint arXiv:2109.00859, 2021 - arxiv.org
Pre-trained models for Natural Languages (NL) like BERT and GPT have been recently
shown to transfer well to Programming Languages (PL) and largely benefit a broad set of …

SantaCoder: don't reach for the stars!

LB Allal, R Li, D Kocetkov, C Mou, C Akiki… - arXiv preprint arXiv …, 2023 - arxiv.org
The BigCode project is an open-scientific collaboration working on the responsible
development of large language models for code. This tech report describes the progress of …

Efficient training of language models to fill in the middle

M Bavarian, H Jun, N Tezak, J Schulman… - arXiv preprint arXiv …, 2022 - arxiv.org
We show that autoregressive language models can learn to infill text after we apply a
straightforward transformation to the dataset, which simply moves a span of text from the …
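The transformation described in the snippet can be sketched as follows. The split fractions and the `<PRE>`/`<SUF>`/`<MID>` sentinel strings are illustrative assumptions for this sketch, not the paper's exact token vocabulary; the core idea is simply cutting a document into prefix, middle, and suffix spans and reordering them so the middle comes last.

```python
def fim_transform(document: str, prefix_frac: float = 0.4, middle_frac: float = 0.3) -> str:
    """Rearrange a document into prefix-suffix-middle (FIM) order.

    An autoregressive model trained on such examples learns to generate the
    middle span conditioned on both the preceding and following context.
    Sentinel strings here are placeholders, not the paper's actual tokens.
    """
    n = len(document)
    p_end = int(n * prefix_frac)            # end of the prefix span
    m_end = p_end + int(n * middle_frac)    # end of the (moved) middle span
    prefix = document[:p_end]
    middle = document[p_end:m_end]
    suffix = document[m_end:]
    return f"<PRE>{prefix}<SUF>{suffix}<MID>{middle}"
```

At inference time the prompt would end at the `<MID>` sentinel, so ordinary left-to-right decoding fills in the missing span.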

Unsupervised translation of programming languages

B Roziere, MA Lachaux… - Advances in neural …, 2020 - proceedings.neurips.cc
A transcompiler, also known as source-to-source translator, is a system that converts source
code from a high-level programming language (such as C++ or Python) to another …

NatGen: generative pre-training by “naturalizing” source code

S Chakraborty, T Ahmed, Y Ding, PT Devanbu… - Proceedings of the 30th …, 2022 - dl.acm.org
Pre-trained generative language models (e.g., PLBART, CodeT5, SPT-Code) for source
code have yielded strong results on several tasks in the past few years, including code generation …

An empirical comparison of pre-trained models of source code

C Niu, C Li, V Ng, D Chen, J Ge… - 2023 IEEE/ACM 45th …, 2023 - ieeexplore.ieee.org
While a large number of pre-trained models of source code have been successfully
developed and applied to a variety of software engineering (SE) tasks in recent years, our …

Natural language to code translation with execution

F Shi, D Fried, M Ghazvininejad, L Zettlemoyer… - arXiv preprint arXiv …, 2022 - arxiv.org
Generative models of code, pretrained on large corpora of programs, have shown great
success in translating natural language to code (Chen et al., 2021; Austin et al., 2021; Li et …
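The execution-based selection this line of work builds on can be sketched roughly as follows. The `solve` entry point, the single test input, and the majority-vote grouping are illustrative assumptions for this sketch, not the paper's exact protocol: the general idea is to execute sampled candidate programs and prefer candidates whose outputs agree with many others.

```python
from collections import defaultdict

def select_by_execution(candidates, test_input):
    """Execute each candidate program, group candidates by the output they
    produce on test_input, and return one candidate from the largest
    agreement group. Candidates that raise an exception are discarded."""
    groups = defaultdict(list)
    for src in candidates:
        env = {}
        try:
            exec(src, env)                  # each candidate defines solve(x)
            out = env["solve"](test_input)
        except Exception:
            continue                        # non-executing candidate: drop it
        groups[repr(out)].append(src)       # bucket by observed behavior
    if not groups:
        return None
    return max(groups.values(), key=len)[0]
```

In a real system the candidates would be samples from a code model and execution would happen in a sandbox rather than via `exec` in-process; this sketch only shows the agreement-based selection step.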