A survey of machine learning for big code and naturalness

M Allamanis, ET Barr, P Devanbu… - ACM Computing Surveys …, 2018 - dl.acm.org
Research at the intersection of machine learning, programming languages, and software
engineering has recently taken important steps in proposing learnable probabilistic models …

A systematic literature review on source code similarity measurement and clone detection: Techniques, applications, and challenges

M Zakeri-Nasrabadi, S Parsa, M Ramezani… - Journal of Systems and …, 2023 - Elsevier
Measuring and evaluating source code similarity is a fundamental software engineering
activity that embraces a broad range of applications, including but not limited to code …

Scaling data-constrained language models

N Muennighoff, A Rush, B Barak… - Advances in …, 2023 - proceedings.neurips.cc
The current trend of scaling language models involves increasing both parameter count and
training dataset size. Extrapolating this trend suggests that training dataset size may soon be …

Coder reviewer reranking for code generation

T Zhang, T Yu, T Hashimoto, M Lewis… - International …, 2023 - proceedings.mlr.press
Sampling diverse programs from a code language model and reranking with model
likelihood is a popular method for code generation but it is prone to preferring degenerate …

WILDS: A benchmark of in-the-wild distribution shifts

PW Koh, S Sagawa, H Marklund… - International …, 2021 - proceedings.mlr.press
Distribution shifts—where the training distribution differs from the test distribution—can
substantially degrade the accuracy of machine learning (ML) systems deployed in the wild …

A novel neural source code representation based on abstract syntax tree

J Zhang, X Wang, H Zhang, H Sun… - 2019 IEEE/ACM 41st …, 2019 - ieeexplore.ieee.org
Exploiting machine learning techniques for analyzing programs has attracted much
attention. One key problem is how to represent code fragments well for follow-up analysis …

Learning and evaluating contextual embedding of source code

A Kanade, P Maniatis… - … on machine learning, 2020 - proceedings.mlr.press
Recent research has achieved impressive results on understanding and improving source
code by building up on machine-learning techniques developed for natural languages. A …

NatGen: generative pre-training by “naturalizing” source code

S Chakraborty, T Ahmed, Y Ding, PT Devanbu… - Proceedings of the 30th …, 2022 - dl.acm.org
Pre-trained Generative Language models (e.g., PLBART, CodeT5, SPT-Code) for source
code yielded strong results on several tasks in the past few years, including code generation …

code2vec: Learning distributed representations of code

U Alon, M Zilberstein, O Levy, E Yahav - Proceedings of the ACM on …, 2019 - dl.acm.org
We present a neural model for representing snippets of code as continuous distributed
vectors (“code embeddings”). The main idea is to represent a code snippet as a single fixed …

code2seq: Generating sequences from structured representations of code

U Alon, S Brody, O Levy, E Yahav - arXiv preprint arXiv:1808.01400, 2018 - arxiv.org
The ability to generate natural language sequences from source code snippets has a variety
of applications such as code summarization, documentation, and retrieval. Sequence-to …