A survey of machine learning for big code and naturalness

M Allamanis, ET Barr, P Devanbu… - ACM Computing Surveys …, 2018 - dl.acm.org
Research at the intersection of machine learning, programming languages, and software
engineering has recently taken important steps in proposing learnable probabilistic models …

A survey on deep graph generation: Methods and applications

Y Zhu, Y Du, Y Wang, Y Xu, J Zhang… - Learning on Graphs …, 2022 - proceedings.mlr.press
Graphs are ubiquitous in encoding relational information of real-world objects in many
domains. Graph generation, whose purpose is to generate new graphs from a distribution …

CodeXGLUE: A machine learning benchmark dataset for code understanding and generation

S Lu, D Guo, S Ren, J Huang, A Svyatkovskiy… - arXiv preprint arXiv …, 2021 - arxiv.org
Benchmark datasets have a significant impact on accelerating research in programming
language tasks. In this paper, we introduce CodeXGLUE, a benchmark dataset to foster …

LongCoder: A long-range pre-trained language model for code completion

D Guo, C Xu, N Duan, J Yin… - … Conference on Machine …, 2023 - proceedings.mlr.press
In this paper, we introduce a new task for code completion that focuses on handling long
code input and propose a sparse Transformer model, called LongCoder, to address this …

RepoBench: Benchmarking repository-level code auto-completion systems

T Liu, C Xu, J McAuley - arXiv preprint arXiv:2306.03091, 2023 - arxiv.org
Large Language Models (LLMs) have greatly advanced code auto-completion systems, with
a potential for substantial productivity enhancements for developers. However, current …

No need to lift a finger anymore? Assessing the quality of code generation by ChatGPT

Z Liu, Y Tang, X Luo, Y Zhou… - IEEE Transactions on …, 2024 - ieeexplore.ieee.org
Large language models (LLMs) have demonstrated impressive capabilities across various
natural language processing (NLP) tasks, such as machine translation, question answering …

code2vec: Learning distributed representations of code

U Alon, M Zilberstein, O Levy, E Yahav - Proceedings of the ACM on …, 2019 - dl.acm.org
We present a neural model for representing snippets of code as continuous distributed
vectors ("code embeddings"). The main idea is to represent a code snippet as a single fixed …

Is GitHub's Copilot as bad as humans at introducing vulnerabilities in code?

O Asare, M Nagappan, N Asokan - Empirical Software Engineering, 2023 - Springer
Several advances in deep learning have been successfully applied to the software
development process. Of recent interest is the use of neural language models to build tools …

code2seq: Generating sequences from structured representations of code

U Alon, S Brody, O Levy, E Yahav - arXiv preprint arXiv:1808.01400, 2018 - arxiv.org
The ability to generate natural language sequences from source code snippets has a variety
of applications such as code summarization, documentation, and retrieval. Sequence-to …

Unifying the perspectives of NLP and software engineering: A survey on language models for code

Z Zhang, C Chen, B Liu, C Liao, Z Gong, H Yu… - arXiv preprint arXiv …, 2023 - arxiv.org
In this work we systematically review the recent advancements in software engineering with
language models, covering 70+ models, 40+ evaluation tasks, 180+ datasets, and 900 …