The roles of neural networks in language acquisition

E Portelance, M Jasbi - Language and Linguistics Compass, 2024 - Wiley Online Library
How can modern neural networks like language models be useful to the field of language
acquisition, and more broadly cognitive science, if they are not a priori designed to be …

Judging the judges: A systematic investigation of position bias in pairwise comparative assessments by LLMs

L Shi, C Ma, W Liang, W Ma, S Vosoughi - arXiv preprint arXiv:2406.07791, 2024 - arxiv.org
LLM-as-a-Judge presents a promising alternative to human evaluators across various tasks,
but inherent biases, especially position bias, a tendency to favor solutions based on their …
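
A minimal sketch of how such position bias might be measured, assuming a hypothetical judge(a, b) callable that wraps an LLM judge and returns "A" or "B": present each pair in both orders and check whether the verdict tracks the answer or the slot.

def positional_consistency(judge, pairs):
    """Fraction of pairs whose verdict survives swapping presentation order.
    judge(a, b) is a hypothetical stand-in for an LLM judge returning "A" or
    "B"; low consistency indicates strong position bias."""
    consistent = 0
    for a, b in pairs:
        first = judge(a, b)    # a shown in slot A
        second = judge(b, a)   # order swapped, so a is now in slot B
        # A position-robust judge prefers the same answer, so the label flips.
        if {first, second} == {"A", "B"}:
            consistent += 1
    return consistent / len(pairs)

# Toy judge that always picks slot A, i.e., maximally position-biased.
biased_judge = lambda a, b: "A"
print(positional_consistency(biased_judge, [("x", "y")] * 10))  # 0.0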

CodexGraph: Bridging large language models and code repositories via code graph databases

X Liu, B Lan, Z Hu, Y Liu, Z Zhang, F Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Language Models (LLMs) excel in stand-alone code tasks like HumanEval and
MBPP, but struggle with handling entire code repositories. This challenge has prompted …
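
The title's idea of a code graph database can be illustrated, very loosely and not with the paper's actual schema, by indexing function definitions and call edges with Python's ast module, so that repository-level questions ("who calls helper?") become graph lookups:

import ast

def build_code_graph(source):
    """Toy code graph: nodes are function names, edges are call relations.
    A real system would persist this in a graph database and resolve
    classes, imports, and cross-file references."""
    graph = {}  # caller -> set of callees
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.FunctionDef):
            graph[node.name] = {
                call.func.id
                for call in ast.walk(node)
                if isinstance(call, ast.Call) and isinstance(call.func, ast.Name)
            }
    return graph

src = "def helper():\n    return 1\n\ndef main():\n    return helper() + helper()\n"
graph = build_code_graph(src)
print(graph)  # {'helper': set(), 'main': {'helper'}}
print([f for f, callees in graph.items() if "helper" in callees])  # ['main']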

PyBench: Evaluating LLM Agent on various real-world coding tasks

Y Zhang, Y Pan, Y Wang, J Cai - arXiv preprint arXiv:2407.16732, 2024 - arxiv.org
The LLM Agent, equipped with a code interpreter, is capable of automatically solving real-world
coding tasks, such as data analysis and image editing. However, existing benchmarks …
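
The described setup, an agent that writes and executes code, can be sketched as a loop that runs model-generated snippets and feeds failures back for repair. Here ask_llm is a hypothetical stand-in for a model call, and real systems would sandbox execution:

import traceback

def code_interpreter_agent(task, ask_llm, max_turns=3):
    """Minimal code-interpreter loop: request code, execute it, and on
    failure return the traceback to the model as extra context."""
    feedback = ""
    for _ in range(max_turns):
        code = ask_llm(f"Task: {task}\n{feedback}\nWrite Python code.")
        namespace = {}
        try:
            exec(code, namespace)  # real agents isolate this in a sandbox
            return namespace.get("result")
        except Exception:
            feedback = "Previous attempt failed:\n" + traceback.format_exc()
    return None

# Toy model that repairs its code once it sees an error message.
def ask_llm(prompt):
    return "result = 1/0" if "failed" not in prompt else "result = 42"

print(code_interpreter_agent("compute the answer", ask_llm))  # 42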

TheAgentCompany: Benchmarking LLM agents on consequential real world tasks

FF Xu, Y Song, B Li, Y Tang, K Jain, M Bao… - arXiv preprint arXiv …, 2024 - arxiv.org
We interact with computers on an everyday basis, whether in personal life or at work, and many
aspects of work can be done entirely with access to a computer and the Internet. At the same …

Effective Large Language Model Debugging with Best-first Tree Search

J Song, J Raiman, B Catanzaro - arXiv preprint arXiv:2407.19055, 2024 - arxiv.org
Large Language Models (LLMs) show promise in code generation tasks. However, their
code-writing abilities are often limited in scope: while they can successfully implement …
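
Best-first tree search over candidate fixes can be outlined with a priority queue that always expands the best-scoring candidate, here the one with the fewest failing tests. This is a generic sketch with assumed helpers (propose_fixes, num_failures), not the paper's implementation:

import heapq
import itertools

def best_first_debug(program, propose_fixes, num_failures, budget=100):
    """Expand the candidate with the fewest failing tests first.
    propose_fixes(p) (e.g., LLM-suggested edits) and num_failures(p)
    are assumed helpers supplied by the caller."""
    tie = itertools.count()  # tie-breaker so heapq never compares programs
    frontier = [(num_failures(program), next(tie), program)]
    while frontier and budget > 0:
        score, _, current = heapq.heappop(frontier)
        if score == 0:
            return current  # all tests pass
        for child in propose_fixes(current):
            heapq.heappush(frontier, (num_failures(child), next(tie), child))
            budget -= 1
    return None

# Toy run: "debug" an integer toward 0 with +/-1 edits as the fix proposals.
print(best_first_debug(5, lambda p: [p - 1, p + 1], abs))  # 0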

Beyond Correctness: Benchmarking Multi-dimensional Code Generation for Large Language Models

J Zheng, B Cao, Z Ma, R Pan, H Lin, Y Lu… - arXiv preprint arXiv …, 2024 - arxiv.org
In recent years, researchers have proposed numerous benchmarks to evaluate the
impressive coding capabilities of large language models (LLMs). However, current …

Commit0: Library Generation from Scratch

W Zhao, N Jiang, C Lee, JT Chiu, C Cardie… - arXiv preprint arXiv …, 2024 - arxiv.org
With the goal of benchmarking generative systems beyond expert software development
ability, we introduce Commit0, a benchmark that challenges AI agents to write libraries from …

TESTEVAL: Benchmarking Large Language Models for Test Case Generation

W Wang, C Yang, Z Wang, Y Huang, Z Chu… - arXiv preprint arXiv …, 2024 - arxiv.org
Testing plays a crucial role in the software development cycle, enabling the detection of
bugs, vulnerabilities, and other undesirable behaviors. To perform software testing, testers …
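
One natural signal for judging generated test cases, which coverage-oriented benchmarks in this space commonly use, is how much of the target code they exercise. A toy proxy using sys.settrace (real evaluations would use a tool like coverage.py):

import sys

def lines_hit(func, test_inputs):
    """Distinct source lines of func executed by the given inputs."""
    hit = set()

    def tracer(frame, event, arg):
        if frame.f_code is func.__code__ and event == "line":
            hit.add(frame.f_lineno)
        return tracer

    sys.settrace(tracer)
    try:
        for args in test_inputs:
            func(*args)
    finally:
        sys.settrace(None)
    return hit

def classify(x):
    if x > 0:
        return "positive"
    return "non-positive"

# A suite covering both branches exercises strictly more lines.
print(len(lines_hit(classify, [(1,)])) < len(lines_hit(classify, [(1,), (-1,)])))  # True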

DI-BENCH: Benchmarking Large Language Models on Dependency Inference with Testable Repositories at Scale

L Zhang, J Wang, S He, C Zhang, Y Kang, B Li… - arXiv preprint arXiv …, 2025 - arxiv.org
Large Language Models have advanced automated software development; however, it
remains a challenge to correctly infer dependencies, namely, identifying the internal …
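
Dependency inference in this sense means recovering the packages a repository needs. A naive baseline, far short of what the benchmark targets, parses import statements with ast and subtracts stdlib and in-repo modules (mapping import names to package names, e.g. cv2 to opencv-python, is deliberately ignored here):

import ast
import sys

def inferred_dependencies(sources, local_modules=frozenset()):
    """Collect top-level imported module names from Python sources and
    drop standard-library and in-repo modules (requires Python 3.10+
    for sys.stdlib_module_names)."""
    imported = set()
    for src in sources:
        for node in ast.walk(ast.parse(src)):
            if isinstance(node, ast.Import):
                imported.update(alias.name.split(".")[0] for alias in node.names)
            elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
                imported.add(node.module.split(".")[0])
    return sorted(imported - set(sys.stdlib_module_names) - set(local_modules))

files = ["import os\nimport numpy as np\nfrom utils import helper\n"]
print(inferred_dependencies(files, local_modules={"utils"}))  # ['numpy']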