The roles of neural networks in language acquisition
How can modern neural networks like language models be useful to the field of language
acquisition, and more broadly cognitive science, if they are not a priori designed to be …
Judging the Judges: A Systematic Investigation of Position Bias in Pairwise Comparative Assessments by LLMs
LLM-as-a-Judge presents a promising alternative to human evaluators across various tasks,
but inherent biases, especially position bias, a tendency to favor solutions based on their …
CodexGraph: Bridging Large Language Models and Code Repositories via Code Graph Databases
Large Language Models (LLMs) excel in stand-alone code tasks like HumanEval and
MBPP, but struggle with handling entire code repositories. This challenge has prompted …
PyBench: Evaluating LLM Agent on various real-world coding tasks
The LLM Agent, equipped with a code interpreter, is capable of automatically solving real-
world coding tasks, such as data analysis and image editing. However, existing benchmarks …
TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks
We interact with computers on an everyday basis, be it in everyday life or work, and many
aspects of work can be done entirely with access to a computer and the Internet. At the same …
Effective Large Language Model Debugging with Best-first Tree Search
Large Language Models (LLMs) show promise in code generation tasks. However, their
code-writing abilities are often limited in scope: while they can successfully implement …
Beyond Correctness: Benchmarking Multi-dimensional Code Generation for Large Language Models
In recent years, researchers have proposed numerous benchmarks to evaluate the
impressive coding capabilities of large language models (LLMs). However, current …
Commit0: Library Generation from Scratch
With the goal of benchmarking generative systems beyond expert software development
ability, we introduce Commit0, a benchmark that challenges AI agents to write libraries from …
TESTEVAL: Benchmarking Large Language Models for Test Case Generation
Testing plays a crucial role in the software development cycle, enabling the detection of
bugs, vulnerabilities, and other undesirable behaviors. To perform software testing, testers …
DI-BENCH: Benchmarking Large Language Models on Dependency Inference with Testable Repositories at Scale
Large Language Models have advanced automated software development; however, it
remains a challenge to correctly infer dependencies, namely, identifying the internal …