The roles of neural networks in language acquisition
How can modern neural networks like language models be useful to the field of language
acquisition, and more broadly cognitive science, if they are not a priori designed to be …
Judging the Judges: A Systematic Investigation of Position Bias in Pairwise Comparative Assessments by LLMs
LLM-as-a-Judge presents a promising alternative to human evaluators across various tasks,
but inherent biases, especially position bias, a tendency to favor solutions based on their …
CodexGraph: Bridging Large Language Models and Code Repositories via Code Graph Databases
Large Language Models (LLMs) excel in stand-alone code tasks like HumanEval and
MBPP, but struggle with handling entire code repositories. This challenge has prompted …
PyBench: Evaluating LLM Agent on various real-world coding tasks
The LLM Agent, equipped with a code interpreter, is capable of automatically solving real-
world coding tasks, such as data analysis and image editing. However, existing benchmarks …
TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks
We interact with computers on an everyday basis, be it in everyday life or work, and many
aspects of work can be done entirely with access to a computer and the Internet. At the same …
Effective Large Language Model Debugging with Best-first Tree Search
Large Language Models (LLMs) show promise in code generation tasks. However, their
code-writing abilities are often limited in scope: while they can successfully implement …
Beyond Correctness: Benchmarking Multi-dimensional Code Generation for Large Language Models
In recent years, researchers have proposed numerous benchmarks to evaluate the
impressive coding capabilities of large language models (LLMs). However, current …
Commit0: Library Generation from Scratch
With the goal of benchmarking generative systems beyond expert software development
ability, we introduce Commit0, a benchmark that challenges AI agents to write libraries from …
TESTEVAL: Benchmarking Large Language Models for Test Case Generation
Testing plays a crucial role in the software development cycle, enabling the detection of
bugs, vulnerabilities, and other undesirable behaviors. To perform software testing, testers …
DI-BENCH: Benchmarking Large Language Models on Dependency Inference with Testable Repositories at Scale
Large Language Models have advanced automated software development; however, it
remains a challenge to correctly infer dependencies, namely, identifying the internal …