The landscape of emerging AI agent architectures for reasoning, planning, and tool calling: A survey

T Masterman, S Besen, M Sawtell, A Chao - arXiv preprint arXiv …, 2024 - arxiv.org
This survey paper examines the recent advancements in AI agent implementations, with a
focus on their ability to achieve complex goals that require enhanced reasoning, planning …

Eureka: Evaluating and understanding large foundation models

V Balachandran, J Chen, N Joshi, B Nushi… - arXiv preprint arXiv …, 2024 - arxiv.org
Rigorous and reproducible evaluation is critical for assessing the state of the art and for
guiding scientific advances in Artificial Intelligence. Evaluation is challenging in practice due …

Seeing the unseen: advancing generative AI research in radiology

W Kim - Radiology, 2024 - pubs.rsna.org
… the researchers studying them. The LLMs may also modify our prompts and their outputs.
While this practice may serve as a guardrail against misuse, it can also have undesirable …

How secure is AI-generated code: a large-scale comparison of large language models

N Tihanyi, T Bisztray, MA Ferrag, R Jain… - Empirical Software …, 2025 - Springer
This study compares state-of-the-art Large Language Models (LLMs) on their tendency to
generate vulnerabilities when writing C programs using a neutral zero-shot prompt. Tihanyi …

DICE: Detecting In-distribution Contamination in LLM's Fine-tuning Phase for Math Reasoning

S Tu, K Zhu, Y Bai, Z Yao, L Hou, J Li - arXiv preprint arXiv:2406.04197, 2024 - arxiv.org
The advancement of large language models (LLMs) relies on evaluation using public
benchmarks, but data contamination can lead to overestimated performance. Previous …

Multi-Agent Collaboration Mechanisms: A Survey of LLMs

KT Tran, D Dao, MD Nguyen, QV Pham… - arXiv preprint arXiv …, 2025 - arxiv.org
With recent advances in Large Language Models (LLMs), Agentic AI has become prominent
in real-world applications, moving toward multiple LLM-based agents to …

A Comprehensive Review of AI Advancement Using testFAILS and testFAILS-2 for the Pursuit of AGI

Y Kumar, M Lin, C Paredes, D Li, G Yang… - …, 2024 - search.proquest.com
In a previous paper we defined testFAILS, a set of benchmarks for measuring the efficacy of
Large Language Models in various domains. This paper defines a second-generation …

Dynamic intelligence assessment: Benchmarking LLMs on the road to AGI with a focus on model confidence

N Tihanyi, T Bisztray, RA Dubniczky… - … Conference on Big …, 2024 - ieeexplore.ieee.org
As machine intelligence evolves, the need to test and compare the problem-solving abilities
of different AI models grows. However, current benchmarks are often simplistic, allowing …

Addressing Data Leakage in HumanEval Using Combinatorial Test Design

JS Bradbury, R More - arXiv preprint arXiv:2412.01526, 2024 - arxiv.org
The use of large language models (LLMs) is widespread across many domains, including
Software Engineering, where they have been used to automate tasks such as program …