The landscape of emerging AI agent architectures for reasoning, planning, and tool calling: A survey

T Masterman, S Besen, M Sawtell, A Chao - arXiv preprint arXiv …, 2024 - arxiv.org
This survey paper examines the recent advancements in AI agent implementations, with a
focus on their ability to achieve complex goals that require enhanced reasoning, planning …

Eureka: Evaluating and understanding large foundation models

V Balachandran, J Chen, N Joshi, B Nushi… - arXiv preprint arXiv …, 2024 - arxiv.org
Rigorous and reproducible evaluation is critical for assessing the state of the art and for
guiding scientific advances in Artificial Intelligence. Evaluation is challenging in practice due …

Seeing the unseen: advancing generative AI research in radiology

W Kim - Radiology, 2024 - pubs.rsna.org
… the researchers studying them. The LLMs may also modify our prompts and their outputs.
While this practice may serve as a guardrail against misuse, it can also have undesirable …

How secure is AI-generated code: a large-scale comparison of large language models

N Tihanyi, T Bisztray, MA Ferrag, R Jain… - Empirical Software …, 2025 - Springer
This study compares state-of-the-art Large Language Models (LLMs) on their tendency to
generate vulnerabilities when writing C programs using a neutral zero-shot prompt. Tihanyi …

DICE: Detecting In-distribution Contamination in LLM's Fine-tuning Phase for Math Reasoning

S Tu, K Zhu, Y Bai, Z Yao, L Hou, J Li - arXiv preprint arXiv:2406.04197, 2024 - arxiv.org
The advancement of large language models (LLMs) relies on evaluation using public
benchmarks, but data contamination can lead to overestimated performance. Previous …

Multi-Agent Collaboration Mechanisms: A Survey of LLMs

KT Tran, D Dao, MD Nguyen, QV Pham… - arXiv preprint arXiv …, 2025 - arxiv.org
With recent advances in Large Language Models (LLMs), Agentic AI has become prominent
in real-world applications, moving toward multiple LLM-based agents to …

A Comprehensive Review of AI Advancement Using testFAILS and testFAILS-2 for the Pursuit of AGI

Y Kumar, M Lin, C Paredes, D Li, G Yang… - …, 2024 - search.proquest.com
In a previous paper we defined testFAILS, a set of benchmarks for measuring the efficacy of
Large Language Models in various domains. This paper defines a second-generation …

Dynamic intelligence assessment: Benchmarking LLMs on the road to AGI with a focus on model confidence

N Tihanyi, T Bisztray, RA Dubniczky… - … Conference on Big …, 2024 - ieeexplore.ieee.org
As machine intelligence evolves, the need to test and compare the problem-solving abilities
of different AI models grows. However, current benchmarks are often simplistic, allowing …

Addressing Data Leakage in HumanEval Using Combinatorial Test Design

JS Bradbury, R More - arXiv preprint arXiv:2412.01526, 2024 - arxiv.org
The use of large language models (LLMs) is widespread across many domains, including
Software Engineering, where they have been used to automate tasks such as program …