LiveBench: A challenging, contamination-free LLM benchmark

C White, S Dooley, M Roberts, A Pal, B Feuer… - arXiv preprint arXiv …, 2024 - arxiv.org
Test set contamination, wherein test data from a benchmark ends up in a newer model's
training set, is a well-documented obstacle for fair LLM evaluation and can quickly render …

Physics of language models: Part 2.1, grade-school math and the hidden reasoning process

T Ye, Z Xu, Y Li, Z Allen-Zhu - The Thirteenth International …, 2024 - openreview.net
Recent advances in language models have demonstrated their capability to solve
mathematical reasoning problems, achieving near-perfect accuracy on grade-school level …

Small language models: Survey, measurements, and insights

Z Lu, X Li, D Cai, R Yi, F Liu, X Zhang, ND Lane… - arXiv preprint arXiv …, 2024 - arxiv.org
Small language models (SLMs), despite their widespread adoption in modern smart
devices, have received significantly less academic attention compared to their large …

Open problems in technical AI governance

A Reuel, B Bucknall, S Casper, T Fist, L Soder… - arXiv preprint arXiv …, 2024 - arxiv.org
AI progress is creating a growing range of risks and opportunities, but it is often unclear how
they should be navigated. In many cases, the barriers and uncertainties faced are at least …

BRIGHT: A realistic and challenging benchmark for reasoning-intensive retrieval

H Su, H Yen, M **a, W Shi, N Muennighoff… - arxiv preprint arxiv …, 2024 - arxiv.org
Existing retrieval benchmarks primarily consist of information-seeking queries (e.g.,
aggregated questions from search engines) where keyword or semantic-based retrieval is …

Privacy-ensuring open-weights large language models are competitive with closed-weights GPT-4o in extracting chest radiography findings from free-text reports

S Nowak, B Wulff, YC Layer, M Theis, A Isaak, B Salam… - Radiology, 2025 - pubs.rsna.org
Background Large-scale secondary use of clinical databases requires automated tools for
retrospective extraction of structured content from free-text radiology reports. Purpose To …

On leakage of code generation evaluation datasets

A Matton, T Sherborne, D Aumiller… - arXiv preprint arXiv …, 2024 - arxiv.org
In this paper, we consider contamination by code generation test sets, in particular in their
use in modern large language models. We discuss three possible sources of such …

Eureka: Evaluating and understanding large foundation models

V Balachandran, J Chen, N Joshi, B Nushi… - arXiv preprint arXiv …, 2024 - arxiv.org
Rigorous and reproducible evaluation is critical for assessing the state of the art and for
guiding scientific advances in Artificial Intelligence. Evaluation is challenging in practice due …

Is your model really a good math reasoner? Evaluating mathematical reasoning with checklist

Z Zhou, S Liu, M Ning, W Liu, J Wang, DF Wong… - arXiv preprint arXiv …, 2024 - arxiv.org
Exceptional mathematical reasoning ability is one of the key features that demonstrate the
power of large language models (LLMs). How to comprehensively define and evaluate the …

DynaMath: A dynamic visual benchmark for evaluating mathematical reasoning robustness of vision language models

C Zou, X Guo, R Yang, J Zhang, B Hu… - arXiv preprint arXiv …, 2024 - arxiv.org
The rapid advancements in Vision-Language Models (VLMs) have shown great potential in
tackling mathematical reasoning tasks that involve visual context. Unlike humans who can …