LiveBench: A challenging, contamination-free LLM benchmark

C White, S Dooley, M Roberts, A Pal, B Feuer… - arXiv preprint arXiv …, 2024 - arxiv.org
Test set contamination, wherein test data from a benchmark ends up in a newer model's
training set, is a well-documented obstacle for fair LLM evaluation and can quickly render …

Physics of language models: Part 2.1, grade-school math and the hidden reasoning process

T Ye, Z Xu, Y Li, Z Allen-Zhu - The Thirteenth International …, 2024 - openreview.net
Recent advances in language models have demonstrated their capability to solve
mathematical reasoning problems, achieving near-perfect accuracy on grade-school level …

Small language models: Survey, measurements, and insights

Z Lu, X Li, D Cai, R Yi, F Liu, X Zhang, ND Lane… - arXiv preprint arXiv …, 2024 - arxiv.org
Small language models (SLMs), despite their widespread adoption in modern smart
devices, have received significantly less academic attention compared to their large …

Open problems in technical AI governance

A Reuel, B Bucknall, S Casper, T Fist, L Soder… - arXiv preprint arXiv …, 2024 - arxiv.org
AI progress is creating a growing range of risks and opportunities, but it is often unclear how
they should be navigated. In many cases, the barriers and uncertainties faced are at least …

BRIGHT: A realistic and challenging benchmark for reasoning-intensive retrieval

H Su, H Yen, M **a, W Shi, N Muennighoff… - arxiv preprint arxiv …, 2024 - arxiv.org
Existing retrieval benchmarks primarily consist of information-seeking queries (e.g.,
aggregated questions from search engines) where keyword or semantic-based retrieval is …

Privacy-ensuring open-weights large language models are competitive with closed-weights GPT-4o in extracting chest radiography findings from free-text reports

S Nowak, B Wulff, YC Layer, M Theis, A Isaak, B Salam… - Radiology, 2025 - pubs.rsna.org
Background Large-scale secondary use of clinical databases requires automated tools for
retrospective extraction of structured content from free-text radiology reports. Purpose To …

On leakage of code generation evaluation datasets

A Matton, T Sherborne, D Aumiller… - arXiv preprint arXiv …, 2024 - arxiv.org
In this paper, we consider contamination by code generation test sets, in particular in their
use in modern large language models. We discuss three possible sources of such …

Eureka: Evaluating and understanding large foundation models

V Balachandran, J Chen, N Joshi, B Nushi… - arXiv preprint arXiv …, 2024 - arxiv.org
Rigorous and reproducible evaluation is critical for assessing the state of the art and for
guiding scientific advances in Artificial Intelligence. Evaluation is challenging in practice due …

Is your model really a good math reasoner? Evaluating mathematical reasoning with checklist

Z Zhou, S Liu, M Ning, W Liu, J Wang, DF Wong… - arXiv preprint arXiv …, 2024 - arxiv.org
Exceptional mathematical reasoning ability is one of the key features that demonstrate the
power of large language models (LLMs). How to comprehensively define and evaluate the …

DynaMath: A dynamic visual benchmark for evaluating mathematical reasoning robustness of vision language models

C Zou, X Guo, R Yang, J Zhang, B Hu… - arXiv preprint arXiv …, 2024 - arxiv.org
The rapid advancements in Vision-Language Models (VLMs) have shown great potential in
tackling mathematical reasoning tasks that involve visual context. Unlike humans who can …