LiveBench: A challenging, contamination-free LLM benchmark
Test set contamination, wherein test data from a benchmark ends up in a newer model's
training set, is a well-documented obstacle for fair LLM evaluation and can quickly render …
Physics of language models: Part 2.1, grade-school math and the hidden reasoning process
Recent advances in language models have demonstrated their capability to solve
mathematical reasoning problems, achieving near-perfect accuracy on grade-school level …
Small language models: Survey, measurements, and insights
Small language models (SLMs), despite their widespread adoption in modern smart
devices, have received significantly less academic attention compared to their large …
Open problems in technical AI governance
AI progress is creating a growing range of risks and opportunities, but it is often unclear how
they should be navigated. In many cases, the barriers and uncertainties faced are at least …
BRIGHT: A realistic and challenging benchmark for reasoning-intensive retrieval
Existing retrieval benchmarks primarily consist of information-seeking queries (e.g.,
aggregated questions from search engines) where keyword or semantic-based retrieval is …
Privacy-ensuring open-weights large language models are competitive with closed-weights GPT-4o in extracting chest radiography findings from free-text reports
Background: Large-scale secondary use of clinical databases requires automated tools for
retrospective extraction of structured content from free-text radiology reports. Purpose: To …
On leakage of code generation evaluation datasets
In this paper, we consider contamination by code generation test sets, in particular in their
use in modern large language models. We discuss three possible sources of such …
Eureka: Evaluating and understanding large foundation models
Rigorous and reproducible evaluation is critical for assessing the state of the art and for
guiding scientific advances in Artificial Intelligence. Evaluation is challenging in practice due …
Is your model really a good math reasoner? Evaluating mathematical reasoning with checklist
Exceptional mathematical reasoning ability is one of the key features that demonstrate the
power of large language models (LLMs). How to comprehensively define and evaluate the …
DynaMath: A dynamic visual benchmark for evaluating mathematical reasoning robustness of vision language models
The rapid advancements in Vision-Language Models (VLMs) have shown great potential in
tackling mathematical reasoning tasks that involve visual context. Unlike humans who can …