LiveBench: A challenging, contamination-free LLM benchmark

C White, S Dooley, M Roberts, A Pal, B Feuer… - arXiv preprint arXiv …, 2024 - arxiv.org
Test set contamination, wherein test data from a benchmark ends up in a newer model's
training set, is a well-documented obstacle for fair LLM evaluation and can quickly render …
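
The snippet refers to test set contamination. A common way to screen for it (not claimed to be LiveBench's own approach, which instead draws questions from recently released sources) is word-level n-gram overlap between test items and training documents; the sketch below assumes a plain-text corpus and an 8-gram window.

```python
# Generic n-gram overlap check of the kind often used to screen for test set
# contamination. Illustration only, not LiveBench's methodology.

def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    """Return the set of word-level n-grams in a text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(test_item: str, training_doc: str, n: int = 8) -> bool:
    """Flag a test item if any of its n-grams also appears in a training document."""
    return bool(ngrams(test_item, n) & ngrams(training_doc, n))

# Example: a verbatim copy of a test question in the training corpus is flagged.
question = "What is the sum of the first 100 positive integers, and how is it derived?"
corpus_doc = "Q: What is the sum of the first 100 positive integers, and how is it derived? A: 5050."
print(is_contaminated(question, corpus_doc))  # True
```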

FrontierMath: A benchmark for evaluating advanced mathematical reasoning in AI

E Glazer, E Erdil, T Besiroglu, D Chicharro… - arXiv preprint arXiv …, 2024 - arxiv.org
We introduce FrontierMath, a benchmark of hundreds of original, exceptionally challenging
mathematics problems crafted and vetted by expert mathematicians. The questions cover …

ProcessBench: Identifying process errors in mathematical reasoning

C Zheng, Z Zhang, B Zhang, R Lin, K Lu, B Yu… - arXiv preprint arXiv …, 2024 - arxiv.org
As language models regularly make mistakes when solving math problems, automated
identification of errors in the reasoning process becomes increasingly significant for their …
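
ProcessBench-style evaluation asks a critic to locate the earliest erroneous step in a step-by-step solution, or to report that none exists. The sketch below is a minimal illustration of that scoring setup; the record format and the -1 "no error" convention are assumptions, not the benchmark's exact schema.

```python
# Minimal sketch of step-level error identification scoring: each solution is a
# list of steps, and the critic must point to the earliest wrong step (or -1).

from dataclasses import dataclass

@dataclass
class ProcessItem:
    steps: list[str]   # the step-by-step solution
    first_error: int   # index of the earliest wrong step, or -1 if none

def score(predictions: list[int], items: list[ProcessItem]) -> float:
    """Accuracy of predicted earliest-error indices against the labels."""
    correct = sum(p == item.first_error for p, item in zip(predictions, items))
    return correct / len(items)

items = [
    ProcessItem(["2 + 3 = 5", "5 * 4 = 20"], first_error=-1),  # fully correct solution
    ProcessItem(["2 + 3 = 6", "6 * 4 = 24"], first_error=0),   # error introduced in step 0
]
print(score([-1, 0], items))  # 1.0
```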

Are Your LLMs Capable of Stable Reasoning?

J Liu, H Liu, L Xiao, Z Wang, K Liu, S Gao… - arXiv preprint arXiv …, 2024 - arxiv.org
The rapid advancement of Large Language Models (LLMs) has demonstrated remarkable
progress in complex reasoning tasks. However, a significant discrepancy persists between …
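
A standard way to probe stability is to sample each question many times rather than trusting a single greedy decode. The snippet below implements the widely used unbiased pass@k estimator, 1 - C(n-c, k)/C(n, k); it is generic background for sampling-based evaluation, not the stability metric this particular paper proposes.

```python
# Unbiased pass@k estimator: probability that at least one of k samples drawn
# from n is correct, given that c of the n samples are correct.

from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:
        return 1.0  # too few incorrect samples left: a correct one is guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)

# A model that is right on 4 of 16 samples looks strong at pass@8 but weak at pass@1.
print(round(pass_at_k(n=16, c=4, k=1), 3))  # 0.25
print(round(pass_at_k(n=16, c=4, k=8), 3))  # ~0.96
```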

CodeElo: Benchmarking Competition-level Code Generation of LLMs with Human-comparable Elo Ratings

S Quan, J Yang, B Yu, B Zheng, D Liu, A Yang… - arXiv preprint arXiv …, 2025 - arxiv.org
With the increasing code reasoning capabilities of existing large language models (LLMs)
and breakthroughs in reasoning models like OpenAI o1 and o3, there is a growing need to …
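
As background on "human-comparable Elo ratings", the textbook Elo update rule is shown below; CodeElo's actual rating procedure, which is aligned with Codeforces contests, may differ in its details.

```python
# Standard Elo rating update: expected score from the rating gap, then a
# K-factor-weighted correction toward the actual game result.

def expected_score(r_a: float, r_b: float) -> float:
    """Expected score of player A against player B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def update(r_a: float, r_b: float, score_a: float, k: float = 32.0) -> float:
    """A's new rating after a game with result score_a (1 win, 0.5 draw, 0 loss)."""
    return r_a + k * (score_a - expected_score(r_a, r_b))

# A 1500-rated model beating a 1600-rated opponent gains about 20 points.
print(round(update(1500, 1600, 1.0), 1))  # ≈ 1520.5
```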

A Survey on Feedback-based Multi-step Reasoning for Large Language Models on Mathematics

TR Wei, H Liu, X Wu, Y Fang - arXiv preprint arXiv:2502.14333, 2025 - arxiv.org
Recent progress in large language models (LLMs) has found that chain-of-thought prompting
strategies improve the reasoning ability of LLMs by encouraging problem solving through …
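
Feedback-based multi-step reasoning methods largely share a generate-verify-refine control loop. The sketch below shows only that loop; generate() and verify() are placeholder callables standing in for a model call and a feedback source (self-critique, a reward model, a tool check), not functions from any particular library.

```python
# Generic generate-verify-refine loop: regenerate until the verifier returns
# no feedback or the round budget is exhausted.

from typing import Callable

def solve_with_feedback(
    problem: str,
    generate: Callable[[str, str], str],  # (problem, feedback) -> candidate solution
    verify: Callable[[str, str], str],    # (problem, solution) -> feedback ("" if accepted)
    max_rounds: int = 3,
) -> str:
    feedback = ""
    solution = ""
    for _ in range(max_rounds):
        solution = generate(problem, feedback)
        feedback = verify(problem, solution)
        if not feedback:  # empty feedback means the verifier accepts the solution
            break
    return solution
```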

Examining False Positives under Inference Scaling for Mathematical Reasoning

Y Wang, N Yang, L Wang, F Wei - arXiv preprint arXiv:2502.06217, 2025 - arxiv.org
Recent advancements in language models have led to significant improvements in
mathematical reasoning across various benchmarks. However, most of these benchmarks …
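
A false positive arises when answer-only grading accepts a solution whose reasoning is flawed but whose final answer happens to match the reference. A toy illustration (not the paper's evaluation protocol) follows.

```python
# Answer-only grading compares the final answer to the reference and ignores
# the derivation, so a flawed chain of reasoning can still be marked correct.

def answer_only_grade(prediction: str, reference: str) -> bool:
    """Mark a solution correct if its final line matches the reference answer."""
    final_line = prediction.strip().splitlines()[-1]
    return final_line.strip() == reference.strip()

flawed_solution = (
    "Step 1: 6 * 7 = 48\n"   # wrong: 6 * 7 is 42
    "Step 2: 48 - 6 = 42\n"  # unjustified step that happens to land on the right answer
    "42"
)
print(answer_only_grade(flawed_solution, "42"))  # True: a false positive
```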

On the logical skills of large language models: evaluations using arbitrarily complex first-order logic problems

S Ibragimov, A Jentzen, B Kuckuck - arXiv preprint arXiv:2502.14180, 2025 - arxiv.org
We present a method of generating first-order logic statements whose complexity can be
controlled along multiple dimensions. We use this method to automatically create several …
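
One simple way to generate first-order logic statements with controllable complexity is a recursive grammar whose nesting depth is the complexity knob. The paper controls several dimensions, so the sketch below, with its assumed predicate and variable pools, is only a simplified illustration.

```python
# Recursive generator of first-order formulas whose nesting depth is exactly `depth`.

import random

VARIABLES = ["x", "y", "z"]
PREDICATES = ["P", "Q", "R"]

def random_formula(depth: int, rng: random.Random) -> str:
    if depth == 0:
        return f"{rng.choice(PREDICATES)}({rng.choice(VARIABLES)})"  # atomic formula
    kind = rng.choice(["not", "and", "or", "forall", "exists"])
    if kind == "not":
        return f"¬{random_formula(depth - 1, rng)}"
    if kind in ("and", "or"):
        op = "∧" if kind == "and" else "∨"
        return f"({random_formula(depth - 1, rng)} {op} {random_formula(depth - 1, rng)})"
    quant = "∀" if kind == "forall" else "∃"
    return f"{quant}{rng.choice(VARIABLES)}. {random_formula(depth - 1, rng)}"

print(random_formula(depth=3, rng=random.Random(0)))
```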

Understand, Solve and Translate: Bridging the Multilingual Mathematical Reasoning Gap

H Ko, G Son, D Choi - arXiv preprint arXiv:2501.02448, 2025 - arxiv.org
Large language models (LLMs) demonstrate exceptional performance on complex
reasoning tasks. However, despite their strong reasoning capabilities in high-resource …
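
A common way to bridge the multilingual gap is to reason in a high-resource language and translate back. The sketch below shows that generic pipeline with placeholder translate() and solve() calls; it is not asserted to match the paper's specific recipe.

```python
# Generic translate-then-solve pipeline: translate the problem into English,
# solve it there, then return the answer in the source language.

from typing import Callable

def translate_then_solve(
    problem: str,
    source_lang: str,
    translate: Callable[[str, str, str], str],  # (text, from_lang, to_lang) -> text
    solve: Callable[[str], str],                # English problem -> English answer
) -> str:
    english_problem = translate(problem, source_lang, "en")
    english_answer = solve(english_problem)
    return translate(english_answer, "en", source_lang)
```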

FastMCTS: A Simple Sampling Strategy for Data Synthesis

P Li, K Lv, Y Shao, Y Ma, L Li, X Zheng, X Qiu… - arXiv preprint arXiv …, 2025 - arxiv.org
Synthetic high-quality multi-step reasoning data can significantly enhance the performance
of large language models on various tasks. However, most existing methods rely on …
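
MCTS-style data synthesis expands promising partial reasoning paths first instead of sampling all paths uniformly. The snippet below shows the textbook UCB1 selection rule that such methods typically build on; it is background, not the paper's specific sampling strategy.

```python
# UCB1-style node selection for a reasoning tree: pick the child that maximizes
# mean reward plus an exploration bonus, trying unvisited children first.

import math
from dataclasses import dataclass, field

@dataclass
class Node:
    visits: int = 0
    total_reward: float = 0.0
    children: list["Node"] = field(default_factory=list)

def uct_select(parent: Node, c: float = 1.41) -> Node:
    def uct(child: Node) -> float:
        if child.visits == 0:
            return float("inf")  # always try unvisited branches first
        exploit = child.total_reward / child.visits
        explore = c * math.sqrt(math.log(parent.visits) / child.visits)
        return exploit + explore
    return max(parent.children, key=uct)
```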