LiveBench: A challenging, contamination-free LLM benchmark

C White, S Dooley, M Roberts, A Pal, B Feuer… - arXiv preprint arXiv …, 2024 - arxiv.org
Test set contamination, wherein test data from a benchmark ends up in a newer model's
training set, is a well-documented obstacle for fair LLM evaluation and can quickly render …
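
The snippet refers to test set contamination. A common way to screen for it (not claimed to be LiveBench's own approach, which instead draws questions from recently released sources) is word-level n-gram overlap between test items and training documents; the sketch below assumes a plain-text corpus and an 8-gram window.

```python
# Generic n-gram overlap check of the kind often used to screen for test set
# contamination. Illustration only, not LiveBench's methodology.

def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    """Return the set of word-level n-grams in a text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(test_item: str, training_doc: str, n: int = 8) -> bool:
    """Flag a test item if any of its n-grams also appears in a training document."""
    return bool(ngrams(test_item, n) & ngrams(training_doc, n))

# Example: a verbatim copy of a test question in the training corpus is flagged.
question = "What is the sum of the first 100 positive integers, and how is it derived?"
corpus_doc = "Q: What is the sum of the first 100 positive integers, and how is it derived? A: 5050."
print(is_contaminated(question, corpus_doc))  # True
```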

FrontierMath: A benchmark for evaluating advanced mathematical reasoning in AI

E Glazer, E Erdil, T Besiroglu, D Chicharro… - arXiv preprint arXiv …, 2024 - arxiv.org
We introduce FrontierMath, a benchmark of hundreds of original, exceptionally challenging
mathematics problems crafted and vetted by expert mathematicians. The questions cover …

ProcessBench: Identifying process errors in mathematical reasoning

C Zheng, Z Zhang, B Zhang, R Lin, K Lu, B Yu… - arXiv preprint arXiv …, 2024 - arxiv.org
As language models regularly make mistakes when solving math problems, automated
identification of errors in the reasoning process becomes increasingly significant for their …
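
ProcessBench-style evaluation asks a critic to locate the earliest erroneous step in a step-by-step solution, or to report that none exists. The sketch below is a minimal illustration of that scoring setup; the record format and the -1 "no error" convention are assumptions, not the benchmark's exact schema.

```python
# Minimal sketch of step-level error identification scoring: each solution is a
# list of steps, and the critic must point to the earliest wrong step (or -1).

from dataclasses import dataclass

@dataclass
class ProcessItem:
    steps: list[str]   # the step-by-step solution
    first_error: int   # index of the earliest wrong step, or -1 if none

def score(predictions: list[int], items: list[ProcessItem]) -> float:
    """Accuracy of predicted earliest-error indices against the labels."""
    correct = sum(p == item.first_error for p, item in zip(predictions, items))
    return correct / len(items)

items = [
    ProcessItem(["2 + 3 = 5", "5 * 4 = 20"], first_error=-1),  # fully correct solution
    ProcessItem(["2 + 3 = 6", "6 * 4 = 24"], first_error=0),   # error introduced in step 0
]
print(score([-1, 0], items))  # 1.0
```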

Are Your LLMs Capable of Stable Reasoning?

J Liu, H Liu, L Xiao, Z Wang, K Liu, S Gao… - arXiv preprint arXiv …, 2024 - arxiv.org
The rapid advancement of Large Language Models (LLMs) has demonstrated remarkable
progress in complex reasoning tasks. However, a significant discrepancy persists between …
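
A standard way to probe stability is to sample each question many times rather than trusting a single greedy decode. The snippet below implements the widely used unbiased pass@k estimator, 1 - C(n-c, k)/C(n, k); it is generic background for sampling-based evaluation, not the stability metric this particular paper proposes.

```python
# Unbiased pass@k estimator: probability that at least one of k samples drawn
# from n is correct, given that c of the n samples are correct.

from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:
        return 1.0  # too few incorrect samples left: a correct one is guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)

# A model that is right on 4 of 16 samples looks strong at pass@8 but weak at pass@1.
print(round(pass_at_k(n=16, c=4, k=1), 3))  # 0.25
print(round(pass_at_k(n=16, c=4, k=8), 3))  # ~0.96
```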

CodeElo: Benchmarking Competition-level Code Generation of LLMs with Human-comparable Elo Ratings

S Quan, J Yang, B Yu, B Zheng, D Liu, A Yang… - arXiv preprint arXiv …, 2025 - arxiv.org
With the increasing code reasoning capabilities of existing large language models (LLMs)
and breakthroughs in reasoning models like OpenAI o1 and o3, there is a growing need to …
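
As background on "human-comparable Elo ratings", the textbook Elo update rule is shown below; CodeElo's actual rating procedure, which is aligned with Codeforces contests, may differ in its details.

```python
# Standard Elo rating update: expected score from the rating gap, then a
# K-factor-weighted correction toward the actual game result.

def expected_score(r_a: float, r_b: float) -> float:
    """Expected score of player A against player B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def update(r_a: float, r_b: float, score_a: float, k: float = 32.0) -> float:
    """A's new rating after a game with result score_a (1 win, 0.5 draw, 0 loss)."""
    return r_a + k * (score_a - expected_score(r_a, r_b))

# A 1500-rated model beating a 1600-rated opponent gains about 20 points.
print(round(update(1500, 1600, 1.0), 1))  # ≈ 1520.5
```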

A Survey on Feedback-based Multi-step Reasoning for Large Language Models on Mathematics

TR Wei, H Liu, X Wu, Y Fang - arXiv preprint arXiv:2502.14333, 2025 - arxiv.org
Recent progress in large language models (LLMs) has found that chain-of-thought prompting
strategies improve the reasoning ability of LLMs by encouraging problem solving through …
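
Feedback-based multi-step reasoning methods largely share a generate-verify-refine control loop. The sketch below shows only that loop; generate() and verify() are placeholder callables standing in for a model call and a feedback source (self-critique, a reward model, a tool check), not functions from any particular library.

```python
# Generic generate-verify-refine loop: regenerate until the verifier returns
# no feedback or the round budget is exhausted.

from typing import Callable

def solve_with_feedback(
    problem: str,
    generate: Callable[[str, str], str],  # (problem, feedback) -> candidate solution
    verify: Callable[[str, str], str],    # (problem, solution) -> feedback ("" if accepted)
    max_rounds: int = 3,
) -> str:
    feedback = ""
    solution = ""
    for _ in range(max_rounds):
        solution = generate(problem, feedback)
        feedback = verify(problem, solution)
        if not feedback:  # empty feedback means the verifier accepts the solution
            break
    return solution
```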

Examining False Positives under Inference Scaling for Mathematical Reasoning

Y Wang, N Yang, L Wang, F Wei - arXiv preprint arXiv:2502.06217, 2025 - arxiv.org
Recent advancements in language models have led to significant improvements in
mathematical reasoning across various benchmarks. However, most of these benchmarks …
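
A false positive arises when answer-only grading accepts a solution whose reasoning is flawed but whose final answer happens to match the reference. A toy illustration (not the paper's evaluation protocol) follows.

```python
# Answer-only grading compares the final answer to the reference and ignores
# the derivation, so a flawed chain of reasoning can still be marked correct.

def answer_only_grade(prediction: str, reference: str) -> bool:
    """Mark a solution correct if its final line matches the reference answer."""
    final_line = prediction.strip().splitlines()[-1]
    return final_line.strip() == reference.strip()

flawed_solution = (
    "Step 1: 6 * 7 = 48\n"   # wrong: 6 * 7 is 42
    "Step 2: 48 - 6 = 42\n"  # unjustified step that happens to land on the right answer
    "42"
)
print(answer_only_grade(flawed_solution, "42"))  # True: a false positive
```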

On the logical skills of large language models: evaluations using arbitrarily complex first-order logic problems

S Ibragimov, A Jentzen, B Kuckuck - arXiv preprint arXiv:2502.14180, 2025 - arxiv.org
We present a method of generating first-order logic statements whose complexity can be
controlled along multiple dimensions. We use this method to automatically create several …
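
One simple way to generate first-order logic statements with controllable complexity is a recursive grammar whose nesting depth is the complexity knob. The paper controls several dimensions, so the sketch below, with its assumed predicate and variable pools, is only a simplified illustration.

```python
# Recursive generator of first-order formulas whose nesting depth is exactly `depth`.

import random

VARIABLES = ["x", "y", "z"]
PREDICATES = ["P", "Q", "R"]

def random_formula(depth: int, rng: random.Random) -> str:
    if depth == 0:
        return f"{rng.choice(PREDICATES)}({rng.choice(VARIABLES)})"  # atomic formula
    kind = rng.choice(["not", "and", "or", "forall", "exists"])
    if kind == "not":
        return f"¬{random_formula(depth - 1, rng)}"
    if kind in ("and", "or"):
        op = "∧" if kind == "and" else "∨"
        return f"({random_formula(depth - 1, rng)} {op} {random_formula(depth - 1, rng)})"
    quant = "∀" if kind == "forall" else "∃"
    return f"{quant}{rng.choice(VARIABLES)}. {random_formula(depth - 1, rng)}"

print(random_formula(depth=3, rng=random.Random(0)))
```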

Understand, Solve and Translate: Bridging the Multilingual Mathematical Reasoning Gap

H Ko, G Son, D Choi - arXiv preprint arXiv:2501.02448, 2025 - arxiv.org
Large language models (LLMs) demonstrate exceptional performance on complex
reasoning tasks. However, despite their strong reasoning capabilities in high-resource …
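
A common way to bridge the multilingual gap is to reason in a high-resource language and translate back. The sketch below shows that generic pipeline with placeholder translate() and solve() calls; it is not asserted to match the paper's specific recipe.

```python
# Generic translate-then-solve pipeline: translate the problem into English,
# solve it there, then return the answer in the source language.

from typing import Callable

def translate_then_solve(
    problem: str,
    source_lang: str,
    translate: Callable[[str, str, str], str],  # (text, from_lang, to_lang) -> text
    solve: Callable[[str], str],                # English problem -> English answer
) -> str:
    english_problem = translate(problem, source_lang, "en")
    english_answer = solve(english_problem)
    return translate(english_answer, "en", source_lang)
```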

FastMCTS: A Simple Sampling Strategy for Data Synthesis

P Li, K Lv, Y Shao, Y Ma, L Li, X Zheng, X Qiu… - arXiv preprint arXiv …, 2025 - arxiv.org
Synthetic high-quality multi-step reasoning data can significantly enhance the performance
of large language models on various tasks. However, most existing methods rely on …
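
MCTS-style data synthesis expands promising partial reasoning paths first instead of sampling all paths uniformly. The snippet below shows the textbook UCB1 selection rule that such methods typically build on; it is background, not the paper's specific sampling strategy.

```python
# UCB1-style node selection for a reasoning tree: pick the child that maximizes
# mean reward plus an exploration bonus, trying unvisited children first.

import math
from dataclasses import dataclass, field

@dataclass
class Node:
    visits: int = 0
    total_reward: float = 0.0
    children: list["Node"] = field(default_factory=list)

def uct_select(parent: Node, c: float = 1.41) -> Node:
    def uct(child: Node) -> float:
        if child.visits == 0:
            return float("inf")  # always try unvisited branches first
        exploit = child.total_reward / child.visits
        explore = c * math.sqrt(math.log(parent.visits) / child.visits)
        return exploit + explore
    return max(parent.children, key=uct)
```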