From Generation to Judgment: Opportunities and Challenges of LLM-as-a-Judge

D Li, B Jiang, L Huang, A Beigi, C Zhao, Z Tan… - arXiv preprint arXiv …, 2024 - arxiv.org
Assessment and evaluation have long been critical challenges in artificial intelligence (AI)
and natural language processing (NLP). However, traditional methods, whether matching …

Towards Large Reasoning Models: A Survey of Reinforced Reasoning with Large Language Models

F Xu, Q Hao, Z Zong, J Wang, Y Zhang, J Wang… - arXiv preprint arXiv …, 2025 - arxiv.org
Language has long been conceived as an essential tool for human reasoning. The
breakthrough of Large Language Models (LLMs) has sparked significant research interest in …

Scaling of Search and Learning: A Roadmap to Reproduce o1 from a Reinforcement Learning Perspective

Z Zeng, Q Cheng, Z Yin, B Wang, S Li, Y Zhou… - arXiv preprint arXiv …, 2024 - arxiv.org
OpenAI o1 represents a significant milestone in Artificial Intelligence, which achieves expert-
level performances on many challenging tasks that require strong reasoning ability. OpenAI …

Process reinforcement through implicit rewards

G Cui, L Yuan, Z Wang, H Wang, W Li, B He… - arXiv preprint arXiv …, 2025 - arxiv.org
Dense process rewards have proven a more effective alternative to the sparse outcome-
level rewards in the inference-time scaling of large language models (LLMs), particularly in …

AceMath: Advancing Frontier Math Reasoning with Post-Training and Reward Modeling

Z Liu, Y Chen, M Shoeybi, B Catanzaro… - arXiv preprint arXiv …, 2024 - arxiv.org
In this paper, we introduce AceMath, a suite of frontier math models that excel in solving
complex math problems, along with highly effective reward models capable of evaluating …

Enhancing LLM Reasoning via Critique Models with Test-Time and Training-Time Supervision

Z Xi, D Yang, J Huang, J Tang, G Li, Y Ding… - arXiv preprint arXiv …, 2024 - arxiv.org
Training large language models (LLMs) to spend more time thinking and reflecting before
responding is crucial for effectively solving complex reasoning tasks in fields such as …

Towards System 2 Reasoning in LLMs: Learning How to Think With Meta Chain-of-Thought

V Xiang, C Snell, K Gandhi, A Albalak, A Singh… - arXiv preprint arXiv …, 2025 - arxiv.org
We propose a novel framework, Meta Chain-of-Thought (Meta-CoT), which extends
traditional Chain-of-Thought (CoT) by explicitly modeling the underlying reasoning required …

RAG-Star: Enhancing Deliberative Reasoning with Retrieval Augmented Verification and Refinement

J Jiang, J Chen, J Li, R Ren, S Wang, WX Zhao… - arXiv preprint arXiv …, 2024 - arxiv.org
Existing large language models (LLMs) show exceptional problem-solving capabilities but
might struggle with complex reasoning tasks. Despite the successes of chain-of-thought and …

Progressive Multimodal Reasoning via Active Retrieval

G Dong, C Zhang, M Deng, Y Zhu, Z Dou… - arXiv preprint arXiv …, 2024 - arxiv.org
Multi-step multimodal reasoning tasks pose significant challenges for multimodal large
language models (MLLMs), and finding effective ways to enhance their performance in such …

S²R: Teaching LLMs to Self-verify and Self-correct via Reinforcement Learning

R Ma, P Wang, C Liu, X Liu, J Chen, B Zhang… - arXiv preprint arXiv …, 2025 - arxiv.org
Recent studies have demonstrated the effectiveness of LLM test-time scaling. However,
existing approaches to incentivize LLMs' deep thinking abilities generally require large …