Reinforcement Learning Enhanced LLMs: A Survey

S Wang, S Zhang, J Zhang, R Hu, X Li, T Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
This paper surveys research in the rapidly growing field of enhancing large language
models (LLMs) with reinforcement learning (RL), a technique that enables LLMs to improve …

Towards a unified view of preference learning for large language models: A survey

B Gao, F Song, Y Miao, Z Cai, Z Yang, L Chen… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Language Models (LLMs) exhibit remarkably powerful capabilities. One crucial factor
in this success is aligning the LLM's output with human preferences. This …

AceMath: Advancing frontier math reasoning with post-training and reward modeling

Z Liu, Y Chen, M Shoeybi, B Catanzaro… - arXiv preprint arXiv …, 2024 - arxiv.org
In this paper, we introduce AceMath, a suite of frontier math models that excel in solving
complex math problems, along with highly effective reward models capable of evaluating …

Free process rewards without process labels

L Yuan, W Li, H Chen, G Cui, N Ding, K Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
Unlike its counterpart, the outcome reward model (ORM), which evaluates an entire
response, a process reward model (PRM) scores a reasoning trajectory step by step …
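
The ORM-versus-PRM contrast in this snippet can be made concrete with a short sketch. The scorer below is a hypothetical stand-in for a learned reward model (none of these names come from the paper): an ORM returns a single score for the whole response, while a PRM returns one score per reasoning step.

```python
# Minimal sketch of outcome vs. process reward scoring (illustrative only;
# score_fn is a toy stand-in for a learned reward model).
from typing import Callable, List

def outcome_reward(response: str, score_fn: Callable[[str], float]) -> float:
    """An ORM assigns one score to the entire response."""
    return score_fn(response)

def process_reward(steps: List[str], score_fn: Callable[[str], float]) -> List[float]:
    """A PRM assigns one score per step, here scoring each growing prefix of the trajectory."""
    return [score_fn("\n".join(steps[:i])) for i in range(1, len(steps) + 1)]

if __name__ == "__main__":
    # Toy scorer: rewards reaching the correct result; a real RM is a trained model.
    toy_score = lambda text: 1.0 if "x = 4" in text else 0.0
    steps = ["Let x be the unknown.", "2x + 1 = 9, so 2x = 8.", "Therefore x = 4."]
    print(outcome_reward("\n".join(steps), toy_score))  # one score for the whole answer
    print(process_reward(steps, toy_score))             # one score per step: [0.0, 0.0, 1.0]
```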

JuStRank: Benchmarking LLM Judges for System Ranking

A Gera, O Boni, Y Perlitz, R Bar-Haim, L Eden… - arXiv preprint arXiv …, 2024 - arxiv.org
Given the rapid progress of generative AI, there is a pressing need to systematically
compare and choose between the numerous models and configurations available. The …

An Overview and Discussion on Using Large Language Models for Implementation Generation of Solutions to Open-Ended Problems

H Shaik, A Doboli - arXiv preprint arXiv:2501.00562, 2024 - arxiv.org
Large Language Models offer new opportunities to devise automated implementation
generation methods that can tackle problem-solving activities beyond traditional methods …

Sailor2: Sailing in South-East Asia with Inclusive Multilingual LLMs

L Dou, Q Liu, F Zhou, C Chen, Z Wang, Z **… - arXiv preprint arXiv …, 2025 - arxiv.org
Sailor2 is a family of cutting-edge multilingual language models for South-East Asian (SEA)
languages, available in 1B, 8B, and 20B sizes to suit diverse applications. Building on …

InternLM-XComposer2.5-Reward: A Simple Yet Effective Multi-Modal Reward Model

Y Zang, X Dong, P Zhang, Y Cao, Z Liu, S Ding… - arXiv preprint arXiv …, 2025 - arxiv.org
Despite the promising performance of Large Vision Language Models (LVLMs) in visual
understanding, they occasionally generate incorrect outputs. While reward models (RMs) …

Less is More: Improving LLM Alignment via Preference Data Selection

X Deng, H Zhong, R Ai, F Feng, Z Wang… - arXiv preprint arXiv …, 2025 - arxiv.org
Direct Preference Optimization (DPO) has emerged as a promising approach for aligning
large language models with human preferences. While prior work mainly extends DPO from …
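
As background for the approach in this snippet, the sketch below shows the standard DPO objective (Rafailov et al., 2023) for a single preference pair; it illustrates only the underlying loss, not the preference data selection method proposed in this paper, and the log-probability inputs are assumed to be precomputed elsewhere.

```python
# Minimal sketch of the standard DPO loss for one (chosen, rejected) pair.
# Inputs are assumed to be precomputed log-probabilities under the policy and a frozen reference model.
import math

def dpo_loss(policy_logp_chosen: float, policy_logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """-log sigmoid(beta * margin), where the margin compares the policy-vs-reference
    log-ratios of the chosen and rejected responses."""
    margin = beta * ((policy_logp_chosen - ref_logp_chosen)
                     - (policy_logp_rejected - ref_logp_rejected))
    return math.log1p(math.exp(-margin))  # equals -log(sigmoid(margin))

# Toy example: the policy favors the chosen response more than the reference does,
# so the margin is positive and the loss falls below log(2) ~= 0.693.
print(dpo_loss(policy_logp_chosen=-10.0, policy_logp_rejected=-14.0,
               ref_logp_chosen=-12.0, ref_logp_rejected=-13.0))
```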

Policy-to-Language: Train LLMs to Explain Decisions with Flow-Matching Generated Rewards

X Yang, L Zeng, H Dong, C Yu, X Wu, H Yang… - arXiv preprint arXiv …, 2025 - arxiv.org
As humans increasingly share environments with diverse agents powered by RL, LLMs, and
beyond, the ability to explain their policies in natural language will be vital for reliable …