Building math agents with multi-turn iterative preference learning

W Xiong, C Shi, J Shen, A Rosenberg, Z Qin… - arxiv preprint arxiv …, 2024 - arxiv.org
Recent studies have shown that large language models' (LLMs) mathematical problem-
solving capabilities can be enhanced by integrating external tools, such as code …

Improve vision language model chain-of-thought reasoning

R Zhang, B Zhang, Y Li, H Zhang, Z Sun, Z Gan… - arxiv preprint arxiv …, 2024 - arxiv.org
Chain-of-thought (CoT) reasoning in vision language models (VLMs) is crucial for improving
interpretability and trustworthiness. However, current training recipes lack robust CoT …

Improving Multi-Step Reasoning Abilities of Large Language Models with Direct Advantage Policy Optimization

J Liu, C Wang, CY Liu, L Zeng, R Yan, Y Sun… - arxiv preprint arxiv …, 2024 - arxiv.org
The role of reinforcement learning (RL) in enhancing the reasoning of large language
models (LLMs) is becoming increasingly significant. Despite the success of RL in many …

Can We Generate Images with CoT? Let's Verify and Reinforce Image Generation Step by Step

Z Guo, R Zhang, C Tong, Z Zhao, P Gao, H Li… - arxiv preprint arxiv …, 2025 - arxiv.org
Chain-of-Thought (CoT) reasoning has been extensively explored in large models to tackle
complex understanding tasks. However, it still remains an open question whether such …

Autoregressive Multi-trait Essay Scoring via Reinforcement Learning with Scoring-aware Multiple Rewards

H Do, S Ryu, GG Lee - arxiv preprint arxiv:2409.17472, 2024 - arxiv.org
Recent advances in automated essay scoring (AES) have shifted towards evaluating
multiple traits to provide enriched feedback. Like typical AES systems, multi-trait AES …

Mars-PO: Multi-Agent Reasoning System Preference Optimization

X Lou, C Wang, B An - arxiv preprint arxiv:2411.19039, 2024 - arxiv.org
Mathematical reasoning is a fundamental capability for large language models (LLMs), yet
achieving high performance in this domain remains a significant challenge. The auto …