Training language models to self-correct via reinforcement learning

A Kumar, V Zhuang, R Agarwal, Y Su… - arXiv preprint arXiv …, 2024 - arxiv.org
Self-correction is a highly desirable capability of large language models (LLMs), yet it has
consistently been found to be largely ineffective in modern LLMs. Current methods for …

SimPO: Simple preference optimization with a reference-free reward

Y Meng, M Xia, D Chen - arXiv preprint arXiv:2405.14734, 2024 - arxiv.org
Direct Preference Optimization (DPO) is a widely used offline preference optimization
algorithm that reparameterizes reward functions in reinforcement learning from human …

Rewarding progress: Scaling automated process verifiers for LLM reasoning

A Setlur, C Nagpal, A Fisch, X Geng… - arXiv preprint arXiv …, 2024 - arxiv.org
A promising approach for improving reasoning in large language models is to use process
reward models (PRMs). PRMs provide feedback at each step of a multi-step reasoning trace …

Training large language models to reason in a continuous latent space

S Hao, S Sukhbaatar, DJ Su, X Li, Z Hu… - arXiv preprint arXiv …, 2024 - arxiv.org
Large language models (LLMs) are restricted to reason in the "language space", where they
typically express the reasoning process with a chain-of-thought (CoT) to solve a complex …

Iterative reasoning preference optimization

RY Pang, W Yuan, K Cho, H He, S Sukhbaatar… - arXiv preprint arXiv …, 2024 - arxiv.org
Iterative preference optimization methods have recently been shown to perform well for
general instruction tuning tasks, but typically make little improvement on reasoning tasks …

Large language models assume people are more rational than we really are

R Liu, J Geng, JC Peterson, I Sucholutsky… - arXiv preprint arXiv …, 2024 - arxiv.org
In order for AI systems to communicate effectively with people, they must understand how we
make decisions. However, people's decisions are not always rational, so the implicit internal …

TreeBoN: Enhancing inference-time alignment with speculative tree-search and best-of-N sampling

J Qiu, Y Lu, Y Zeng, J Guo, J Geng, H Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
Inference-time alignment enhances the performance of large language models without
requiring additional training or fine-tuning but presents challenges due to balancing …

Critic-CoT: Boosting the reasoning abilities of large language model via chain-of-thoughts critic

X Zheng, J Lou, B Cao, X Wen, Y Ji, H Lin, Y Lu… - arXiv preprint arXiv …, 2024 - arxiv.org
Self-critique has become a crucial mechanism for enhancing the reasoning performance of
LLMs. However, current approaches mainly involve basic prompts for intuitive instance-level …

From lists to emojis: How format bias affects model alignment

X Zhang, W Xiong, L Chen, T Zhou, H Huang… - arXiv preprint arXiv …, 2024 - arxiv.org
In this paper, we study format biases in reinforcement learning from human feedback
(RLHF). We observe that many widely-used preference models, including human …

Scaling of Search and Learning: A Roadmap to Reproduce o1 from Reinforcement Learning Perspective

Z Zeng, Q Cheng, Z Yin, B Wang, S Li, Y Zhou… - arXiv preprint arXiv …, 2024 - arxiv.org
OpenAI o1 represents a significant milestone in Artificial Intelligence, achieving expert-
level performance on many challenging tasks that require strong reasoning ability. OpenAI …