Training language models to self-correct via reinforcement learning
Self-correction is a highly desirable capability of large language models (LLMs), yet it has
consistently been found to be largely ineffective in modern LLMs. Current methods for …
SimPO: Simple preference optimization with a reference-free reward
Direct Preference Optimization (DPO) is a widely used offline preference optimization
algorithm that reparameterizes reward functions in reinforcement learning from human …
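For context on the reparameterization this snippet refers to: DPO treats the policy's log-likelihood ratio against a frozen reference model as an implicit reward, while SimPO replaces it with a length-normalized, reference-free reward. A brief sketch in the papers' standard notation (β is the reward scale, γ SimPO's target margin, σ the logistic function):

```latex
% DPO's implicit reward, reparameterized from the RLHF objective
% (\pi_{\mathrm{ref}} is the frozen reference policy):
r_{\mathrm{DPO}}(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}

% SimPO's reference-free reward: the average log-likelihood of the response,
% so no reference model is needed and response length is normalized away:
r_{\mathrm{SimPO}}(x, y) = \frac{\beta}{|y|} \log \pi_\theta(y \mid x)

% Both are trained with a Bradley--Terry-style pairwise loss on
% (chosen, rejected) pairs; SimPO adds a target margin \gamma:
\mathcal{L}_{\mathrm{SimPO}}
  = -\log \sigma\!\left( r_{\mathrm{SimPO}}(x, y_w) - r_{\mathrm{SimPO}}(x, y_l) - \gamma \right)
```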
Rewarding progress: Scaling automated process verifiers for LLM reasoning
A promising approach for improving reasoning in large language models is to use process
reward models (PRMs). PRMs provide feedback at each step of a multi-step reasoning trace …
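The snippet describes PRMs scoring each step of a reasoning trace. A minimal sketch of how such per-step scores are commonly aggregated to rerank whole traces (the `score_step` callable is a hypothetical stand-in for an actual PRM, and min-over-steps is one common aggregation choice, not necessarily this paper's):

```python
from typing import Callable, List

def score_trace(
    question: str,
    steps: List[str],
    score_step: Callable[[str, List[str], str], float],
) -> float:
    """Aggregate per-step PRM scores for one reasoning trace.

    `score_step(question, prefix_steps, step)` is a hypothetical PRM call
    returning the estimated probability that `step` is correct given the
    prefix.  Taking the minimum over steps is one common aggregation; the
    product or the last-step score are alternatives.
    """
    step_scores = [
        score_step(question, steps[:i], step) for i, step in enumerate(steps)
    ]
    return min(step_scores) if step_scores else 0.0

def rerank_traces(question, traces, score_step):
    """Best-of-n reranking: keep the trace whose weakest step scores highest."""
    return max(traces, key=lambda steps: score_trace(question, steps, score_step))
```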
Training large language models to reason in a continuous latent space
Large language models (LLMs) are restricted to reason in the "language space", where they
typically express the reasoning process with a chain-of-thought (CoT) to solve a complex …
Iterative reasoning preference optimization
Iterative preference optimization methods have recently been shown to perform well for
general instruction tuning tasks, but typically make little improvement on reasoning tasks …
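The snippet refers to iterating preference optimization on reasoning tasks. One way such an iteration is typically set up is to sample several chains-of-thought per problem and pair correct-answer traces against incorrect ones for the next preference-training round; a minimal sketch under that assumption (the `sample_cot` and `extract_answer` callables are hypothetical stand-ins, not this paper's exact recipe):

```python
import random
from typing import Callable, Dict, List, Tuple

def build_preference_pairs(
    problems: List[Dict],                  # each: {"question": str, "answer": str}
    sample_cot: Callable[[str], str],      # hypothetical: one sampled CoT per call
    extract_answer: Callable[[str], str],  # hypothetical: pulls the final answer from a CoT
    n_samples: int = 8,
    max_pairs_per_problem: int = 4,
) -> List[Tuple[str, str, str]]:
    """Build (question, chosen, rejected) triples for one preference-training round.

    A trace is 'chosen' if its extracted answer matches the reference answer
    and 'rejected' otherwise.  Training on these pairs (e.g. with DPO) and then
    regenerating with the updated model gives the iterative loop the snippet
    alludes to.
    """
    pairs = []
    for prob in problems:
        traces = [sample_cot(prob["question"]) for _ in range(n_samples)]
        correct = [t for t in traces if extract_answer(t) == prob["answer"]]
        wrong = [t for t in traces if extract_answer(t) != prob["answer"]]
        for _ in range(min(max_pairs_per_problem, len(correct), len(wrong))):
            pairs.append((prob["question"], random.choice(correct), random.choice(wrong)))
    return pairs
```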
Large language models assume people are more rational than we really are
In order for AI systems to communicate effectively with people, they must understand how we
make decisions. However, people's decisions are not always rational, so the implicit internal …
TreeBoN: Enhancing inference-time alignment with speculative tree-search and best-of-n sampling
Inference-time alignment enhances the performance of large language models without
requiring additional training or fine-tuning but presents challenges due to balancing …
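TreeBoN builds on best-of-n sampling, the standard inference-time alignment baseline. For reference, a sketch of that vanilla baseline only, not the speculative tree-search variant the paper proposes (`generate` and `reward` are hypothetical stand-ins for a sampler and a reward model):

```python
from typing import Callable, List

def best_of_n(
    prompt: str,
    generate: Callable[[str], str],       # hypothetical: one sampled completion per call
    reward: Callable[[str, str], float],  # hypothetical: reward-model score for (prompt, response)
    n: int = 16,
) -> str:
    """Vanilla best-of-n inference-time alignment: sample n candidates and
    return the one the reward model scores highest.  TreeBoN's contribution
    is approximating this more cheaply with speculative tree search; this
    sketch is only the baseline it compares against."""
    candidates: List[str] = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda resp: reward(prompt, resp))
```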
Critic-CoT: Boosting the reasoning abilities of large language model via chain-of-thoughts critic
Self-critic has become a crucial mechanism for enhancing the reasoning performance of
LLMs. However, current approaches mainly involve basic prompts for intuitive instance-level …
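The snippet contrasts Critic-CoT with basic instance-level critique prompts. A minimal sketch of that simple prompted baseline, a critique-then-refine loop (the `llm` callable and prompt wording are hypothetical; Critic-CoT itself goes beyond this by training step-wise CoT critics):

```python
from typing import Callable

def critique_and_refine(
    question: str,
    llm: Callable[[str], str],  # hypothetical: prompt in, completion out
    max_rounds: int = 2,
) -> str:
    """Basic prompted self-critique loop: draft an answer, ask the same model
    to critique it, then ask it to revise.  This is the instance-level
    baseline the snippet contrasts with step-wise CoT critics."""
    answer = llm(f"Solve the problem step by step.\n\nProblem: {question}")
    for _ in range(max_rounds):
        critique = llm(
            f"Problem: {question}\n\nProposed solution:\n{answer}\n\n"
            "Point out any incorrect steps, or reply 'No errors.'"
        )
        if "no errors" in critique.lower():
            break
        answer = llm(
            f"Problem: {question}\n\nPrevious solution:\n{answer}\n\n"
            f"Critique:\n{critique}\n\nWrite a corrected step-by-step solution."
        )
    return answer
```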
From lists to emojis: How format bias affects model alignment
In this paper, we study format biases in reinforcement learning from human feedback
(RLHF). We observe that many widely-used preference models, including human …
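The snippet reports that preference and reward models favor particular surface formats. One simple way such a bias can be probed is to score the same content rendered in two formats and compare; a sketch under that assumption (the `reward` callable and the list rendering are hypothetical, not the paper's measurement protocol):

```python
from typing import Callable, List

def as_bullets(sentences: List[str]) -> str:
    """Render the same content as a markdown bullet list."""
    return "\n".join(f"- {s}" for s in sentences)

def format_bias_gap(
    prompt: str,
    sentences: List[str],
    reward: Callable[[str, str], float],  # hypothetical reward-model scorer
) -> float:
    """Score identical content in two surface formats; a consistently positive
    gap suggests the reward model prefers lists regardless of substance."""
    plain = " ".join(sentences)
    return reward(prompt, as_bullets(sentences)) - reward(prompt, plain)
```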
Scaling of Search and Learning: A Roadmap to Reproduce o1 from Reinforcement Learning Perspective
OpenAI o1 represents a significant milestone in Artificial Intelligence, which achieves expert-
level performances on many challenging tasks that require strong reasoning ability. OpenAI …