KTO: Model alignment as prospect theoretic optimization
Kahneman & Tversky's $\textit{prospect theory}$ tells us that humans perceive random
variables in a biased but well-defined manner (1992); for example, humans are famously …
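Background sketch for the prospect theory referenced above (added for context; the symbols $\alpha$, $\beta$, $\lambda$ are the standard Tversky–Kahneman parameterization, not notation from this listing):

$$
v(z) =
\begin{cases}
z^{\alpha} & z \ge 0, \\
-\lambda\,(-z)^{\beta} & z < 0,
\end{cases}
\qquad \lambda > 1 .
$$

Tversky & Kahneman (1992) estimated $\alpha \approx \beta \approx 0.88$ and $\lambda \approx 2.25$; this loss-averse, diminishing-sensitivity shape is the "biased but well-defined" perception the abstract alludes to.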
RL on incorrect synthetic data scales the efficiency of LLM math reasoning by eight-fold
Training on model-generated synthetic data is a promising approach for finetuning LLMs,
but it remains unclear when it helps or hurts. In this paper, we investigate this question for …
Interpretable preferences via multi-objective reward modeling and mixture-of-experts
Reinforcement learning from human feedback (RLHF) has emerged as the primary method
for aligning large language models (LLMs) with human preferences. The RLHF process …
Self-play preference optimization for language model alignment
Standard reinforcement learning from human feedback (RLHF) approaches relying on
parametric models like the Bradley-Terry model fall short in capturing the intransitivity and …
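For reference, the Bradley-Terry model mentioned above scores each response with a scalar reward $r(x, y)$ and models pairwise preferences as follows (standard formulation, shown here for context rather than taken from the paper):

$$
\Pr(y_1 \succ y_2 \mid x)
= \frac{\exp r(x, y_1)}{\exp r(x, y_1) + \exp r(x, y_2)}
= \sigma\big(r(x, y_1) - r(x, y_2)\big).
$$

Because every response collapses to a single score, the induced preferences are always transitive, which is exactly the intransitivity limitation the abstract points to.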
Direct Nash optimization: Teaching language models to self-improve with general preferences
This paper studies post-training large language models (LLMs) using preference feedback
from a powerful oracle to help a model iteratively improve over itself. The typical approach …
A survey of reinforcement learning from human feedback
Reinforcement learning from human feedback (RLHF) is a variant of reinforcement learning
(RL) that learns from human feedback instead of relying on an engineered reward function …
REBEL: Reinforcement learning via regressing relative rewards
While originally developed for continuous control problems, Proximal Policy Optimization
(PPO) has emerged as the work-horse of a variety of reinforcement learning (RL) …
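For reference, the PPO objective mentioned above is the standard clipped surrogate (Schulman et al., 2017), shown here as background rather than as part of the REBEL method; $\rho_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)$ is the probability ratio, $\hat{A}_t$ an advantage estimate, and $\epsilon$ the clip radius:

$$
L^{\mathrm{CLIP}}(\theta)
= \mathbb{E}_t\Big[\min\big(\rho_t(\theta)\,\hat{A}_t,\;
\mathrm{clip}(\rho_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\,\hat{A}_t\big)\Big].
$$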
Arithmetic control of LLMs for diverse user preferences: Directional preference alignment with multi-objective rewards
Fine-grained control over large language models (LLMs) remains a significant challenge,
hindering their adaptability to diverse user needs. While Reinforcement Learning from …
Generalized preference optimization: A unified approach to offline alignment
Offline preference optimization allows fine-tuning large models directly from offline data, and
has proved effective in recent alignment practices. We propose generalized preference …
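As one concrete instance of the offline preference losses this line of work generalizes (the standard DPO loss, given here for context and not necessarily the paper's exact formulation), with chosen/rejected responses $y_w \succ y_l$, reference policy $\pi_{\mathrm{ref}}$, and temperature $\beta$:

$$
\mathcal{L}_{\mathrm{DPO}}(\theta)
= -\,\mathbb{E}_{(x,\, y_w,\, y_l)}\Big[
\log \sigma\Big(
\beta \log \tfrac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
- \beta \log \tfrac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
\Big)\Big].
$$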