KTO: Model alignment as prospect theoretic optimization

K Ethayarajh, W Xu, N Muennighoff, D Jurafsky… - arXiv preprint arXiv …, 2024 - arxiv.org
Kahneman & Tversky's $\textit{prospect theory}$ tells us that humans perceive random
variables in a biased but well-defined manner (1992); for example, humans are famously …

RL on incorrect synthetic data scales the efficiency of LLM math reasoning by eight-fold

A Setlur, S Garg, X Geng, N Garg… - Advances in Neural …, 2025 - proceedings.neurips.cc
Training on model-generated synthetic data is a promising approach for finetuning LLMs,
but it remains unclear when it helps or hurts. In this paper, we investigate this question for …

Interpretable preferences via multi-objective reward modeling and mixture-of-experts

H Wang, W Xiong, T Xie, H Zhao, T Zhang - arXiv preprint arXiv …, 2024 - arxiv.org
Reinforcement learning from human feedback (RLHF) has emerged as the primary method
for aligning large language models (LLMs) with human preferences. The RLHF process …

Self-play preference optimization for language model alignment

Y Wu, Z Sun, H Yuan, K Ji, Y Yang, Q Gu - arXiv preprint arXiv:2405.00675, 2024 - arxiv.org
Standard reinforcement learning from human feedback (RLHF) approaches relying on
parametric models like the Bradley-Terry model fall short in capturing the intransitivity and …

Direct Nash optimization: Teaching language models to self-improve with general preferences

C Rosset, CA Cheng, A Mitra, M Santacroce… - arXiv preprint arXiv …, 2024 - arxiv.org
This paper studies post-training large language models (LLMs) using preference feedback
from a powerful oracle to help a model iteratively improve over itself. The typical approach …

A survey of reinforcement learning from human feedback

T Kaufmann, P Weng, V Bengs… - arXiv preprint arXiv …, 2023 - researchgate.net
Reinforcement learning from human feedback (RLHF) is a variant of reinforcement learning
(RL) that learns from human feedback instead of relying on an engineered reward function …

REBEL: Reinforcement learning via regressing relative rewards

Z Gao, J Chang, W Zhan, O Oertell… - Advances in …, 2025 - proceedings.neurips.cc
While originally developed for continuous control problems, Proximal Policy Optimization
(PPO) has emerged as the work-horse of a variety of reinforcement learning (RL) …

Arithmetic control of LLMs for diverse user preferences: Directional preference alignment with multi-objective rewards

H Wang, Y Lin, W Xiong, R Yang, S Diao, S Qiu… - arXiv preprint arXiv …, 2024 - arxiv.org
Fine-grained control over large language models (LLMs) remains a significant challenge,
hindering their adaptability to diverse user needs. While Reinforcement Learning from …

Generalized preference optimization: A unified approach to offline alignment

Y Tang, ZD Guo, Z Zheng, D Calandriello… - arXiv preprint arXiv …, 2024 - arxiv.org
Offline preference optimization allows fine-tuning large models directly from offline data, and
has proved effective in recent alignment practices. We propose generalized preference …