Iterative reasoning preference optimization

RY Pang, W Yuan, H He, K Cho… - Advances in …, 2025 - proceedings.neurips.cc
Iterative preference optimization methods have recently been shown to perform well for
general instruction tuning tasks, but typically make little improvement on reasoning tasks. In …

Regularizing hidden states enables learning generalizable reward model for LLMs

R Yang, R Ding, Y Lin, H Zhang… - Advances in Neural …, 2025 - proceedings.neurips.cc
Reward models trained on human preference data have been proven to effectively align
Large Language Models (LLMs) with human intent within the framework of reinforcement …

RL on incorrect synthetic data scales the efficiency of LLM math reasoning by eight-fold

A Setlur, S Garg, X Geng, N Garg… - Advances in Neural …, 2025 - proceedings.neurips.cc
Training on model-generated synthetic data is a promising approach for finetuning LLMs,
but it remains unclear when it helps or hurts. In this paper, we investigate this question for …

Is DPO superior to PPO for LLM alignment? A comprehensive study

S Xu, W Fu, J Gao, W Ye, W Liu, Z Mei, G Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
Reinforcement Learning from Human Feedback (RLHF) is currently the most widely used
method to align large language models (LLMs) with human preferences. Existing RLHF …
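
For readers comparing the two methods named in the title, the standard objectives (stated here as general background, not as details of this paper's experiments) are the KL-regularized reward maximization that PPO optimizes and the direct preference loss that DPO minimizes:

$$
\max_{\pi_\theta}\ \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot\mid x)}\big[r(x,y)\big] \;-\; \beta\, \mathbb{D}_{\mathrm{KL}}\big[\pi_\theta(\cdot\mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot\mid x)\big],
$$

$$
\mathcal{L}_{\mathrm{DPO}}(\theta) \;=\; -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)} \;-\; \beta \log \frac{\pi_\theta(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)}\right)\right],
$$

where $y_w$ and $y_l$ are the preferred and dispreferred responses in a pair, $\pi_{\mathrm{ref}}$ is the reference policy, and $\sigma$ is the logistic function. PPO optimizes the first objective online against a learned reward model $r$, while DPO fits the second directly on preference data; the paper's question is which of these behaves better in practice.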

Self-play preference optimization for language model alignment

Y Wu, Z Sun, H Yuan, K Ji, Y Yang, Q Gu - arXiv preprint arXiv:2405.00675, 2024 - arxiv.org
Standard reinforcement learning from human feedback (RLHF) approaches relying on
parametric models like the Bradley-Terry model fall short in capturing the intransitivity and …
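
As background for the intransitivity point in the snippet, the Bradley-Terry model scores every pairwise comparison through a single scalar reward $r(y\mid x)$:

$$
P(y_1 \succ y_2 \mid x) \;=\; \frac{\exp\big(r(y_1\mid x)\big)}{\exp\big(r(y_1\mid x)\big) + \exp\big(r(y_2\mid x)\big)} \;=\; \sigma\big(r(y_1\mid x) - r(y_2\mid x)\big).
$$

Because every comparison is mediated by one scalar score per response, the induced preferences are necessarily transitive, which is the limitation the abstract alludes to; modeling general, possibly intransitive preferences requires working with the pairwise preference probability $P(y_1 \succ y_2 \mid x)$ directly.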

Magpie: Alignment data synthesis from scratch by prompting aligned LLMs with nothing

Z Xu, F Jiang, L Niu, Y Deng, R Poovendran… - arXiv preprint arXiv …, 2024 - arxiv.org
High-quality instruction data is critical for aligning large language models (LLMs). Although
some models, such as Llama-3-Instruct, have open weights, their alignment data remain …
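
The title hints at the mechanism: an instruct-tuned model given only its pre-query chat template will autocomplete a plausible user instruction, which the same model can then answer. A minimal sketch of that idea, assuming a hypothetical complete() text-completion function and Llama-3-style template strings (both are assumptions, not APIs or details taken from the paper):

```python
from typing import Callable, Dict

# Pre- and post-query chat template strings for a Llama-3-style instruct model;
# the exact special tokens are an assumption and depend on the model's tokenizer.
PRE_QUERY_TEMPLATE = "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
POST_QUERY_TEMPLATE = "<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"

def synthesize_pair(complete: Callable[[str], str]) -> Dict[str, str]:
    """Elicit one (instruction, response) pair from an aligned LLM with no seed data."""
    # Step 1: given only the pre-query template, the model autocompletes a user query.
    instruction = complete(PRE_QUERY_TEMPLATE).strip()
    # Step 2: wrap the generated query in the full chat template and let the model answer it.
    response = complete(PRE_QUERY_TEMPLATE + instruction + POST_QUERY_TEMPLATE).strip()
    return {"instruction": instruction, "response": response}
```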

BoNBoN alignment for large language models and the sweetness of best-of-n sampling

L Gui, C Gârbacea, V Veitch - Advances in Neural …, 2025 - proceedings.neurips.cc
This paper concerns the problem of aligning samples from large language models to human
preferences using *best-of-$n$* sampling, where we draw $n$ samples, rank them, and …
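
The best-of-$n$ procedure the snippet describes is easy to state in code. The sketch below assumes hypothetical generate and score callables standing in for a base LLM and a reward model; it illustrates the sampling scheme itself, not this paper's BoNBoN method:

```python
from typing import Callable, List

def best_of_n(
    prompt: str,
    generate: Callable[[str], str],      # draws one response from the base LLM
    score: Callable[[str, str], float],  # reward model: (prompt, response) -> scalar
    n: int = 8,
) -> str:
    """Draw n samples for the prompt and return the one the reward model ranks highest."""
    samples: List[str] = [generate(prompt) for _ in range(n)]
    # Rank by reward and keep the best; ties are broken arbitrarily by max().
    return max(samples, key=lambda response: score(prompt, response))
```

Because it requires no gradient updates to the base model, best-of-$n$ is often used as a strong inference-time baseline against which alignment methods are compared.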

Training language models to self-correct via reinforcement learning

A Kumar, V Zhuang, R Agarwal, Y Su… - arXiv preprint arXiv …, 2024 - arxiv.org
Self-correction is a highly desirable capability of large language models (LLMs), yet it has
consistently been found to be largely ineffective in modern LLMs. Current methods for …

Model alignment as prospect theoretic optimization

K Ethayarajh, W Xu, N Muennighoff… - … on Machine Learning, 2024 - openreview.net
Kahneman & Tversky's $\textit{prospect theory}$ tells us that humans perceive random
variables in a biased but well-defined manner (1992); for example, humans are famously …
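
The "biased but well-defined" perception referred to here is usually formalized by Tversky & Kahneman's (1992) value function, which is concave for gains, convex for losses, and steeper for losses (loss aversion):

$$
v(z) \;=\;
\begin{cases}
z^{\alpha} & \text{if } z \ge 0,\\[2pt]
-\lambda\,(-z)^{\beta} & \text{if } z < 0,
\end{cases}
\qquad 0 < \alpha, \beta \le 1,\ \ \lambda > 1,
$$

with fitted values of roughly $\alpha \approx \beta \approx 0.88$ and $\lambda \approx 2.25$ in the original study. The paper builds its alignment objective around this kind of asymmetric, reference-dependent value function rather than around a pairwise preference likelihood.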

Aligning to thousands of preferences via system message generalization

S Lee, SH Park, S Kim, M Seo - Advances in Neural …, 2025 - proceedings.neurips.cc
Although humans inherently have diverse values, current large language model (LLM)
alignment methods often assume that aligning LLMs with the general public's preferences is …