SimPO: Simple preference optimization with a reference-free reward

Y Meng, M Xia, D Chen - Advances in Neural Information …, 2025 - proceedings.neurips.cc
Direct Preference Optimization (DPO) is a widely used offline preference
optimization algorithm that reparameterizes reward functions in reinforcement learning from …

Interpretable preferences via multi-objective reward modeling and mixture-of-experts

H Wang, W Xiong, T Xie, H Zhao, T Zhang - arXiv preprint arXiv …, 2024 - arxiv.org
Reinforcement learning from human feedback (RLHF) has emerged as the primary method
for aligning large language models (LLMs) with human preferences. The RLHF process …

FLAME: Factuality-aware alignment for large language models

SC Lin, L Gao, B Oguz, W Xiong… - Advances in Neural …, 2025 - proceedings.neurips.cc
Alignment is a procedure to fine-tune pre-trained large language models (LLMs) to follow
natural language instructions and serve as helpful AI assistants. We have observed …

Length-controlled AlpacaEval: A simple way to debias automatic evaluators

Y Dubois, B Galambosi, P Liang… - arXiv preprint arXiv …, 2024 - arxiv.org
LLM-based auto-annotators have become a key component of the LLM development
process due to their cost-effectiveness and scalability compared to human-based …

Arithmetic control of LLMs for diverse user preferences: Directional preference alignment with multi-objective rewards

H Wang, Y Lin, W Xiong, R Yang, S Diao, S Qiu… - arXiv preprint arXiv …, 2024 - arxiv.org
Fine-grained control over large language models (LLMs) remains a significant challenge,
hindering their adaptability to diverse user needs. While Reinforcement Learning from …

Length-controlled AlpacaEval: A simple debiasing of automatic evaluators

Y Dubois, P Liang, T Hashimoto - First Conference on Language …, 2024 - openreview.net
LLM-based auto-annotators have become a key component of the LLM development
process due to their cost-effectiveness and scalability compared to human-based …

Uncertainty-aware reward model: Teaching reward models to know what is unknown

X Lou, D Yan, W Shen, Y Yan, J Xie… - arXiv preprint arXiv …, 2024 - arxiv.org
Reward models (RM) play a critical role in aligning generations of large language models
(LLM) to human expectations. However, prevailing RMs fail to capture the stochasticity …

InfoRM: Mitigating reward hacking in RLHF via information-theoretic reward modeling

Y Miao, S Zhang, L Ding, R Bao… - Advances in Neural …, 2025 - proceedings.neurips.cc
Despite the success of reinforcement learning from human feedback (RLHF) in aligning
language models with human values, reward hacking, also termed reward overoptimization …

Self-generated critiques boost reward modeling for language models

Y Yu, Z Chen, A Zhang, L Tan, C Zhu, RY Pang… - arXiv preprint arXiv …, 2024 - arxiv.org
Reward modeling is crucial for aligning large language models (LLMs) with human
preferences, especially in reinforcement learning from human feedback (RLHF). However …

On the algorithmic bias of aligning large language models with RLHF: Preference collapse and matching regularization

J Xiao, Z Li, X Xie, E Getzen, C Fang, Q Long… - arXiv preprint arXiv …, 2024 - arxiv.org
Accurately aligning large language models (LLMs) with human preferences is crucial for
informing fair, economically sound, and statistically efficient decision-making processes …