Provably mitigating overoptimization in RLHF: Your SFT loss is implicitly an adversarial regularizer

Z Liu, M Lu, S Zhang, B Liu, H Guo, Y Yang… - arxiv preprint arxiv …, 2024 - arxiv.org

Helping or herding? Reward model ensembles mitigate but do not eliminate reward hacking

J Eisenstein, C Nagpal, A Agarwal, A Beirami… - arxiv preprint arxiv …, 2023 - arxiv.org
Reward models play a key role in aligning language model applications towards human
preferences. However, this setup creates an incentive for the language model to exploit …
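
The snippet above gives the motivation; as a generic illustration of the ensemble idea (not necessarily the paper's exact aggregation protocol), a conservative combination such as the ensemble minimum or a mean-minus-std lower bound is one common way to make an ensemble of reward models harder for the policy to exploit. The reward callables and the weight beta below are hypothetical stand-ins.

```python
# Hypothetical sketch: conservative aggregation over a reward-model ensemble.
from typing import Callable, List

import numpy as np

RewardFn = Callable[[str, str], float]  # (prompt, response) -> scalar reward


def ensemble_reward(
    reward_models: List[RewardFn],
    prompt: str,
    response: str,
    mode: str = "min",
    beta: float = 1.0,
) -> float:
    """Score a response with an ensemble of reward models.

    mode="min"  : worst-case member (most pessimistic).
    mode="lcb"  : mean minus beta * std (uncertainty-penalized lower bound).
    mode="mean" : plain average (no pessimism, shown for comparison).
    """
    scores = np.array([rm(prompt, response) for rm in reward_models])
    if mode == "min":
        return float(scores.min())
    if mode == "lcb":
        return float(scores.mean() - beta * scores.std())
    return float(scores.mean())


# Usage with toy reward models (stand-ins for fine-tuned scorers):
rms = [lambda p, r, w=w: w * len(r) for w in (0.9, 1.0, 1.1)]
print(ensemble_reward(rms, "prompt", "some response", mode="lcb"))
```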

Regularizing hidden states enables learning generalizable reward model for LLMs

R Yang, R Ding, Y Lin, H Zhang, T Zhang - arxiv preprint arxiv …, 2024 - arxiv.org
Reward models trained on human preference data have been proven to effectively align
Large Language Models (LLMs) with human intent within the framework of reinforcement …
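
A minimal sketch of one way to read the title, assuming a shared causal-LM backbone with an added scalar reward head: train the reward head with a Bradley-Terry preference loss while an auxiliary language-modeling loss on the same hidden states keeps them close to the text-generation distribution. The module interfaces and the weight lambda_lm are assumptions for illustration, not the paper's exact objective.

```python
import torch
import torch.nn.functional as F


def regularized_reward_loss(backbone, reward_head, lm_head,
                            chosen_ids, rejected_ids, lambda_lm=0.1):
    # Hidden states from the shared backbone (assumed HF-style output).
    h_chosen = backbone(chosen_ids).last_hidden_state      # [B, T, d]
    h_rejected = backbone(rejected_ids).last_hidden_state  # [B, T, d]

    r_chosen = reward_head(h_chosen[:, -1]).squeeze(-1)    # [B]
    r_rejected = reward_head(h_rejected[:, -1]).squeeze(-1)

    # Bradley-Terry preference loss: chosen should outscore rejected.
    bt_loss = -F.logsigmoid(r_chosen - r_rejected).mean()

    # Auxiliary next-token loss on the chosen sequence keeps hidden states
    # compatible with the original LM head (the regularizer; padding masking
    # omitted in this sketch).
    logits = lm_head(h_chosen[:, :-1])                      # [B, T-1, V]
    lm_loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        chosen_ids[:, 1:].reshape(-1),
    )
    return bt_loss + lambda_lm * lm_loss
```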

Inverse-RLignment: Inverse reinforcement learning from demonstrations for LLM alignment

H Sun, M van der Schaar - arxiv preprint arxiv:2405.15624, 2024 - arxiv.org
Aligning Large Language Models (LLMs) is crucial for enhancing their safety and utility.
However, existing methods, primarily based on preference datasets, face challenges such …

Correcting the mythos of KL-regularization: Direct alignment without overoptimization via chi-squared preference optimization

A Huang, W Zhan, T Xie, JD Lee, W Sun… - arxiv preprint arxiv …, 2024 - arxiv.org
Language model alignment methods, such as reinforcement learning from human feedback
(RLHF), have led to impressive advances in language model capabilities, but existing …
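
For context, a DPO-style preference loss can be written with a pluggable link function applied to the policy-to-reference likelihood ratio: link(z) = log z recovers standard (KL-regularized) DPO, while a heavier-tailed link such as z + log z is in the spirit of the chi-squared regularization named in the title. The exact form used by chi-squared preference optimization should be taken from the paper itself; the code below is only a sketch.

```python
import torch
import torch.nn.functional as F


def preference_loss(logp_chosen, logp_rejected,
                    ref_logp_chosen, ref_logp_rejected,
                    beta=0.1, link="log"):
    # Likelihood ratios pi_theta / pi_ref for chosen and rejected responses.
    ratio_c = torch.exp(logp_chosen - ref_logp_chosen)
    ratio_r = torch.exp(logp_rejected - ref_logp_rejected)

    if link == "log":              # standard DPO implicit reward
        phi = torch.log
    else:                          # assumed chi-squared-style link: z + log z
        phi = lambda z: z + torch.log(z)

    margin = beta * (phi(ratio_c) - phi(ratio_r))
    return -F.logsigmoid(margin).mean()


# Toy usage with per-sequence log-probabilities (batch of 2):
lp_c = torch.tensor([-10.0, -12.0]); lp_r = torch.tensor([-11.0, -11.5])
rlp_c = torch.tensor([-10.5, -12.5]); rlp_r = torch.tensor([-10.8, -11.0])
print(preference_loss(lp_c, lp_r, rlp_c, rlp_r, link="chi2"))
```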

On Uncertainty In Natural Language Processing

D Ulmer - arxiv preprint arxiv:2410.03446, 2024 - arxiv.org
The last decade in deep learning has brought on increasingly capable systems that are
deployed on a wide variety of applications. In natural language processing, the field has …

Reusing Embeddings: Reproducible Reward Model Research in Large Language Model Alignment without GPUs

H Sun, Y Shen, JF Ton, M van der Schaar - arxiv preprint arxiv …, 2025 - arxiv.org
Large Language Models (LLMs) have made substantial strides in structured tasks through
Reinforcement Learning (RL), demonstrating proficiency in mathematical reasoning and …
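
A CPU-only sketch of the embedding-reuse idea, under the assumption of a linear reward head on cached response embeddings: the Bradley-Terry objective then reduces to logistic regression on embedding differences, so reward-model experiments need no GPU. The data here is synthetic and the shapes are assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_pairs, dim = 1000, 256                      # assumed sizes

# Stand-ins for embeddings precomputed once with a frozen LLM encoder.
emb_a = rng.normal(size=(n_pairs, dim))
emb_b = rng.normal(size=(n_pairs, dim))
w_true = rng.normal(size=dim)                 # hidden "true" preference direction
a_preferred = (emb_a - emb_b) @ w_true > 0
emb_chosen = np.where(a_preferred[:, None], emb_a, emb_b)
emb_rejected = np.where(a_preferred[:, None], emb_b, emb_a)

# Bradley-Terry with a linear head: P(chosen > rejected) = sigmoid(w . (e_c - e_r)),
# i.e. logistic regression on embedding differences (both orderings give 2 classes).
diffs = emb_chosen - emb_rejected
X = np.vstack([diffs, -diffs])
y = np.concatenate([np.ones(n_pairs), np.zeros(n_pairs)])
clf = LogisticRegression(fit_intercept=False, max_iter=1000).fit(X, y)

w = clf.coef_.ravel()                         # learned reward direction
print("pairwise accuracy:", float((diffs @ w > 0).mean()))
```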

Reviving The Classics: Active Reward Modeling in Large Language Model Alignment

Y Shen, H Sun, JF Ton - arxiv preprint arxiv:2502.04354, 2025 - arxiv.org
Building neural reward models from human preferences is a pivotal component in
reinforcement learning from human feedback (RLHF) and large language model alignment …
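
As a generic illustration of active data selection (a stand-in heuristic, not necessarily the selection rule studied in the paper above), one classical choice is to spend the annotation budget on the response pairs that a small reward-model ensemble disagrees about the most.

```python
import numpy as np


def select_pairs_by_disagreement(candidate_pairs, reward_models, budget=8):
    """candidate_pairs: list of (prompt, response_a, response_b).
    reward_models: list of callables (prompt, response) -> float.
    Returns the `budget` pairs whose score margin the ensemble disagrees
    about the most (highest variance), as candidates for human labeling."""
    margins = []
    for prompt, a, b in candidate_pairs:
        diffs = [rm(prompt, a) - rm(prompt, b) for rm in reward_models]
        margins.append(np.var(diffs))
    ranked = np.argsort(margins)[::-1][:budget]
    return [candidate_pairs[i] for i in ranked]
```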

Sail into the Headwind: Alignment via Robust Rewards and Dynamic Labels against Reward Hacking

P Rashidinejad, Y Tian - arxiv preprint arxiv:2412.09544, 2024 - arxiv.org
Aligning AI systems with human preferences typically suffers from the infamous reward
hacking problem, where optimization of an imperfect reward model leads to undesired …

The Energy Loss Phenomenon in RLHF: A New Perspective on Mitigating Reward Hacking

Y Miao, S Zhang, L Ding, Y Zhang, L Zhang… - arxiv preprint arxiv …, 2025 - arxiv.org
This work identifies the Energy Loss Phenomenon in Reinforcement Learning from Human
Feedback (RLHF) and its connection to reward hacking. Specifically, energy loss in the final …
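
Based only on the truncated snippet above, a hedged sketch of how a final-layer "energy" statistic might be monitored during RLHF: the norm choice, the comparison against a reference model, and the penalty coefficient are assumptions for illustration, not the paper's definitions.

```python
import torch


def final_layer_energy(hidden_states: torch.Tensor) -> torch.Tensor:
    """hidden_states: [batch, seq_len, d] from the model's last layer.
    Returns one scalar per sequence: mean L1 norm of token representations."""
    return hidden_states.abs().sum(dim=-1).mean(dim=-1)


def energy_penalty(policy_hidden, ref_hidden, coef=0.01):
    """Penalize drift of the policy's final-layer energy away from the
    reference model's on the same responses (an assumed regularizer that
    could be logged or added to the RLHF objective)."""
    gap = final_layer_energy(policy_hidden) - final_layer_energy(ref_hidden)
    return coef * gap.abs().mean()
```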