A minimaximalist approach to reinforcement learning from human feedback

G Swamy, C Dann, R Kidambi, ZS Wu… - arXiv preprint arXiv …, 2024 - arxiv.org
We present Self-Play Preference Optimization (SPO), an algorithm for reinforcement
learning from human feedback. Our approach is minimalist in that it does not require training …
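
As a rough illustration of the reward-model-free idea (a hedged sketch based on my reading of the abstract, not the authors' reference implementation): each completion can be scored by its preference win rate against other completions sampled from the same policy, and that win rate used as the reward for an ordinary RL update. The policy.generate and preference callables below are hypothetical placeholders.

    # Hedged sketch of a self-play win-rate reward (assumed reading of the
    # abstract; policy.generate() and preference() are hypothetical placeholders).
    def self_play_rewards(prompt, policy, preference, n_samples=8):
        """Score each sampled completion by its win rate against the others."""
        completions = [policy.generate(prompt) for _ in range(n_samples)]
        rewards = []
        for i, y in enumerate(completions):
            # preference(prompt, a, b) -> 1.0 if a is preferred over b, else 0.0
            wins = [preference(prompt, y, completions[j])
                    for j in range(n_samples) if j != i]
            rewards.append(sum(wins) / len(wins))
        return completions, rewards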

BOND: Aligning LLMs with best-of-N distillation

PG Sessa, R Dadashi, L Hussenot, J Ferret… - arXiv preprint arXiv …, 2024 - arxiv.org
Reinforcement learning from human feedback (RLHF) is a key driver of quality and safety in
state-of-the-art large language models. Yet, a surprisingly simple and strong inference-time …
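
The inference-time strategy alluded to here is best-of-N sampling, which BOND aims to distill into the policy itself. A minimal sketch of plain best-of-N follows; generate and reward_model are hypothetical placeholder callables, not an API from the paper.

    # Minimal best-of-N sampling sketch: draw N candidates, return the one the
    # reward model scores highest. generate() and reward_model() are placeholders.
    def best_of_n(prompt, generate, reward_model, n=16):
        candidates = [generate(prompt) for _ in range(n)]
        return max(candidates, key=lambda y: reward_model(prompt, y))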

RLHF workflow: From reward modeling to online RLHF

H Dong, W Xiong, B Pang, H Wang, H Zhao… - arXiv preprint arXiv …, 2024 - arxiv.org
We present the workflow of Online Iterative Reinforcement Learning from Human Feedback
(RLHF) in this technical report, which is widely reported to outperform its offline counterpart …
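
A hedged sketch of what an online iterative loop of this kind typically looks like (my paraphrase of the general recipe, not the report's exact procedure): repeatedly sample responses from the current policy, rank them with a reward model to form fresh preference pairs, and run a direct-alignment update. All callables below are hypothetical placeholders.

    # Hedged sketch of an online iterative preference-learning loop (assumed
    # structure; generate(), reward_model() and dpo_update() are placeholders).
    def online_iterative_rlhf(policy, reward_model, prompts, n_iters=3, k=8):
        for _ in range(n_iters):
            pairs = []
            for x in prompts:
                candidates = sorted((policy.generate(x) for _ in range(k)),
                                    key=lambda y: reward_model(x, y))
                # best vs. worst candidate forms one on-policy preference pair
                pairs.append((x, candidates[-1], candidates[0]))
            policy = policy.dpo_update(pairs)  # placeholder alignment step
        return policy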

Sharp analysis for KL-regularized contextual bandits and RLHF

H Zhao, C Ye, Q Gu, T Zhang - arXiv preprint arXiv:2411.04625, 2024 - arxiv.org
Reverse Kullback-Leibler (KL) regularization has emerged as a predominant technique
for enhancing policy optimization in reinforcement learning (RL) and reinforcement …
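
For reference, the reverse-KL-regularized objective this line of work studies, together with its well-known Gibbs-style maximizer (standard background with β the regularization strength and π_ref the reference policy; not a result specific to this paper):

    % KL-regularized objective and its closed-form solution (up to normalization).
    \max_{\pi}\;\mathbb{E}_{x\sim\rho}\Big[\mathbb{E}_{y\sim\pi(\cdot\mid x)}\big[r(x,y)\big]
      \;-\;\beta\,\mathrm{KL}\big(\pi(\cdot\mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot\mid x)\big)\Big],
    \qquad
    \pi^{\star}(y\mid x)\;\propto\;\pi_{\mathrm{ref}}(y\mid x)\,\exp\!\big(r(x,y)/\beta\big).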

Accelerating Goal-Conditioned RL Algorithms and Research

M Bortkiewicz, W Pałucki, V Myers, T Dziarmaga… - arXiv preprint arXiv …, 2024 - arxiv.org
Self-supervision has the potential to transform reinforcement learning (RL), paralleling the
breakthroughs it has enabled in other areas of machine learning. While self-supervised …

Jackpot! Alignment as a Maximal Lottery

RR Maura-Rivero, M Lanctot, F Visin… - arXiv preprint arXiv …, 2025 - arxiv.org
Reinforcement Learning from Human Feedback (RLHF), the standard for aligning Large
Language Models (LLMs) with human values, is known to fail to satisfy properties that are …
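
For background on the social-choice object in the title (the standard definition from probabilistic social choice, not a claim about the paper's construction): with pairwise preference margins M_{ij}, a maximal lottery is a distribution p over alternatives that beats or ties every other lottery in expectation, i.e. an optimal strategy of the symmetric zero-sum game with payoff matrix M.

    % Maximal lottery: p is maximal iff it weakly beats every lottery q under M.
    p^{\top} M q \;\ge\; 0 \quad \text{for all lotteries } q,
    \qquad M_{ij} \;=\; \Pr(i \succ j) - \Pr(j \succ i).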

Reward-Augmented Data Enhances Direct Preference Alignment of LLMs

S Zhang, Z Liu, B Liu, Y Zhang, Y Yang, Y Liu… - arXiv preprint arXiv …, 2024 - arxiv.org
Preference alignment in Large Language Models (LLMs) has significantly improved their
ability to adhere to human instructions and intentions. However, existing direct alignment …
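
For context, the standard DPO objective that direct-alignment methods start from (background only; the paper's reward-augmented data construction is not shown here). The log-probability arguments are assumed to be tensors of per-response sums of token log-likelihoods.

    # Standard DPO loss for background (not the paper's reward-augmented variant).
    import torch.nn.functional as F

    def dpo_loss(policy_chosen_logps, policy_rejected_logps,
                 ref_chosen_logps, ref_rejected_logps, beta=0.1):
        # Implicit rewards are beta-scaled log-ratios against the reference policy.
        chosen = beta * (policy_chosen_logps - ref_chosen_logps)
        rejected = beta * (policy_rejected_logps - ref_rejected_logps)
        # Logistic loss on the margin between preferred and dispreferred responses.
        return -F.logsigmoid(chosen - rejected).mean()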

Learning from Human Feedback: Ranking, Bandit, and Preference Optimization

Y Wu - 2024 - search.proquest.com
This dissertation investigates several challenges in artificial intelligence (AI) alignment and
reinforcement learning (RL), particularly focusing on applications when only preference …
