Google Scholar

Self-play preference optimization for language model alignment

Y Wu, Z Sun, H Yuan, K Ji, Y Yang, Q Gu - arXiv preprint arXiv:2405.00675, 2024 - arxiv.org

Traditional reinforcement learning from human feedback (RLHF) approaches relying on
parametric models like the Bradley-Terry model fall short in capturing the intransitivity and …

Cited by 65 - All 2 versions

Variance-aware regret bounds for stochastic contextual dueling bandits

Q Di, T Jin, Y Wu, H Zhao, F Farnoud, Q Gu - arXiv preprint arXiv …, 2023 - arxiv.org

Dueling bandits is a prominent framework for decision-making involving preferential
feedback, a valuable feature that fits various applications involving human interaction, such …

Cited by 11 - All 4 versions

Contextual bandits and imitation learning with preference-based active queries

A Sekhari, K Sridharan, W Sun… - Advances in Neural …, 2024 - proceedings.neurips.cc

We consider the problem of contextual bandits and imitation learning, where the learner
lacks direct knowledge of the executed action's reward. Instead, the learner can actively …

Cited by 6 - All 6 versions

Reinforcement learning from human feedback with active queries

K Ji, J He, Q Gu - arXiv preprint arXiv:2402.09401, 2024 - arxiv.org

Aligning large language models (LLM) with human preference plays a key role in building
modern generative models and can be achieved by reinforcement learning from human …

Cited by 15 - All 2 versions

Nearly optimal algorithms for contextual dueling bandits from adversarial feedback

Q Di, J He, Q Gu - arXiv preprint arXiv:2404.10776, 2024 - arxiv.org

Learning from human feedback plays an important role in aligning generative models, such
as large language models (LLM). However, the effectiveness of this approach can be …

Cited by 1 - All 2 versions

Sharp Analysis for KL-Regularized Contextual Bandits and RLHF

H Zhao, C Ye, Q Gu, T Zhang - arXiv preprint arXiv:2411.04625, 2024 - arxiv.org

Reverse-Kullback-Leibler (KL) regularization has emerged to be a predominant technique
used to enhance policy optimization in reinforcement learning (RL) and reinforcement …

Cited by 1 - All 2 versions

Feel-Good Thompson Sampling for Contextual Dueling Bandits

X Li, H Zhao, Q Gu - arXiv preprint arXiv:2404.06013, 2024 - arxiv.org

Contextual dueling bandits, where a learner compares two options based on context and
receives feedback indicating which was preferred, extends classic dueling bandits by …

Cited by 5 - All 3 versions

Constrained Dueling Bandits for Edge Intelligence

S Wang, Z Shao, Y Yang - IEEE Transactions on Network …, 2024 - ieeexplore.ieee.org

Bandit is acknowledged as a classical analytic tool for the online decision-making problem
under uncertainty, e.g., task assignment for crowdsourcing systems given the unknown …

[PDF] escholarship.org

Learning from Human Feedback: Ranking, Bandit, and Preference Optimization

Y Wu - 2024 - search.proquest.com

This dissertation investigates several challenges in artificial intelligence (AI) alignment and
reinforcement learning (RL), particularly focusing on applications when only preference …

All 2 versions

Saved to library

Borda regret minimization for generalized linear dueling bandits

Self-play preference optimization for language model alignment

Variance-aware regret bounds for stochastic contextual dueling bandits

Contextual bandits and imitation learning with preference-based active queries

Reinforcement learning from human feedback with active queries

Nearly optimal algorithms for contextual dueling bandits from adversarial feedback

Sharp Analysis for KL-Regularized Contextual Bandits and RLHF

Feel-Good Thompson Sampling for Contextual Dueling Bandits

Constrained Dueling Bandits for Edge Intelligence

Learning from Human Feedback: Ranking, Bandit, and Preference Optimization