A survey of reinforcement learning from human feedback

T Kaufmann, P Weng, V Bengs… - arXiv preprint arXiv …, 2023 - arxiv.org
Reinforcement learning from human feedback (RLHF) is a variant of reinforcement learning
(RL) that learns from human feedback instead of relying on an engineered reward function …
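
For concreteness, a minimal sketch of the mechanism the survey is about: fitting a reward model from pairwise human preferences under a Bradley-Terry assumption and using it in place of an engineered reward. The linear feature model, constants, and synthetic data below are illustrative assumptions, not anything taken from the survey itself.

    # Illustrative sketch (not the survey's algorithm): learn a linear reward
    # model r(x) = w . phi(x) from pairwise preferences, assuming a
    # Bradley-Terry link P(a preferred to b) = sigmoid(r(a) - r(b)),
    # by gradient ascent on the log-likelihood. Data here is synthetic.
    import numpy as np

    rng = np.random.default_rng(0)
    d, n_pairs = 5, 200
    w_true = rng.normal(size=d)                        # hidden "human" reward weights

    feats_a = rng.normal(size=(n_pairs, d))            # features of option a in each pair
    feats_b = rng.normal(size=(n_pairs, d))            # features of option b
    p_a = 1.0 / (1.0 + np.exp(-(feats_a - feats_b) @ w_true))
    prefs = (rng.random(n_pairs) < p_a).astype(float)  # 1 if a was preferred, else 0

    w = np.zeros(d)
    for _ in range(500):
        p = 1.0 / (1.0 + np.exp(-(feats_a - feats_b) @ w))       # model's P(a preferred)
        w += 0.1 * (feats_a - feats_b).T @ (prefs - p) / n_pairs  # ascend the log-likelihood

    print("cosine(w, w_true) =",
          w @ w_true / (np.linalg.norm(w) * np.linalg.norm(w_true)))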

Making RL with preference-based feedback efficient via randomization

R Wu, W Sun - arXiv preprint arXiv:2310.14554, 2023 - arxiv.org
Reinforcement Learning algorithms that learn from human feedback (RLHF) need to be
efficient in terms of statistical complexity, computational complexity, and query complexity. In …

Efficient algorithms for generalized linear bandits with heavy-tailed rewards

B Xue, Y Wang, Y Wan, J Yi… - Advances in Neural …, 2024 - proceedings.neurips.cc
This paper investigates the problem of generalized linear bandits with heavy-tailed rewards,
whose $(1+\epsilon)$-th moment is bounded for some $\epsilon \in (0, 1]$. Although there …
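
A standard device in the heavy-tailed bandit literature, shown here only to make the moment condition concrete (this is not necessarily the estimator proposed in the paper), is to truncate rewards at a level that grows with the sample count; the distribution and constants below are made up.

    # Truncated-mean sketch: when only the (1+eps)-th moment is bounded, the
    # plain empirical mean is fragile, while averaging rewards clipped at a
    # level growing like n**(1/(1+eps)) typically deviates far less in the
    # worst case over repeated trials.
    import numpy as np

    rng = np.random.default_rng(1)
    eps, n, trials = 0.5, 2000, 500
    a = 1.0 + eps + 0.05                     # Lomax shape: mean finite, variance infinite
    true_mean = 1.0 / (a - 1.0)
    threshold = n ** (1.0 / (1.0 + eps))     # truncation level from the moment assumption

    plain_err, trunc_err = [], []
    for _ in range(trials):
        x = rng.pareto(a, size=n)            # heavy-tailed rewards
        plain_err.append(abs(x.mean() - true_mean))
        trunc_err.append(abs(np.minimum(x, threshold).mean() - true_mean))

    print("worst-case error, plain mean    :", max(plain_err))
    print("worst-case error, truncated mean:", max(trunc_err))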

Provable benefits of policy learning from human preferences in contextual bandit problems

X Ji, H Wang, M Chen, T Zhao, M Wang - arXiv preprint arXiv:2307.12975, 2023 - arxiv.org
A crucial task in decision-making problems is reward engineering. It is common in practice
that no obvious choice of reward function exists. Thus, a popular approach is to introduce …

Variance-aware regret bounds for stochastic contextual dueling bandits

Q Di, T Jin, Y Wu, H Zhao, F Farnoud, Q Gu - arXiv preprint arXiv …, 2023 - arxiv.org
Dueling bandits is a prominent framework for decision-making involving preferential
feedback, a valuable feature that fits various applications involving human interaction, such …
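
To make the feedback model concrete, a generic dueling-bandit interaction loop (an illustrative epsilon-greedy sketch, not the variance-aware algorithm studied in this paper): the learner picks a pair of arms and observes only a single binary preference generated from hidden arm utilities.

    # Generic dueling-bandit round: choose a pair (i, j), observe a noisy
    # preference from a Bradley-Terry model over hidden utilities, never a
    # numeric reward. Exploration scheme and constants are made up.
    import numpy as np

    rng = np.random.default_rng(2)
    K, T = 5, 5000
    utility = rng.normal(size=K)                 # hidden arm qualities

    wins = np.ones((K, K))                       # smoothed pairwise win counts
    plays = 2 * np.ones((K, K))

    for t in range(T):
        if rng.random() < 0.1:                   # crude exploration
            i, j = rng.choice(K, size=2, replace=False)
        else:                                    # exploit: duel the two best empirical arms
            score = (wins / plays).mean(axis=1)  # rough "how often each arm wins"
            i, j = np.argsort(score)[-2:]
        p_i_beats_j = 1.0 / (1.0 + np.exp(-(utility[i] - utility[j])))
        if rng.random() < p_i_beats_j:           # observe only the preference
            wins[i, j] += 1
        else:
            wins[j, i] += 1
        plays[i, j] += 1
        plays[j, i] += 1

    print("best arm (true):          ", int(np.argmax(utility)))
    print("best arm (empirical wins):", int(np.argmax((wins / plays).mean(axis=1))))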

Contextual bandits and imitation learning with preference-based active queries

A Sekhari, K Sridharan, W Sun… - Advances in Neural …, 2024 - proceedings.neurips.cc
We consider the problem of contextual bandits and imitation learning, where the learner
lacks direct knowledge of the executed action's reward. Instead, the learner can actively …
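
A hedged sketch of the active-query idea (a generic uncertainty heuristic, not the algorithm analyzed in this paper): the learner asks for a pairwise preference only when its current model cannot confidently separate the two candidate actions, so queries are spent where they are informative. Features, threshold, and step size are illustrative assumptions.

    # Query a preference only when the estimated reward gap between the two
    # candidate actions is small; otherwise act without querying.
    import numpy as np

    rng = np.random.default_rng(3)
    d, T, margin = 4, 3000, 0.2
    w_true = rng.normal(size=d)                  # hidden preference model
    w_hat = np.zeros(d)                          # learner's running estimate
    queries = 0

    for t in range(T):
        phi = rng.normal(size=(2, d))            # features of two candidate actions
        gap = (phi[0] - phi[1]) @ w_hat
        if abs(gap) >= margin:                   # confident: skip the query
            continue
        queries += 1                             # uncertain: ask for a preference
        p0 = 1.0 / (1.0 + np.exp(-(phi[0] - phi[1]) @ w_true))
        y = 1.0 if rng.random() < p0 else 0.0    # 1 means action 0 was preferred
        p_hat = 1.0 / (1.0 + np.exp(-gap))
        w_hat += 0.5 * (y - p_hat) * (phi[0] - phi[1])   # one logistic-regression step

    print(f"queried on {queries}/{T} rounds")
    print("cosine(w_hat, w_true) =",
          w_hat @ w_true / (np.linalg.norm(w_hat) * np.linalg.norm(w_true) + 1e-12))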

Reinforcement learning from human feedback with active queries

K Ji, J He, Q Gu - arXiv preprint arXiv:2402.09401, 2024 - arxiv.org
Aligning large language models (LLMs) with human preferences plays a key role in building
modern generative models and can be achieved by reinforcement learning from human …

Think Before You Duel: Understanding Complexities of Preference Learning under Constrained Resources

R Deb, A Saha, A Banerjee - International Conference on …, 2024 - proceedings.mlr.press
We consider the problem of reward maximization in the dueling bandit setup along with
constraints on resource consumption. As in the classic dueling bandits, at each round the …

Borda regret minimization for generalized linear dueling bandits

Y Wu, T Jin, H Lou, F Farnoud, Q Gu - arXiv preprint arXiv:2303.08816, 2023 - arxiv.org
Dueling bandits are widely used to model preferential feedback prevalent in many
applications such as recommendation systems and ranking. In this paper, we study the …
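
The Borda score behind "Borda regret" is standard and worth writing out: arm i's Borda score is its probability of beating an opponent drawn uniformly at random from the other arms, and the Borda winner maximizes that score. The preference matrix below is a made-up example.

    import numpy as np

    # P[i, j] = probability that arm i is preferred to arm j (4 toy arms)
    P = np.array([
        [0.5, 0.6, 0.7, 0.4],
        [0.4, 0.5, 0.8, 0.6],
        [0.3, 0.2, 0.5, 0.5],
        [0.6, 0.4, 0.5, 0.5],
    ])
    K = P.shape[0]
    borda = (P.sum(axis=1) - 0.5) / (K - 1)   # average win probability vs. the other arms
    print("Borda scores:", np.round(borda, 3))
    print("Borda winner:", int(np.argmax(borda)))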

Optimal design for reward modeling in RLHF

A Scheid, E Boursier, A Durmus, MI Jordan… - arXiv preprint arXiv …, 2024 - arxiv.org
Reinforcement Learning from Human Feedback (RLHF) has become a popular approach to
align language models (LMs) with human preferences. This method involves collecting a …
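
To give the optimal-design flavor of the problem (a greedy D-optimal heuristic on made-up linear features, not the procedure proposed in this paper): among candidate comparison pairs, repeatedly query the one whose feature difference most increases the log-determinant of the design matrix, i.e. the pair most informative for estimating a linear reward model.

    import numpy as np

    rng = np.random.default_rng(4)
    d, n_candidates, budget = 6, 300, 20
    # feature difference phi(prompt, answer_a) - phi(prompt, answer_b) per candidate pair
    diffs = rng.normal(size=(n_candidates, d))

    V = 1e-3 * np.eye(d)                         # regularized design matrix
    chosen = []
    for _ in range(budget):
        V_inv = np.linalg.inv(V)
        # gain of candidate x: log det(V + x x^T) - log det(V) = log(1 + x^T V^{-1} x)
        gains = np.log1p(np.einsum("nd,de,ne->n", diffs, V_inv, diffs))
        best = int(np.argmax(gains))
        chosen.append(best)
        V += np.outer(diffs[best], diffs[best])

    print("comparison pairs a greedy D-optimal rule would query:", chosen)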