Direct preference optimization: Your language model is secretly a reward model

R Rafailov, A Sharma, E Mitchell… - Advances in …, 2023 - proceedings.neurips.cc
While large-scale unsupervised language models (LMs) learn broad world knowledge and
some reasoning skills, achieving precise control of their behavior is difficult due to the …
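
For context, the DPO objective proposed in this paper reduces preference alignment to a single classification-style loss over preference pairs. The sketch below is a minimal, illustrative PyTorch version assuming precomputed sequence log-probabilities under the trainable policy and a frozen reference model; the function name, argument names, and the beta value are illustrative rather than taken from the paper.

    import torch
    import torch.nn.functional as F

    def dpo_loss(policy_chosen_logps: torch.Tensor,
                 policy_rejected_logps: torch.Tensor,
                 ref_chosen_logps: torch.Tensor,
                 ref_rejected_logps: torch.Tensor,
                 beta: float = 0.1) -> torch.Tensor:
        """DPO loss over a batch of preference pairs.

        Each argument is a 1-D tensor holding the summed token log-probability
        of the chosen / rejected completion under the trainable policy or the
        frozen reference model.
        """
        # Implicit rewards: scaled log-ratio between policy and reference model.
        chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
        rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
        # Bradley-Terry preference likelihood: push sigma(chosen - rejected) toward 1.
        return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()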

Using human feedback to fine-tune diffusion models without any reward model

K Yang, J Tao, J Lyu, C Ge, J Chen… - Proceedings of the …, 2024 - openaccess.thecvf.com
Using reinforcement learning with human feedback (RLHF) has shown significant promise in
fine-tuning diffusion models. Previous methods start by training a reward model that aligns …

A survey of reinforcement learning from human feedback

T Kaufmann, P Weng, V Bengs… - arXiv preprint arXiv …, 2023 - researchgate.net
Reinforcement learning from human feedback (RLHF) is a variant of reinforcement learning
(RL) that learns from human feedback instead of relying on an engineered reward function …
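
For reference, the RLHF fine-tuning stage surveyed here is commonly written as a KL-regularized objective against a learned reward model; the formulation below is the standard one rather than notation taken from this survey:

    \max_{\pi}\; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi(\cdot \mid x)}\big[r_\phi(x, y)\big]
      \;-\; \beta\, \mathbb{D}_{\mathrm{KL}}\big[\pi(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x)\big]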

Contextual bandits and imitation learning with preference-based active queries

A Sekhari, K Sridharan, W Sun… - Advances in Neural …, 2023 - proceedings.neurips.cc
We consider the problem of contextual bandits and imitation learning, where the learner
lacks direct knowledge of the executed action's reward. Instead, the learner can actively …

Making RL with preference-based feedback efficient via randomization

R Wu, W Sun - arXiv preprint arXiv:2310.14554, 2023 - arxiv.org
Reinforcement Learning algorithms that learn from human feedback (RLHF) need to be
efficient in terms of statistical complexity, computational complexity, and query complexity. In …

PARL: A unified framework for policy alignment in reinforcement learning from human feedback

S Chakraborty, AS Bedi, A Koppel, D Manocha… - arXiv preprint arXiv …, 2023 - arxiv.org
We present a novel unified bilevel optimization-based framework, PARL, formulated
to address the recently highlighted critical issue of policy alignment in reinforcement …

Multi-turn reinforcement learning from preference human feedback

L Shani, A Rosenberg, A Cassel, O Lang… - arXiv preprint arXiv …, 2024 - arxiv.org
Reinforcement Learning from Human Feedback (RLHF) has become the standard approach
for aligning Large Language Models (LLMs) with human preferences, allowing LLMs to …

RLVF: Learning from verbal feedback without overgeneralization

M Stephan, A Khazatsky, E Mitchell, AS Chen… - arXiv preprint arXiv …, 2024 - arxiv.org
The diversity of contexts in which large language models (LLMs) are deployed requires the
ability to modify or customize default model behaviors to incorporate nuanced requirements …

Reward model learning vs. direct policy optimization: A comparative analysis of learning from human preferences

A Nika, D Mandal, P Kamalaruban, G Tzannetos… - arXiv preprint arXiv …, 2024 - arxiv.org
In this paper, we take a step towards a deeper understanding of learning from human
preferences by systematically comparing the paradigm of reinforcement learning from …

On championing foundation models: From explainability to interpretability

S Fu, Y Chen, Y Wang, D Tao - arXiv preprint arXiv:2410.11444, 2024 - arxiv.org
Understanding the inner mechanisms of black-box foundation models (FMs) is essential yet
challenging in artificial intelligence and its applications. Over the last decade, the long …