SimPO: Simple preference optimization with a reference-free reward

Y Meng, M Xia, D Chen - Advances in Neural Information …, 2025 - proceedings.neurips.cc
Direct Preference Optimization (DPO) is a widely used offline preference
optimization algorithm that reparameterizes reward functions in reinforcement learning from …
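
For context on the contrast this entry draws, a minimal sketch of the two reward parameterizations, with notation following the DPO and SimPO papers ($\pi_\theta$ is the policy being trained, $\pi_{\mathrm{ref}}$ the frozen reference model, $\beta$ a scaling hyperparameter, $\gamma$ a target reward margin, $|y|$ the response length in tokens):

\[
r_{\mathrm{DPO}}(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} + \beta \log Z(x),
\qquad
r_{\mathrm{SimPO}}(x, y) = \frac{\beta}{|y|} \log \pi_\theta(y \mid x),
\]

\[
\mathcal{L}_{\mathrm{SimPO}}(\pi_\theta) = -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}
\left[ \log \sigma\!\left( \frac{\beta}{|y_w|} \log \pi_\theta(y_w \mid x)
- \frac{\beta}{|y_l|} \log \pi_\theta(y_l \mid x) - \gamma \right) \right].
\]

The SimPO reward drops the reference model entirely and length-normalizes the policy log-likelihood, which is what "reference-free" refers to in the title.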

Prometheus 2: An open source language model specialized in evaluating other language models

S Kim, J Suk, S Longpre, BY Lin, J Shin… - arXiv preprint arXiv …, 2024 - arxiv.org
Proprietary LMs such as GPT-4 are often employed to assess the quality of responses from
various LMs. However, concerns including transparency, controllability, and affordability …

RLHF workflow: From reward modeling to online RLHF

H Dong, W Xiong, B Pang, H Wang, H Zhao… - arXiv preprint arXiv …, 2024 - arxiv.org
We present the workflow of Online Iterative Reinforcement Learning from Human Feedback
(RLHF) in this technical report, which is widely reported to outperform its offline counterpart …

Interpretable preferences via multi-objective reward modeling and mixture-of-experts

H Wang, W Xiong, T Xie, H Zhao, T Zhang - arXiv preprint arXiv …, 2024 - arxiv.org
Reinforcement learning from human feedback (RLHF) has emerged as the primary method
for aligning large language models (LLMs) with human preferences. The RLHF process …

Strengthening multimodal large language model with bootstrapped preference optimization

R Pi, T Han, W Xiong, J Zhang, R Liu, R Pan… - … on Computer Vision, 2024 - Springer
Multimodal Large Language Models (MLLMs) excel in generating responses based
on visual inputs. However, they often suffer from a bias towards generating responses …

DPO meets PPO: Reinforced token optimization for RLHF

H Zhong, G Feng, W Xiong, X Cheng, L Zhao… - arXiv preprint arXiv …, 2024 - arxiv.org
In the classical Reinforcement Learning from Human Feedback (RLHF) framework, Proximal
Policy Optimization (PPO) is employed to learn from sparse, sentence-level rewards--a …
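
A hedged sketch of the token-level credit assignment the title alludes to, assuming (as in a DPO-style derivation) that the per-token reward is taken to be the policy-to-reference log-ratio at each generation step, which PPO can then optimize densely rather than from a single sentence-level score:

\[
r_t(s_t, a_t) \;\propto\; \beta \log \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\mathrm{ref}}(a_t \mid s_t)},
\]

where $s_t$ denotes the prompt together with the tokens generated so far and $a_t$ the next token.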

Regularizing hidden states enables learning generalizable reward model for LLMs

R Yang, R Ding, Y Lin, H Zhang, T Zhang - arXiv preprint arXiv …, 2024 - arxiv.org
Reward models trained on human preference data have been proven to effectively align
Large Language Models (LLMs) with human intent within the framework of reinforcement …

Skywork-Reward: Bag of tricks for reward modeling in LLMs

CY Liu, L Zeng, J Liu, R Yan, J He, C Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
In this report, we introduce a collection of methods to enhance reward modeling for LLMs,
focusing specifically on data-centric techniques. We propose effective data selection and …

VideoScore: Building automatic metrics to simulate fine-grained human feedback for video generation

X He, D Jiang, G Zhang, M Ku, A Soni, S Siu… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent years have witnessed great advances in video generation. However, the
development of automatic video metrics is lagging significantly behind. None of the existing …

Weak-to-strong search: Align large language models via searching over small language models

Z Zhou, Z Liu, J Liu, Z Dong, C Yang, Y Qiao - arXiv preprint arXiv …, 2024 - arxiv.org
Large language models are usually fine-tuned to align with human preferences. However,
fine-tuning a large language model can be challenging. In this work, we introduce weak-to-strong search …