Cambrian-1: A fully open, vision-centric exploration of multimodal LLMs

P Tong, E Brown, P Wu, S Woo… - Advances in …, 2025 - proceedings.neurips.cc
We introduce Cambrian-1, a family of multimodal LLMs (MLLMs) designed with a vision-
centric approach. While stronger language models can enhance multimodal capabilities, the …

SimPO: Simple preference optimization with a reference-free reward

Y Meng, M Xia, D Chen - Advances in Neural Information …, 2025 - proceedings.neurips.cc
Direct Preference Optimization (DPO) is a widely used offline preference
optimization algorithm that reparameterizes reward functions in reinforcement learning from …
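
For orientation (not taken from the truncated abstract), a minimal sketch of the contrast this entry alludes to, assuming the standard notation with policy pi_theta, reference policy pi_ref, preferred/dispreferred responses y_w, y_l, scaling beta, and target margin gamma:

```latex
% DPO's implicit, reference-dependent reward
r_{\mathrm{DPO}}(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}

% SimPO's reference-free, length-normalized reward and objective (sketch)
r_{\mathrm{SimPO}}(x, y) = \frac{\beta}{|y|} \log \pi_\theta(y \mid x)

\mathcal{L}_{\mathrm{SimPO}} = -\,\mathbb{E}_{(x, y_w, y_l)}
  \left[ \log \sigma\!\left( r_{\mathrm{SimPO}}(x, y_w) - r_{\mathrm{SimPO}}(x, y_l) - \gamma \right) \right]
```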

Iterative preference learning from human feedback: Bridging theory and practice for RLHF under KL-constraint

W Xiong, H Dong, C Ye, Z Wang, H Zhong, H Ji… - arXiv preprint arXiv …, 2023 - arxiv.org
This paper studies the alignment process of generative models with Reinforcement Learning
from Human Feedback (RLHF). We first identify the primary challenges of existing popular …

Regularizing hidden states enables learning generalizable reward model for LLMs

R Yang, R Ding, Y Lin, H Zhang… - Advances in Neural …, 2025 - proceedings.neurips.cc
Reward models trained on human preference data have been proven to effectively align
Large Language Models (LLMs) with human intent within the framework of reinforcement …

Interpretable preferences via multi-objective reward modeling and mixture-of-experts

H Wang, W Xiong, T Xie, H Zhao, T Zhang - arXiv preprint arXiv …, 2024 - arxiv.org
Reinforcement learning from human feedback (RLHF) has emerged as the primary method
for aligning large language models (LLMs) with human preferences. The RLHF process …
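
As a rough sketch of the framing this title suggests (the symbols below are illustrative assumptions, not notation from the entry): a multi-objective reward model predicts a vector of per-attribute rewards, and a prompt-conditioned gating network mixes them into the scalar used for alignment:

```latex
% k interpretable reward dimensions (e.g., helpfulness, correctness, verbosity)
\mathbf{r}_\theta(x, y) \in \mathbb{R}^{k}

% gating weights on the probability simplex, conditioned on the prompt
\mathbf{g}_\phi(x) \in \Delta^{k-1}

% final scalar reward as a gated mixture of the objectives
r(x, y) = \mathbf{g}_\phi(x)^{\top}\, \mathbf{r}_\theta(x, y)
```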

Aligning to thousands of preferences via system message generalization

S Lee, SH Park, S Kim, M Seo - Advances in Neural …, 2025 - proceedings.neurips.cc
Although humans inherently have diverse values, current large language model (LLM)
alignment methods often assume that aligning LLMs with the general public's preferences is …

DPO meets PPO: Reinforced token optimization for RLHF

H Zhong, G Feng, W Xiong, X Cheng, L Zhao… - arXiv preprint arXiv …, 2024 - arxiv.org
In the classical Reinforcement Learning from Human Feedback (RLHF) framework, Proximal
Policy Optimization (PPO) is employed to learn from sparse, sentence-level rewards--a …
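
A hedged sketch of the dense-reward idea the title points at, under the assumption that token-level rewards are read off a DPO-style log-ratio (notation is illustrative: s_t is the prefix, a_t the next token):

```latex
% token-level reward derived from a DPO-trained policy relative to a reference,
% then usable as a dense signal inside PPO-style training (sketch, not the paper's exact form)
r_t(s_t, a_t) = \beta \log \frac{\pi_{\mathrm{DPO}}(a_t \mid s_t)}{\pi_{\mathrm{ref}}(a_t \mid s_t)}
```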

LiPO: Listwise preference optimization through learning-to-rank

T Liu, Z Qin, J Wu, J Shen, M Khalman, R Joshi… - arXiv preprint arXiv …, 2024 - arxiv.org
Aligning language models (LMs) with curated human feedback is critical to control their
behaviors in real-world applications. Several recent policy optimization methods, such as …

Weak-to-strong extrapolation expedites alignment

C Zheng, Z Wang, H Ji, M Huang, N Peng - arXiv preprint arXiv …, 2024 - arxiv.org
The open-source community is experiencing a surge in the release of large language
models (LLMs) that are trained to follow instructions and align with human preference …
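
A minimal Python sketch of the weight-extrapolation idea suggested by this title, assuming two checkpoints of the same architecture (a weaker SFT model and a stronger aligned model); the function name, the state-dict interface, and the value of alpha are illustrative assumptions, not details from the paper text:

```python
# Sketch: extrapolate past an aligned checkpoint along the SFT -> aligned direction.
# expo = aligned + alpha * (aligned - sft)  ==  (1 + alpha) * aligned - alpha * sft
import torch


def extrapolate_weights(sft_state: dict, aligned_state: dict, alpha: float = 0.3) -> dict:
    """Blend two state dicts by moving beyond the aligned weights."""
    expo_state = {}
    for name, w_aligned in aligned_state.items():
        w_sft = sft_state[name]
        expo_state[name] = w_aligned + alpha * (w_aligned - w_sft)
    return expo_state


# Usage (illustrative paths):
# sft = torch.load("sft_model.pt")          # weaker, instruction-tuned checkpoint
# aligned = torch.load("aligned_model.pt")  # stronger, preference-aligned checkpoint
# torch.save(extrapolate_weights(sft, aligned, alpha=0.3), "expo_model.pt")
```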

Exploratory preference optimization: Harnessing implicit Q*-approximation for sample-efficient RLHF

T Xie, DJ Foster, A Krishnamurthy, C Rosset… - arXiv preprint arXiv …, 2024 - arxiv.org
Reinforcement learning from human feedback (RLHF) has emerged as a central tool for
language model alignment. We consider online exploration in RLHF, which exploits …