Strengthening multimodal large language model with bootstrapped preference optimization

R Pi, T Han, W Xiong, J Zhang, R Liu, R Pan… - … on Computer Vision, 2024 - Springer
Multimodal Large Language Models (MLLMs) excel in generating responses based
on visual inputs. However, they often suffer from a bias towards generating responses …

VideoScore: Building automatic metrics to simulate fine-grained human feedback for video generation

X He, D Jiang, G Zhang, M Ku, A Soni, S Siu… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent years have witnessed great advances in video generation. However, the
development of automatic video metrics is lagging significantly behind. None of the existing …

DPO meets PPO: Reinforced token optimization for RLHF

H Zhong, G Feng, W Xiong, X Cheng, L Zhao… - arXiv preprint arXiv …, 2024 - arxiv.org
In the classical Reinforcement Learning from Human Feedback (RLHF) framework, Proximal
Policy Optimization (PPO) is employed to learn from sparse, sentence-level rewards--a …

Skywork-Reward: Bag of tricks for reward modeling in LLMs

CY Liu, L Zeng, J Liu, R Yan, J He, C Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
In this report, we introduce a collection of methods to enhance reward modeling for LLMs,
focusing specifically on data-centric techniques. We propose effective data selection and …

RLHF workflow: From reward modeling to online RLHF

H Dong, W Xiong, B Pang, H Wang, H Zhao… - arXiv preprint arXiv …, 2024 - arxiv.org
In this technical report, we present the workflow of Online Iterative Reinforcement Learning from Human Feedback
(RLHF), which is widely reported to outperform its offline counterpart …

Prometheus 2: An open source language model specialized in evaluating other language models

S Kim, J Suk, S Longpre, BY Lin, J Shin… - arXiv preprint arXiv …, 2024 - arxiv.org
Proprietary LMs such as GPT-4 are often employed to assess the quality of responses from
various LMs. However, concerns including transparency, controllability, and affordability …

Decoding-time language model alignment with multiple objectives

R Shi, Y Chen, Y Hu, A Liu, H Hajishirzi… - arXiv preprint arXiv …, 2024 - arxiv.org
Aligning language models (LMs) to human preferences has emerged as a critical pursuit,
enabling these models to better serve diverse user needs. Existing methods primarily focus …

Conditional Language Policy: A General Framework for Steerable Multi-Objective Finetuning

K Wang, R Kidambi, R Sullivan, A Agarwal… - arXiv preprint arXiv …, 2024 - arxiv.org
Reward-based finetuning is crucial for aligning language policies with intended behaviors
(e.g., creativity and safety). A key challenge is to develop steerable language models that …

LLM-as-a-Judge & reward model: What they can and cannot do

G Son, H Ko, H Lee, Y Kim, S Hong - arXiv preprint arXiv:2409.11239, 2024 - arxiv.org
LLM-as-a-Judge and reward models are widely used alternatives to multiple-choice
questions or human annotators for large language model (LLM) evaluation. Their efficacy …

Bi-factorial preference optimization: Balancing safety-helpfulness in language models

W Zhang, PHS Torr, M Elhoseiny, A Bibi - arXiv preprint arXiv:2408.15313, 2024 - arxiv.org
Fine-tuning large language models (LLMs) on human preferences, typically through
reinforcement learning from human feedback (RLHF), has proven successful in enhancing …