DPO meets PPO: Reinforced token optimization for RLHF
In the classical Reinforcement Learning from Human Feedback (RLHF) framework, Proximal
Policy Optimization (PPO) is employed to learn from sparse, sentence-level rewards--a …
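The title suggests combining DPO's implicit reward with PPO-style token-level optimization. As background, here is a minimal sketch of the standard sequence-level DPO loss (Rafailov et al., 2023) that such approaches build on; the function and argument names are illustrative, not code from the paper.

```python
# Minimal sketch of the standard DPO objective; PyTorch assumed.
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """All inputs: (batch,) tensors of summed token log-probs for the
    chosen/rejected response under the policy or frozen reference model."""
    # Implicit reward of each response: beta * log(pi_theta / pi_ref).
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Bradley-Terry preference likelihood on the reward margin.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```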
Decoding-time language model alignment with multiple objectives
Aligning language models (LMs) to human preferences has emerged as a critical pursuit,
enabling these models to better serve diverse user needs. Existing methods primarily focus …
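One common way to realize decoding-time multi-objective steering is to mix next-token distributions from several single-objective policies. The sketch below shows a generic weighted log-probability combination, assuming Hugging Face-style causal LMs; the weighting scheme is an illustration of the general idea, not necessarily this paper's exact method.

```python
import torch

@torch.no_grad()
def combined_next_token_logprobs(models, input_ids, weights):
    """Weighted sum of per-objective next-token log-probs.
    models:  causal LMs, each aligned to one objective (helpfulness, safety, ...)
    weights: floats summing to 1, encoding the user's trade-off
    """
    combined = None
    for model, w in zip(models, weights):
        logits = model(input_ids).logits[:, -1, :]  # next-token logits
        logp = torch.log_softmax(logits, dim=-1)
        combined = w * logp if combined is None else combined + w * logp
    return combined  # sample or take argmax from this at each decoding step
```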
Towards a unified view of preference learning for large language models: A survey
Large Language Models (LLMs) exhibit remarkably powerful capabilities. One of the crucial
factors to achieve success is aligning the LLM's output with human preferences. This …
Conditional Language Policy: A General Framework for Steerable Multi-Objective Finetuning
Reward-based finetuning is crucial for aligning language policies with intended behaviors
(e.g., creativity and safety). A key challenge is to develop steerable language models that …
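A generic recipe for steerable multi-objective finetuning is to sample a trade-off weight vector per training example, scalarize the per-objective rewards with it, and expose the weights to the model through the prompt, so a single policy can be steered at inference time. The sketch below illustrates that recipe under stated assumptions; the prompt format and helper names are hypothetical, not the paper's.

```python
import random

OBJECTIVES = ["creativity", "safety"]  # illustrative objectives

def sample_weights(n):
    """Random point on the simplex: one trade-off per training example."""
    ws = [random.random() for _ in range(n)]
    s = sum(ws)
    return [w / s for w in ws]

def scalarized_reward(per_objective_rewards, weights):
    """Single scalar training reward for one sampled response."""
    return sum(w * r for w, r in zip(weights, per_objective_rewards))

def conditioned_prompt(prompt, weights):
    """Expose the trade-off so one model covers the whole Pareto front."""
    tags = ", ".join(f"{o}={w:.2f}" for o, w in zip(OBJECTIVES, weights))
    return f"[weights: {tags}]\n{prompt}"
```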
The perfect blend: Redefining RLHF with mixture of judges
Reinforcement learning from human feedback (RLHF) has become the leading approach for
fine-tuning large language models (LLMs). However, RLHF has limitations in multi-task …
Alignment of diffusion models: Fundamentals, challenges, and future
Diffusion models have emerged as the leading paradigm in generative modeling, excelling
in various applications. Despite their success, these models often misalign with human …
DynaMath: A dynamic visual benchmark for evaluating mathematical reasoning robustness of vision language models
The rapid advancements in Vision-Language Models (VLMs) have shown great potential in
tackling mathematical reasoning tasks that involve visual context. Unlike humans who can …
Cascade reward sampling for efficient decoding-time alignment
Aligning large language models (LLMs) with human preferences is critical for their
deployment. Recently, decoding-time alignment has emerged as an effective plug-and-play …
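The simplest plug-and-play baseline in this family is best-of-N sampling: draw several candidate responses and keep the one a reward model scores highest. The sketch below shows only that basic reward-guided selection step and omits any cascading/early-rejection logic; all names are illustrative.

```python
def best_of_n(prompt, generate, score, n=8):
    """generate(prompt) -> candidate text; score(prompt, text) -> float."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda c: score(prompt, c))
```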
Personalization of large language models: A survey
Personalization of Large Language Models (LLMs) has recently become increasingly
important with a wide range of applications. Despite the importance and recent progress …
Inverse-RLignment: Inverse Reinforcement Learning from Demonstrations for LLM Alignment
Aligning Large Language Models (LLMs) is crucial for enhancing their safety and utility.
However, existing methods, primarily based on preference datasets, face challenges such …