DPO meets PPO: Reinforced token optimization for RLHF
In the classical Reinforcement Learning from Human Feedback (RLHF) framework, Proximal
Policy Optimization (PPO) is employed to learn from sparse, sentence-level rewards--a …
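The title suggests combining DPO's implicit reward with PPO-style token-level optimization. As background, here is a minimal sketch of the standard sequence-level DPO loss (Rafailov et al., 2023) that such approaches build on; the function and argument names are illustrative, not code from the paper.

```python
# Minimal sketch of the standard DPO objective; PyTorch assumed.
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """All inputs: (batch,) tensors of summed token log-probs for the
    chosen/rejected response under the policy or frozen reference model."""
    # Implicit reward of each response: beta * log(pi_theta / pi_ref).
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Bradley-Terry preference likelihood on the reward margin.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```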
Decoding-time language model alignment with multiple objectives
Aligning language models (LMs) to human preferences has emerged as a critical pursuit,
enabling these models to better serve diverse user needs. Existing methods primarily focus …
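One common way to realize decoding-time multi-objective steering is to mix next-token distributions from several single-objective policies. The sketch below shows a generic weighted log-probability combination, assuming Hugging Face-style causal LMs; the weighting scheme is an illustration of the general idea, not necessarily this paper's exact method.

```python
import torch

@torch.no_grad()
def combined_next_token_logprobs(models, input_ids, weights):
    """Weighted sum of per-objective next-token log-probs.
    models:  causal LMs, each aligned to one objective (helpfulness, safety, ...)
    weights: floats summing to 1, encoding the user's trade-off
    """
    combined = None
    for model, w in zip(models, weights):
        logits = model(input_ids).logits[:, -1, :]  # next-token logits
        logp = torch.log_softmax(logits, dim=-1)
        combined = w * logp if combined is None else combined + w * logp
    return combined  # sample or take argmax from this at each decoding step
```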
Towards a unified view of preference learning for large language models: A survey
Large Language Models (LLMs) exhibit remarkably powerful capabilities. One of the crucial
factors to achieve success is aligning the LLM's output with human preferences. This …
Conditional Language Policy: A General Framework for Steerable Multi-Objective Finetuning
Reward-based finetuning is crucial for aligning language policies with intended behaviors
(e.g., creativity and safety). A key challenge is to develop steerable language models that …
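A generic recipe for steerable multi-objective finetuning is to sample a trade-off weight vector per training example, scalarize the per-objective rewards with it, and expose the weights to the model through the prompt, so a single policy can be steered at inference time. The sketch below illustrates that recipe under stated assumptions; the prompt format and helper names are hypothetical, not the paper's.

```python
import random

OBJECTIVES = ["creativity", "safety"]  # illustrative objectives

def sample_weights(n):
    """Random point on the simplex: one trade-off per training example."""
    ws = [random.random() for _ in range(n)]
    s = sum(ws)
    return [w / s for w in ws]

def scalarized_reward(per_objective_rewards, weights):
    """Single scalar training reward for one sampled response."""
    return sum(w * r for w, r in zip(weights, per_objective_rewards))

def conditioned_prompt(prompt, weights):
    """Expose the trade-off so one model covers the whole Pareto front."""
    tags = ", ".join(f"{o}={w:.2f}" for o, w in zip(OBJECTIVES, weights))
    return f"[weights: {tags}]\n{prompt}"
```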
The perfect blend: Redefining RLHF with mixture of judges
Reinforcement learning from human feedback (RLHF) has become the leading approach for
fine-tuning large language models (LLMs). However, RLHF has limitations in multi-task …
Alignment of diffusion models: Fundamentals, challenges, and future
Diffusion models have emerged as the leading paradigm in generative modeling, excelling
in various applications. Despite their success, these models often misalign with human …
DynaMath: A dynamic visual benchmark for evaluating mathematical reasoning robustness of vision language models
The rapid advancements in Vision-Language Models (VLMs) have shown great potential in
tackling mathematical reasoning tasks that involve visual context. Unlike humans who can …
Cascade reward sampling for efficient decoding-time alignment
Aligning large language models (LLMs) with human preferences is critical for their
deployment. Recently, decoding-time alignment has emerged as an effective plug-and-play …
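The simplest plug-and-play baseline in this family is best-of-N sampling: draw several candidate responses and keep the one a reward model scores highest. The sketch below shows only that basic reward-guided selection step and omits any cascading/early-rejection logic; all names are illustrative.

```python
def best_of_n(prompt, generate, score, n=8):
    """generate(prompt) -> candidate text; score(prompt, text) -> float."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda c: score(prompt, c))
```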
Personalization of large language models: A survey
Personalization of Large Language Models (LLMs) has recently become increasingly
important with a wide range of applications. Despite the importance and recent progress …
Inverse-RLignment: Inverse Reinforcement Learning from Demonstrations for LLM Alignment
Aligning Large Language Models (LLMs) is crucial for enhancing their safety and utility.
However, existing methods, primarily based on preference datasets, face challenges such …