Personalizing Reinforcement Learning from Human Feedback with Variational Preference Learning
Reinforcement Learning from Human Feedback (RLHF) is a powerful paradigm for aligning foundation models to human values and preferences. However, current RLHF techniques …
Does Cross-Cultural Alignment Change the Commonsense Morality of Language Models?
Y Jinnai - arXiv preprint arXiv:2406.16316, 2024 - arxiv.org
Aligning language models with human preferences is a common approach to making them useful to end users. However, most alignment work is done in English …
ProgressGym: Alignment with a Millennium of Moral Progress
Frontier AI systems, including large language models (LLMs), hold increasing influence over the epistemology of human users. Such influence can reinforce prevailing societal values …
Online Learning from Strategic Human Feedback in LLM Fine-Tuning
S Hao, L Duan - arXiv preprint arXiv:2412.16834, 2024 - arxiv.org
Reinforcement learning from human feedback (RLHF) has become an essential step in fine-tuning large language models (LLMs) to align them with human preferences. However …
Balancing Act: Prioritization Strategies for LLM-Designed Restless Bandit Rewards
LLMs are increasingly used to design reward functions based on human preferences in Reinforcement Learning (RL). We focus on LLM-designed rewards for Restless Multi-Armed …
Representative Social Choice: From Learning Theory to AI Alignment
T Qiu - arXiv preprint arXiv:2410.23953, 2024 - arxiv.org
Social choice theory is the study of preference aggregation across a population, used both in mechanism design for human agents and in the democratic alignment of language …
Whose Boat Does it Float? Improving Personalization in Preference Tuning via Inferred User Personas
LLMs are tuned to follow instructions (aligned) by learning which of two outputs users prefer for a prompt. However, this preference data format does not convey why users prefer …
Direct Preference Optimization With Unobserved Preference Heterogeneity
RLHF has emerged as a pivotal step in aligning language models with human objectives and values. It typically involves learning a reward model from human preference data and …
Can LLM be a Personalized Judge?
Ensuring that large language models (LLMs) reflect diverse user values and preferences is crucial as their user bases expand globally. It is therefore encouraging to see the growing …
False consensus biases AI against vulnerable stakeholders
The deployment of AI systems for welfare benefit allocation allows for accelerated decision-making and faster provision of critical help, but has already led to an increase in unfair …