Personalizing reinforcement learning from human feedback with variational preference learning

S Poddar, Y Wan, H Ivison, A Gupta… - arXiv preprint arXiv …, 2024 - arxiv.org
Reinforcement Learning from Human Feedback (RLHF) is a powerful paradigm for aligning
foundation models to human values and preferences. However, current RLHF techniques …

Does Cross-Cultural Alignment Change the Commonsense Morality of Language Models?

Y Jinnai - arXiv preprint arXiv:2406.16316, 2024 - arxiv.org
Aligning language models with human preferences is a common approach to making
them useful to end users. However, most alignment work is done in English …

ProgressGym: Alignment with a Millennium of Moral Progress

T Qiu, Y Zhang, X Huang, JX Li, J Ji, Y Yang - arXiv preprint arXiv …, 2024 - arxiv.org
Frontier AI systems, including large language models (LLMs), hold increasing influence over
the epistemology of human users. Such influence can reinforce prevailing societal values …

Online Learning from Strategic Human Feedback in LLM Fine-Tuning

S Hao, L Duan - arXiv preprint arXiv:2412.16834, 2024 - arxiv.org
Reinforcement learning from human feedback (RLHF) has become an essential step in fine-
tuning large language models (LLMs) to align them with human preferences. However …

Balancing Act: Prioritization Strategies for LLM-Designed Restless Bandit Rewards

S Verma, N Boehmer, L Kong, M Tambe - arXiv preprint arXiv:2408.12112, 2024 - arxiv.org
LLMs are increasingly used to design reward functions based on human preferences in
Reinforcement Learning (RL). We focus on LLM-designed rewards for Restless Multi-Armed …

Representative Social Choice: From Learning Theory to AI Alignment

T Qiu - arXiv preprint arXiv:2410.23953, 2024 - arxiv.org
Social choice theory is the study of preference aggregation across a population, used both
in mechanism design for human agents and in the democratic alignment of language …

Whose Boat Does it Float? Improving Personalization in Preference Tuning via Inferred User Personas

N Balepur, V Padmakumar, F Yang, S Feng… - arXiv preprint arXiv …, 2025 - arxiv.org
LLMs are tuned to follow instructions (aligned) by learning which of two outputs users prefer
for a prompt. However, this preference data format does not convey why users prefer …

Direct Preference Optimization With Unobserved Preference Heterogeneity

K Chidambaram, KV Seetharaman… - arXiv preprint arXiv …, 2024 - arxiv.org
Reinforcement learning from human feedback (RLHF) has emerged as a pivotal step in aligning language models with human objectives
and values. It typically involves learning a reward model from human preference data and …

Can LLM be a Personalized Judge?

YR Dong, T Hu, N Collier - arXiv preprint arXiv:2406.11657, 2024 - arxiv.org
Ensuring that large language models (LLMs) reflect diverse user values and preferences is
crucial as their user bases expand globally. It is therefore encouraging to see the growing …

False consensus biases AI against vulnerable stakeholders

M Dong, JF Bonnefon, I Rahwan - arXiv preprint arXiv:2407.12143, 2024 - arxiv.org
The deployment of AI systems for welfare benefit allocation allows for accelerated decision-
making and faster provision of critical help, but has already led to an increase in unfair …