The benefits, risks and bounds of personalizing the alignment of large language models to individuals

HR Kirk, B Vidgen, P Röttger, SA Hale - Nature Machine Intelligence, 2024 - nature.com
Large language models (LLMs) undergo 'alignment' so that they better reflect human values
or preferences, and are safer or more useful. However, alignment is intrinsically difficult …

Personalizing reinforcement learning from human feedback with variational preference learning

S Poddar, Y Wan, H Ivison, A Gupta… - arXiv preprint arXiv …, 2024 - arxiv.org
Reinforcement Learning from Human Feedback (RLHF) is a powerful paradigm for aligning
foundation models to human values and preferences. However, current RLHF techniques …

Are Large Language Models Consistent over Value-laden Questions?

J Moore, T Deshpande, D Yang - arXiv preprint arXiv:2407.02996, 2024 - arxiv.org
Large language models (LLMs) appear to bias their survey answers toward certain values.
Nonetheless, some argue that LLMs are too inconsistent to simulate particular values. Are …

A Survey on Responsible LLMs: Inherent Risk, Malicious Use, and Mitigation Strategy

H Wang, W Fu, Y Tang, Z Chen, Y Huang… - arXiv preprint arXiv …, 2025 - arxiv.org
While large language models (LLMs) present significant potential for supporting numerous
real-world applications and delivering positive social impacts, they still face significant …

CultureGen: Revealing global cultural perception in language models through natural language prompting

H Li, L Jiang, JD Hwang, H Kim, S Santy… - arXiv preprint arXiv …, 2024 - arxiv.org
As the utilization of large language models (LLMs) has proliferated world-wide, it is crucial
for them to have adequate knowledge and fair representation for diverse global cultures. In …

Beyond preferences in AI alignment

T Zhi-Xuan, M Carroll, M Franklin, H Ashton - Philosophical Studies, 2024 - Springer
The dominant practice of AI alignment assumes (1) that preferences are an adequate
representation of human values, (2) that human rationality can be understood in terms of …