Causal Graph Guided Steering of LLM Values via Prompts and Sparse Autoencoders

Y Kang, J Wang, Y Li, F Zhong, X Feng, M Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
As large language models (LLMs) become increasingly integrated into critical applications,
aligning their behavior with human values presents significant challenges. Current methods …

Mind the Value-Action Gap: Do LLMs Act in Alignment with Their Values?

H Shen, N Clark, T Mitra - arXiv preprint arXiv:2501.15463, 2025 - arxiv.org
Existing research primarily evaluates the values of LLMs by examining their stated
inclinations towards specific values. However, the "Value-Action Gap," a phenomenon …

ICLR 2025 Workshop on Bidirectional Human-AI Alignment

H Shen, Z Ma, R Ghosh, T Knearem, MX Liu… - ICLR 2025 Workshop … - openreview.net
As AI systems grow more integrated into real-world applications, the traditional one-way
approach to AI alignment is proving insufficient. Bidirectional Human-AI Alignment proposes …