AI alignment: A comprehensive survey

J Ji, T Qiu, B Chen, B Zhang, H Lou, K Wang… - arXiv preprint arXiv …, 2023 - arxiv.org

… generalist AI systems that can autonomously act and pursue goals. Increases in …

[PDF] Managing AI risks in an era of rapid progress

Y Bengio, G Hinton, A Yao, D Song… - arXiv preprint arXiv …, 2023 - blog.biocomm.ai
In this short consensus paper, we outline risks from upcoming, advanced AI systems. We
examine large-scale social harms and malicious uses, as well as an irreversible loss of …

Deception abilities emerged in large language models

T Hagendorff - Proceedings of the National Academy of Sciences, 2024 - pnas.org
Large language models (LLMs) are currently at the forefront of intertwining AI systems with
human communication and everyday life. Thus, aligning them with human values is of great …

[PDF] Thousands of AI authors on the future of AI

K Grace, H Stewart, JF Sandkühler… - arXiv preprint arXiv …, 2024 - i-love-ai.com
In the largest survey of its kind, we surveyed 2,778 researchers who had published in
top-tier artificial intelligence (AI) venues, asking for their predictions on the pace of AI progress …

Mechanistic Interpretability for AI Safety--A Review

L Bereska, E Gavves - arXiv preprint arXiv:2404.14082, 2024 - arxiv.org
Understanding AI systems' inner workings is critical for ensuring value alignment and safety.
This review explores mechanistic interpretability: reverse engineering the computational …

Alignment for honesty

Y Yang, E Chern, X Qiu, G Neubig, P Liu - arXiv preprint arXiv:2312.07000, 2023 - arxiv.org
Recent research has made significant strides in applying alignment techniques to enhance
the helpfulness and harmlessness of large language models (LLMs) in accordance with …