AI alignment: A comprehensive survey

J Ji, T Qiu, B Chen, B Zhang, H Lou, K Wang… - arXiv preprint arXiv …, 2023 - arxiv.org

… generalist AI systems that can autonomously act and pursue goals. Increases in …

[PDF] Managing AI risks in an era of rapid progress

Y Bengio, G Hinton, A Yao, D Song… - arXiv preprint arXiv …, 2023 - blog.biocomm.ai
In this short consensus paper, we outline risks from upcoming, advanced AI systems. We
examine large-scale social harms and malicious uses, as well as an irreversible loss of …

Deception abilities emerged in large language models

T Hagendorff - Proceedings of the National Academy of Sciences, 2024 - pnas.org
Large language models (LLMs) are currently at the forefront of intertwining AI systems with
human communication and everyday life. Thus, aligning them with human values is of great …

[PDF] Thousands of AI authors on the future of AI

K Grace, H Stewart, JF Sandkühler… - arXiv preprint arXiv …, 2024 - i-love-ai.com
In the largest survey of its kind, we surveyed 2,778 researchers who had published in
top-tier artificial intelligence (AI) venues, asking for their predictions on the pace of AI progress …

Mechanistic Interpretability for AI Safety--A Review

L Bereska, E Gavves - arXiv preprint arXiv:2404.14082, 2024 - arxiv.org
Understanding AI systems' inner workings is critical for ensuring value alignment and safety.
This review explores mechanistic interpretability: reverse engineering the computational …

Alignment for honesty

Y Yang, E Chern, X Qiu, G Neubig, P Liu - arXiv preprint arXiv:2312.07000, 2023 - arxiv.org
Recent research has made significant strides in applying alignment techniques to enhance
the helpfulness and harmlessness of large language models (LLMs) in accordance with …