AI alignment: A comprehensive survey
AI deception: A survey of examples, risks, and potential solutions
This paper argues that a range of current AI systems have learned how to deceive humans.
We define deception as the systematic inducement of false beliefs in the pursuit of some …
Foundational challenges in assuring alignment and safety of large language models
This work identifies 18 foundational challenges in assuring the alignment and safety of large
language models (LLMs). These challenges are organized into three different categories …
The alignment problem from a deep learning perspective
In coming years or decades, artificial general intelligence (AGI) may surpass human
capabilities at many critical tasks. We argue that, without substantial effort to prevent it, AGIs …
Black-box access is insufficient for rigorous AI audits
External audits of AI systems are increasingly recognized as a key mechanism for AI
governance. The effectiveness of an audit, however, depends on the degree of access …
Watch out for your agents! Investigating backdoor threats to LLM-based agents
Driven by the rapid development of Large Language Models (LLMs), LLM-based agents
have been developed to handle various real-world applications, including finance …
Debating with more persuasive LLMs leads to more truthful answers
Common methods for aligning large language models (LLMs) with desired behaviour
heavily rely on human-labelled data. However, as models grow increasingly sophisticated …