AI alignment: A comprehensive survey
AI alignment aims to make AI systems behave in line with human intentions and values. As
AI systems grow more capable, the potential large-scale risks associated with misaligned AI …
Survey of vulnerabilities in large language models revealed by adversarial attacks
Large Language Models (LLMs) are swiftly advancing in architecture and capability, and as
they integrate more deeply into complex systems, the urgency to scrutinize their security …
Universal and transferable adversarial attacks on aligned language models
Because" out-of-the-box" large language models are capable of generating a great deal of
objectionable content, recent work has focused on aligning these models in an attempt to …
Jailbroken: How does LLM safety training fail?
Large language models trained for safety and harmlessness remain susceptible to
adversarial misuse, as evidenced by the prevalence of “jailbreak” attacks on early releases …
Are aligned neural networks adversarially aligned?
Large language models are now tuned to align with the goals of their creators, namely to be
"helpful and harmless." These models should respond helpfully to user questions, but refuse …
Foundational challenges in assuring alignment and safety of large language models
This work identifies 18 foundational challenges in assuring the alignment and safety of large
language models (LLMs). These challenges are organized into three different categories …
Catastrophic jailbreak of open-source LLMs via exploiting generation
The rapid progress in open-source large language models (LLMs) is significantly advancing
AI development. Extensive efforts have been made before model release to align their …
Visual adversarial examples jailbreak aligned large language models
Warning: this paper contains data, prompts, and model outputs that are offensive in nature.
Recently, there has been a surge of interest in integrating vision into Large Language …
GPT-4 is too smart to be safe: Stealthy chat with LLMs via cipher
Safety lies at the core of the development of Large Language Models (LLMs). There is
ample work on aligning LLMs with human ethics and preferences, including data filtering in …
How Johnny can persuade LLMs to jailbreak them: Rethinking persuasion to challenge AI safety by humanizing LLMs
Most traditional AI safety research has approached AI models as machines and centered on
algorithm-focused attacks developed by security experts. As large language models (LLMs) …