AI alignment: A comprehensive survey
AI alignment aims to make AI systems behave in line with human intentions and values. As
AI systems grow more capable, the potential large-scale risks associated with misaligned AI …
Survey of vulnerabilities in large language models revealed by adversarial attacks
Large Language Models (LLMs) are swiftly advancing in architecture and capability, and as
they integrate more deeply into complex systems, the urgency to scrutinize their security …
Universal and transferable adversarial attacks on aligned language models
Because" out-of-the-box" large language models are capable of generating a great deal of
objectionable content, recent work has focused on aligning these models in an attempt to …
Jailbroken: How does LLM safety training fail?
Large language models trained for safety and harmlessness remain susceptible to
adversarial misuse, as evidenced by the prevalence of “jailbreak” attacks on early releases …
Are aligned neural networks adversarially aligned?
Large language models are now tuned to align with the goals of their creators, namely to be
"helpful and harmless." These models should respond helpfully to user questions, but refuse …
Foundational challenges in assuring alignment and safety of large language models
This work identifies 18 foundational challenges in assuring the alignment and safety of large
language models (LLMs). These challenges are organized into three different categories …
Catastrophic jailbreak of open-source LLMs via exploiting generation
The rapid progress in open-source large language models (LLMs) is significantly advancing
AI development. Extensive efforts have been made before model release to align their …
Visual adversarial examples jailbreak aligned large language models
Warning: this paper contains data, prompts, and model outputs that are offensive in nature.
Recently, there has been a surge of interest in integrating vision into Large Language …
GPT-4 is too smart to be safe: Stealthy chat with LLMs via cipher
Safety lies at the core of the development of Large Language Models (LLMs). There is
ample work on aligning LLMs with human ethics and preferences, including data filtering in …
How Johnny can persuade LLMs to jailbreak them: Rethinking persuasion to challenge AI safety by humanizing LLMs
Most traditional AI safety research has approached AI models as machines and centered on
algorithm-focused attacks developed by security experts. As large language models (LLMs) …