Survey of vulnerabilities in large language models revealed by adversarial attacks

E Shayegani, MAA Mamun, Y Fu, P Zaree… - arXiv preprint arXiv …, 2023 - arxiv.org
Large Language Models (LLMs) are swiftly advancing in architecture and capability, and as
they integrate more deeply into complex systems, the urgency to scrutinize their security …

A survey of adversarial defenses and robustness in NLP

S Goyal, S Doddapaneni, MM Khapra… - ACM Computing …, 2023 - dl.acm.org
In the past few years, it has become increasingly evident that deep neural networks are not
resilient enough to withstand adversarial perturbations in input data, leaving them …

Trustworthy llms: a survey and guideline for evaluating large language models' alignment

Y Liu, Y Yao, JF Ton, X Zhang, R Guo, H Cheng… - arXiv preprint arXiv …, 2023 - arxiv.org
Ensuring alignment, which refers to making models behave in accordance with human
intentions [1, 2], has become a critical task before deploying large language models (LLMs) …

Easily accessible text-to-image generation amplifies demographic stereotypes at large scale

F Bianchi, P Kalluri, E Durmus, F Ladhak… - Proceedings of the …, 2023 - dl.acm.org
Machine learning models that convert user-written text descriptions into images are now
widely available online and used by millions of users to generate millions of images a day …

Red teaming language models with language models

E Perez, S Huang, F Song, T Cai, R Ring… - arXiv preprint arXiv …, 2022 - arxiv.org
Language Models (LMs) often cannot be deployed because of their potential to harm users
in hard-to-predict ways. Prior work identifies harmful behaviors before deployment by using …

A survey of safety and trustworthiness of large language models through the lens of verification and validation

X Huang, W Ruan, W Huang, G Jin, Y Dong… - Artificial Intelligence …, 2024 - Springer
Large language models (LLMs) have ignited a new heatwave of AI for their ability to
engage end-users in human-level conversations with detailed and articulate answers across …

Algorithmic content moderation: Technical and political challenges in the automation of platform governance

R Gorwa, R Binns, C Katzenbach - Big Data & Society, 2020 - journals.sagepub.com
As government pressure on major technology companies builds, both firms and legislators
are searching for technical solutions to difficult platform governance puzzles such as hate …

Quark: Controllable text generation with reinforced unlearning

X Lu, S Welleck, J Hessel, L Jiang… - Advances in neural …, 2022 - proceedings.neurips.cc
Large-scale language models often learn behaviors that are misaligned with user
expectations. Generated text may contain offensive or toxic language, contain significant …

Weight poisoning attacks on pre-trained models

K Kurita, P Michel, G Neubig - arXiv preprint arXiv:2004.06660, 2020 - arxiv.org
Recently, NLP has seen a surge in the usage of large pre-trained models. Users download
weights of models pre-trained on large datasets, then fine-tune the weights on a task of their …

Mind the style of text! Adversarial and backdoor attacks based on text style transfer

F Qi, Y Chen, X Zhang, M Li, Z Liu, M Sun - arXiv preprint arXiv …, 2021 - arxiv.org
Adversarial attacks and backdoor attacks are two common security threats that hang over
deep learning. Both of them harness task-irrelevant features of data in their implementation …