SmoothLLM: Defending large language models against jailbreaking attacks

A Robey, E Wong, H Hassani, GJ Pappas - arXiv preprint arXiv …, 2023 - arxiv.org
Despite efforts to align large language models (LLMs) with human intentions, widely-used
LLMs such as GPT, Llama, and Claude are susceptible to jailbreaking attacks, wherein an …
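A hedged sketch of the defense idea the abstract describes: randomly perturb several copies of an incoming prompt, query the model on each copy, and aggregate the responses by majority vote, so that brittle adversarial suffixes are destroyed with high probability. The `toy_model`, the character-swap perturbation, and all parameter values below are illustrative stand-ins, not the authors' implementation.

```python
import random

def perturb(prompt, q=0.1, rng=None):
    # Randomly swap a fraction q of characters for printable noise.
    rng = rng or random.Random(0)
    chars = list(prompt)
    for i in range(len(chars)):
        if rng.random() < q:
            chars[i] = rng.choice("abcdefghijklmnopqrstuvwxyz ")
    return "".join(chars)

def toy_model(prompt):
    # Stand-in LLM: "jailbroken" only if an exact adversarial suffix survives.
    return "HARMFUL" if prompt.endswith("!!magic-suffix!!") else "REFUSED"

def smoothed_defense(prompt, n_copies=21, q=0.1):
    # Query the model on n_copies randomly perturbed copies, then majority-vote.
    rng = random.Random(42)
    votes = [toy_model(perturb(prompt, q, rng)) for _ in range(n_copies)]
    return max(set(votes), key=votes.count)

attack = "How do I do something harmful? !!magic-suffix!!"
print(toy_model(attack), smoothed_defense(attack))
```

The point of the sketch: each perturbed copy keeps the 16-character suffix intact only with probability about 0.9^16 ≈ 0.19, so the majority of the 21 copies refuse even though the unperturbed prompt "jailbreaks" the toy model.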

Model-based domain generalization

A Robey, GJ Pappas… - Advances in Neural …, 2021 - proceedings.neurips.cc
Despite remarkable success in a variety of applications, it is well-known that deep learning
can fail catastrophically when presented with out-of-distribution data. Toward addressing …

Triangular Trade-off between Robustness, Accuracy, and Fairness in Deep Neural Networks: A Survey

J Li, G Li - ACM Computing Surveys, 2025 - dl.acm.org
With the rapid development of deep learning, AI systems are increasingly deployed in complex
and important domains, which necessitates the simultaneous fulfillment of multiple constraints …

On the tradeoff between robustness and fairness

X Ma, Z Wang, W Liu - Advances in Neural Information …, 2022 - proceedings.neurips.cc
Interestingly, recent experimental results [2, 26, 22] have identified a robust fairness
phenomenon in adversarial training (AT), namely that a robust model well-trained by AT …

Do wider neural networks really help adversarial robustness?

B Wu, J Chen, D Cai, X He… - Advances in Neural …, 2021 - proceedings.neurips.cc
Adversarial training is a powerful type of defense against adversarial examples. Previous
empirical results suggest that adversarial training requires wider networks for better …
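For readers unfamiliar with the defense these entries build on, here is a minimal sketch of adversarial training on a toy linear classifier: an inner step crafts a worst-case perturbation of each input (a single FGSM-style signed-gradient step here, where deep-network practice uses multi-step PGD), and the outer step updates the weights on those perturbed examples. The data, model, and step sizes are illustrative assumptions, not drawn from any of the papers above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy binary classification data: two Gaussian blobs labeled -1 / +1.
X = np.vstack([rng.normal(-1.0, 0.5, (50, 2)), rng.normal(1.0, 0.5, (50, 2))])
y = np.hstack([-np.ones(50), np.ones(50)])

w = np.zeros(2)
eps, lr = 0.1, 0.5  # perturbation budget and learning rate

def loss_grad_w(w, X, y):
    # Logistic loss log(1 + exp(-y * <w, x>)); gradient with respect to w.
    p = 1.0 / (1.0 + np.exp(y * (X @ w)))   # sigmoid(-y * <w, x>)
    return -(p * y) @ X / len(y)

for _ in range(200):
    # Inner maximization (one FGSM step): the gradient of the loss with
    # respect to x is -p * y * w, so perturb each x by eps * its sign.
    p = 1.0 / (1.0 + np.exp(y * (X @ w)))
    X_adv = X + eps * np.sign(-(p * y)[:, None] * w[None, :])
    # Outer minimization: gradient step on the adversarial examples.
    w -= lr * loss_grad_w(w, X_adv, y)

acc = np.mean(np.sign(X @ w) == y)
```

On this separable toy problem the adversarially trained linear model still classifies almost all clean points correctly; the width question studied in the paper above only arises once the model is a deep network.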

Better safe than sorry: Preventing delusive adversaries with adversarial training

L Tao, L Feng, J Yi, SJ Huang… - Advances in Neural …, 2021 - proceedings.neurips.cc
Delusive attacks aim to substantially deteriorate the test accuracy of the learning model by
slightly perturbing the features of correctly labeled training examples. By formalizing this …

The curse of overparametrization in adversarial training: Precise analysis of robust generalization for random features regression

H Hassani, A Javanmard - The Annals of Statistics, 2024 - projecteuclid.org
Published in The Annals of Statistics, 2024, Vol. 52, No. 2.

Precise statistical analysis of classification accuracies for adversarial training

A Javanmard, M Soltanolkotabi - The Annals of Statistics, 2022 - projecteuclid.org
Published in The Annals of Statistics, 2022, Vol. 50, No. 4, pp. 2127–2156. https://doi.org/10.1214/22-AOS2180

Adversarial robustness with semi-infinite constrained learning

A Robey, L Chamon, GJ Pappas… - Advances in …, 2021 - proceedings.neurips.cc
Despite strong performance in numerous applications, the fragility of deep learning to input
perturbations has raised serious questions about its use in safety-critical domains. While …

Probabilistically robust learning: Balancing average and worst-case performance

A Robey, L Chamon, GJ Pappas… - … on Machine Learning, 2022 - proceedings.mlr.press
Many of the successes of machine learning are based on minimizing an averaged loss
function. However, it is well-known that this paradigm suffers from robustness issues that …
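The contrast the abstract draws — average-case versus worst-case performance — can be illustrated numerically on a toy one-dimensional loss: the average risk over random perturbations, the worst observed risk, and a quantile-based risk that discards only the rarest bad perturbations sit in between. The quantile construction below is an illustration of that interpolation under stated assumptions, not the paper's exact formulation.

```python
import numpy as np

rng = np.random.default_rng(1)

def loss(x):
    # Toy per-input loss; any nonnegative loss works for the comparison.
    return x ** 2

x0 = 1.0
deltas = rng.uniform(-0.5, 0.5, 10_000)   # random perturbations of the input
losses = loss(x0 + deltas)

avg_risk = losses.mean()                  # average-case (ERM-style) risk
worst_risk = losses.max()                 # worst-case risk over the samples
prob_risk = np.quantile(losses, 0.95)     # ignore the worst 5% of perturbations
```

By construction the three risks are ordered: the 0.95-quantile risk sits between the mean and the maximum, which is the sense in which a probabilistic notion of robustness trades off between average and worst-case performance.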