Scalable agent alignment via reward modeling: a research direction
One obstacle to applying reinforcement learning algorithms to real-world problems is the
lack of suitable reward functions. Designing such reward functions is difficult in part because …
Taxonomy of machine learning safety: A survey and primer
The open-world deployment of Machine Learning (ML) algorithms in safety-critical
applications such as autonomous vehicles needs to address a variety of ML vulnerabilities …
Certified adversarial robustness via randomized smoothing
We show how to turn any classifier that classifies well under Gaussian noise into a new
classifier that is certifiably robust to adversarial perturbations under the L2 norm. While this …
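The construction this abstract describes (randomized smoothing) can be sketched as follows. This is an illustrative reconstruction of the prediction step only, not the authors' code: `classifier`, `sigma`, and `n_samples` are assumed names, and the statistical certification of the L2 radius is omitted.

```python
import numpy as np

def smoothed_predict(classifier, x, sigma=0.25, n_samples=1000, rng=None):
    """Prediction of the smoothed classifier g(x) = argmax_c P(f(x + noise) = c).

    `classifier` is any function mapping an input array to a class index;
    the smoothed classifier takes a majority vote over Gaussian perturbations.
    """
    rng = np.random.default_rng(rng)
    counts = {}
    for _ in range(n_samples):
        noisy = x + rng.normal(0.0, sigma, size=x.shape)  # isotropic Gaussian noise
        c = classifier(noisy)
        counts[c] = counts.get(c, 0) + 1
    return max(counts, key=counts.get)  # majority class under noise
```

In the full method, the vote counts also feed a hypothesis test that yields a certified L2 radius around `x`; the sketch above returns only the majority-vote label.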
Adversarial glue: A multi-task benchmark for robustness evaluation of language models
Large-scale pre-trained language models have achieved tremendous success across a
wide range of natural language understanding (NLU) tasks, even surpassing human …
Provably robust deep learning via adversarially trained smoothed classifiers
Recent works have shown the effectiveness of randomized smoothing as a scalable
technique for building neural network-based classifiers that are provably robust to $\ell_2 …
Robustness may be at odds with accuracy
We show that there may exist an inherent tension between the goal of adversarial
robustness and that of standard generalization. Specifically, training robust models may not …
Certified robustness to adversarial examples with differential privacy
Adversarial examples that fool machine learning models, particularly deep neural networks,
have been a topic of intense research interest, with attacks and defenses being developed …
When does contrastive learning preserve adversarial robustness from pretraining to finetuning?
Contrastive learning (CL) can learn generalizable feature representations and achieve state-
of-the-art performance of downstream tasks by finetuning a linear classifier on top of it …
On the effectiveness of interval bound propagation for training verifiably robust models
Recent work has shown that it is possible to train deep neural networks that are provably
robust to norm-bounded adversarial perturbations. Most of these methods are based on …
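The core operation behind interval bound propagation can be sketched as follows; this is a minimal illustrative reconstruction, not the paper's implementation, and the function names are assumed. An input box [l, u] is pushed through an affine layer and a ReLU, giving sound elementwise bounds on the layer outputs.

```python
import numpy as np

def interval_affine(l, u, W, b):
    # For y = W x + b, split W into positive and negative parts so each
    # output bound uses the worst-case corner of the input box [l, u].
    W_pos, W_neg = np.maximum(W, 0.0), np.minimum(W, 0.0)
    lower = W_pos @ l + W_neg @ u + b
    upper = W_pos @ u + W_neg @ l + b
    return lower, upper

def interval_relu(l, u):
    # ReLU is monotone, so it maps interval endpoints directly.
    return np.maximum(l, 0.0), np.maximum(u, 0.0)
```

Chaining these per-layer bounds through a network yields provable (if loose) bounds on the logits, which training can then tighten.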
Semidefinite relaxations for certifying robustness to adversarial examples
Despite their impressive performance on diverse tasks, neural networks fail catastrophically
in the presence of adversarial inputs—imperceptibly but adversarially perturbed versions of …