Counterfactual explanations and how to find them: literature review and benchmarking
R Guidotti - Data Mining and Knowledge Discovery, 2024 - Springer
Interpretable machine learning aims at unveiling the reasons behind predictions returned by
uninterpretable classifiers. One of the most valuable types of explanation consists of …
Adversarial machine learning for network intrusion detection systems: A comprehensive survey
Network-based Intrusion Detection System (NIDS) forms the frontline defence against
network attacks that compromise the security of the data, systems, and networks. In recent …
Universal and transferable adversarial attacks on aligned language models
Because "out-of-the-box" large language models are capable of generating a great deal of
objectionable content, recent work has focused on aligning these models in an attempt to …
Glaze: Protecting artists from style mimicry by Text-to-Image models
Recent text-to-image diffusion models such as MidJourney and Stable Diffusion threaten to
displace many in the professional artist community. In particular, models can learn to mimic …
Certifying LLM safety against adversarial prompting
Large language models (LLMs) are vulnerable to adversarial attacks that add malicious
tokens to an input prompt to bypass the safety guardrails of an LLM and cause it to produce …
Visual adversarial examples jailbreak aligned large language models
Warning: this paper contains data, prompts, and model outputs that are offensive in nature.
Recently, there has been a surge of interest in integrating vision into Large Language …
Improving robustness using generated data
Recent work argues that robust training requires substantially larger datasets than those
required for standard classification. On CIFAR-10 and CIFAR-100, this translates into a …
Data augmentation can improve robustness
Adversarial training suffers from robust overfitting, a phenomenon where the robust test
accuracy starts to decrease during training. In this paper, we focus on reducing robust …
Lira: Learnable, imperceptible and robust backdoor attacks
Recently, machine learning models have been shown to be vulnerable to backdoor
attacks, primarily due to the lack of transparency in black-box models such as deep neural …
Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks
The field of defense strategies against adversarial attacks has grown significantly over
recent years, but progress is hampered as the evaluation of adversarial defenses is often …