An Adversarial Perspective on Machine Unlearning for AI Safety

J Łucki, B Wei, Y Huang, P Henderson… - arXiv preprint arXiv …, 2024 - arxiv.org
Large language models are finetuned to refuse questions about hazardous knowledge, but
these protections can often be bypassed. Unlearning methods aim at completely removing …

Adversarial ML Problems Are Getting Harder to Solve and to Evaluate

J Rando, J Zhang, N Carlini, F Tramèr - arXiv preprint arXiv:2502.02260, 2025 - arxiv.org
In the past decade, considerable research effort has been devoted to securing machine
learning (ML) models that operate in adversarial settings. Yet, progress has been slow even …

CopyrightMeter: Revisiting Copyright Protection in Text-to-image Models

N Xu, C Li, T Du, M Li, W Luo, J Liang, Y Li… - arXiv preprint arXiv …, 2024 - arxiv.org
Text-to-image diffusion models have emerged as powerful tools for generating high-quality
images from textual descriptions. However, their increasing popularity has raised significant …

AdvI2I: Adversarial Image Attack on Image-to-Image Diffusion models

Y Zeng, Y Cao, B Cao, Y Chang, J Chen… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent advances in diffusion models have significantly enhanced the quality of image
synthesis, yet they have also introduced serious safety concerns, particularly the generation …

Revisiting the Robust Alignment of Circuit Breakers

L Schwinn, S Geisler - arXiv preprint arXiv:2407.15902, 2024 - arxiv.org
Over the past decade, adversarial training has emerged as one of the few reliable methods
for enhancing model robustness against adversarial attacks [Szegedy et al., 2014, Madry et …

SongBsAb: A Dual Prevention Approach against Singing Voice Conversion based Illegal Song Covers

G Chen, Y Zhang - CoRR, vol. abs/2401.17133, 2024 - researchgate.net
Singing voice conversion (SVC) automates song covers by converting a source singing
voice from a source singer into a new singing voice with the same lyrics and melody as the …

Certifiable AI Security against Localized Corruption Attacks

C Xiang - 2025 - search.proquest.com
Building secure and robust AI models has proven to be difficult. Nearly all defenses,
including those published at top-tier venues and recognized with prestigious awards, can be …

AdvPaint: Protecting Images from Inpainting Manipulation via Adversarial Attention Disruption

J Jeon, WJ Kim, S Ha, S Son, S Yoon - The Thirteenth International … - openreview.net
The outstanding capability of diffusion models in generating high-quality images poses
significant threats when misused by adversaries. In particular, we assume malicious …