Harmful fine-tuning attacks and defenses for large language models: A survey
Recent research demonstrates that the nascent fine-tuning-as-a-service business model exposes serious safety concerns: fine-tuning on a small number of harmful data points uploaded by users …
Lazy safety alignment for large language models against harmful fine-tuning
Recent studies show that Large Language Models (LLMs) with safety alignment can be jailbroken by fine-tuning on a dataset mixed with harmful data. For the first time in the literature, we …
Booster: Tackling harmful fine-tuning for large language models via attenuating harmful perturbation
The harmful fine-tuning issue (Qi et al., 2023) poses serious safety concerns for large language models' fine-tuning-as-a-service. While existing defenses …
Targeted vaccine: Safety alignment for large language models against harmful fine-tuning via layer-wise perturbation
The harmful fine-tuning attack poses a serious threat to online fine-tuning services. Vaccine, a recent alignment-stage defense, applies uniform perturbation to the embeddings of all layers to …
Backtracking improves generation safety
Text generation has a fundamental limitation almost by definition: there is no taking back
tokens that have been generated, even when they are clearly problematic. In the context of …
On evaluating the durability of safeguards for open-weight LLMs
Stakeholders, from model developers to policymakers, seek to minimize the dual-use risks of large language models (LLMs). An open challenge to this goal is whether technical …
Position: LLM unlearning benchmarks are weak measures of progress
Unlearning methods have the potential to improve the privacy and safety of large language
models (LLMs) by removing sensitive or harmful information post hoc. The LLM unlearning …
Meta-unlearning on diffusion models: Preventing relearning unlearned concepts
With the rapid progress of diffusion-based content generation, significant efforts are being
made to unlearn harmful or copyrighted concepts from pretrained diffusion models (DMs) to …
OML: Open, monetizable, and loyal AI
Artificial Intelligence (AI) has steadily improved across a wide range of tasks. However, the
development and deployment of AI are almost entirely controlled by a few powerful …
A Closer Look at Machine Unlearning for Large Language Models
Large language models (LLMs) may memorize sensitive or copyrighted content, raising
privacy and legal concerns. Due to the high cost of retraining from scratch, researchers …