Harmful fine-tuning attacks and defenses for large language models: A survey

T Huang, S Hu, F Ilhan, SF Tekin, L Liu - arXiv preprint arXiv:2409.18169, 2024 - arxiv.org
Recent research demonstrates that the nascent fine-tuning-as-a-service business model
exposes serious safety concerns: fine-tuning on a few harmful data points uploaded by users …

Lazy safety alignment for large language models against harmful fine-tuning

T Huang, S Hu, F Ilhan, SF Tekin… - arXiv preprint arXiv …, 2024 - openreview.net
Recent studies show that Large Language Models (LLMs) with safety alignment can be
jailbroken by fine-tuning on a dataset mixed with harmful data. For the first time in the literature, we …

An Overview of Trustworthy AI: Advances in IP Protection, Privacy-preserving Federated Learning, Security Verification, and GAI Safety Alignment

Y Zheng, CH Chang, SH Huang… - IEEE Journal on …, 2024 - ieeexplore.ieee.org
AI has undergone a remarkable evolutionary journey marked by groundbreaking milestones.
Like any powerful tool, it can be turned into a weapon of devastation in the wrong hands …

Targeted vaccine: Safety alignment for large language models against harmful fine-tuning via layer-wise perturbation

G Liu, W Lin, T Huang, R Mo, Q Mu, L Shen - arXiv preprint arXiv …, 2024 - arxiv.org
Harmful fine-tuning attacks pose a serious threat to online fine-tuning services. Vaccine, a
recent alignment-stage defense, applies uniform perturbation to the embeddings of all layers to …
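
The snippet contrasts uniform perturbation with a layer-wise variant. Below is a minimal PyTorch sketch of the layer-wise idea under stated assumptions: the toy model, the choice of target layers, and the Gaussian noise with magnitude `epsilon` are all illustrative placeholders, not the paper's actual layer-selection criterion or perturbation scheme.

```python
# Minimal sketch: during alignment training, perturb the hidden states of
# only selected layers instead of all of them. The toy model, target-layer
# choice, and Gaussian noise are illustrative assumptions.
import torch
import torch.nn as nn

torch.manual_seed(0)

model = nn.Sequential(
    nn.Linear(16, 32), nn.ReLU(),
    nn.Linear(32, 32), nn.ReLU(),
    nn.Linear(32, 2),
)

epsilon = 0.05          # perturbation magnitude (hypothetical value)
target_layers = [0, 2]  # indices of "safety-critical" layers to perturb

def perturb_hook(module, inputs, output):
    if not module.training:
        return output
    # Inject bounded random drift into this layer's hidden state only.
    return output + epsilon * torch.randn_like(output)

for idx in target_layers:
    model[idx].register_forward_hook(perturb_hook)

# One alignment-style training step on toy data.
x, y = torch.randn(8, 16), torch.randint(0, 2, (8,))
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
loss = nn.functional.cross_entropy(model(x), y)
loss.backward()
optimizer.step()
print(f"alignment loss under layer-wise perturbation: {loss.item():.4f}")
```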

Programming refusal with conditional activation steering

BW Lee, I Padhi, KN Ramamurthy, E Miehling… - arXiv preprint arXiv …, 2024 - arxiv.org
LLMs have shown remarkable capabilities, but precisely controlling their response behavior
remains challenging. Existing activation steering methods alter LLM behavior …
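
A minimal sketch of conditional steering as the title suggests it: a steering vector is added to an activation only when a separate condition direction fires at that position. The vectors, threshold, and strength below are random placeholders; in the actual method they would be derived from model activations on contrastive prompts.

```python
# Minimal sketch of conditional activation steering: steer a hidden state
# only where a "condition" direction fires. All vectors and constants here
# are placeholder assumptions.
import torch

torch.manual_seed(0)
d_model = 64

condition_vec = torch.randn(d_model)
condition_vec /= condition_vec.norm()   # unit "condition" direction
steer_vec = torch.randn(d_model)        # direction that induces refusal
threshold = 0.0                         # hypothetical firing threshold
alpha = 4.0                             # steering strength

def conditionally_steer(hidden):
    """hidden: (seq_len, d_model). Steer only positions where the condition fires."""
    # Cosine similarity of each position's activation with the condition direction.
    score = (hidden @ condition_vec) / hidden.norm(dim=-1).clamp_min(1e-6)
    mask = (score > threshold).float().unsqueeze(-1)
    return hidden + alpha * mask * steer_vec

h = torch.randn(10, d_model)
h_steered = conditionally_steer(h)
print("positions steered:", int((h_steered != h).any(-1).sum()))
```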

Gradient routing: Masking gradients to localize computation in neural networks

A Cloud, J Goldman-Wetzler, E Wybitul, J Miller… - arXiv preprint arXiv …, 2024 - arxiv.org
Neural networks are trained primarily based on their inputs and outputs, without regard for
their internal mechanisms. These neglected mechanisms determine properties that are …
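
A minimal sketch of the gradient-masking idea in the title: the forward pass is unchanged, but a stop-gradient mask lets each data subset's gradient flow only through a designated slice of hidden units, localizing what that subset can train. The half-and-half split of units between two routes is an illustrative assumption.

```python
# Minimal sketch of gradient routing: identical forward values, but each
# data route's backward gradient is zeroed outside its designated slice
# of hidden units. The unit split is illustrative.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 2))
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)

def routed_forward(x, route):
    h = model[1](model[0](x))
    # Route 0 may only train units 0..7; route 1 may only train units 8..15.
    mask = torch.zeros_like(h)
    if route == 0:
        mask[:, :8] = 1.0
    else:
        mask[:, 8:] = 1.0
    # Same forward value; gradient is blocked outside the routed slice.
    h = h * mask + (h * (1 - mask)).detach()
    return model[2](h)

for route in (0, 1):
    x, y = torch.randn(4, 8), torch.randint(0, 2, (4,))
    loss = nn.functional.cross_entropy(routed_forward(x, route), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    print(f"route {route} loss: {loss.item():.4f}")
```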

Seal: Safety-enhanced aligned LLM fine-tuning via bilevel data selection

H Shen, PY Chen, P Das, T Chen - arXiv preprint arXiv:2410.07471, 2024 - arxiv.org
Fine-tuning on task-specific data to boost downstream performance is a crucial step for
leveraging Large Language Models (LLMs). However, previous studies have demonstrated …

Vaccine: Perturbation-aware alignment for large language models against harmful fine-tuning attack

T Huang, S Hu, L Liu - The Thirty-eighth Annual Conference on …, 2024 - openreview.net
The new paradigm of fine-tuning-as-a-service introduces a new attack surface for Large
Language Models (LLMs): a few harmful data points uploaded by users can easily trick the fine …
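
The described defense trains against worst-case embedding drift. Below is a minimal sketch, read as sharpness-aware training in embedding space: one gradient-ascent step finds an adversarial embedding perturbation of norm `rho`, then the outer step updates the model on the perturbed embeddings. Model, data, and `rho` are toy placeholders, not the paper's exact algorithm.

```python
# Minimal sketch of perturbation-aware alignment: an inner gradient-ascent
# step finds a worst-case embedding perturbation, and the outer step trains
# the model under that perturbation. All components are toy placeholders.
import torch
import torch.nn as nn

torch.manual_seed(0)
embed = nn.Embedding(100, 32)
head = nn.Sequential(nn.Linear(32, 32), nn.ReLU(), nn.Linear(32, 2))
params = list(embed.parameters()) + list(head.parameters())
optimizer = torch.optim.SGD(params, lr=1e-2)
rho = 0.1  # perturbation radius (hypothetical)

tokens = torch.randint(0, 100, (8,))
labels = torch.randint(0, 2, (8,))

# Inner step: gradient ascent on the embeddings to find worst-case drift.
e = embed(tokens).detach().requires_grad_(True)
inner_loss = nn.functional.cross_entropy(head(e), labels)
grad = torch.autograd.grad(inner_loss, e)[0]
delta = rho * grad / grad.norm().clamp_min(1e-12)

# Outer step: update the model so alignment holds under that drift.
outer_loss = nn.functional.cross_entropy(head(embed(tokens) + delta), labels)
optimizer.zero_grad()
outer_loss.backward()
optimizer.step()
print(f"aligned loss under worst-case drift: {outer_loss.item():.4f}")
```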

Separate the Wheat from the Chaff: A Post-Hoc Approach to Safety Re-Alignment for Fine-Tuned Language Models

D Wu, X Lu, Y Zhao, B Qin - arXiv preprint arXiv:2412.11041, 2024 - arxiv.org
Although large language models (LLMs) achieve effective safety alignment at the time of
release, they still face various safety challenges. A key issue is that fine-tuning often …

SaLoRA: Safety-Alignment Preserved Low-Rank Adaptation

M Li, WM Si, M Backes, Y Zhang, Y Wang - arXiv preprint arXiv …, 2025 - arxiv.org
As advancements in large language models (LLMs) continue and the demand for
personalized models increases, parameter-efficient fine-tuning (PEFT) methods (e.g., LoRA) …
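
For context, a minimal LoRA adapter, plus a hypothetical projector that keeps the low-rank update orthogonal to one designated "safety-critical" direction. Plain LoRA is the standard technique; the projection is only one way to read "safety-alignment preserved" and is not claimed to be SaLoRA's actual construction.

```python
# Minimal sketch: standard LoRA adapter with an optional projector that
# removes a "safety-critical" direction from the low-rank update. The
# projection is an illustrative assumption, not SaLoRA's exact method.
import torch
import torch.nn as nn

torch.manual_seed(0)

class LoRALinear(nn.Module):
    def __init__(self, d_in, d_out, rank=4, alpha=8.0, safety_dir=None):
        super().__init__()
        self.base = nn.Linear(d_in, d_out)
        for p in self.base.parameters():   # frozen "pretrained" weights
            p.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(rank, d_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(d_out, rank))  # zero init: no drift at start
        self.scale = alpha / rank
        if safety_dir is not None:
            u = safety_dir / safety_dir.norm()
            # Projector onto the orthogonal complement of the safety direction.
            self.register_buffer("proj", torch.eye(d_out) - torch.outer(u, u))
        else:
            self.proj = None

    def forward(self, x):
        delta = (x @ self.A.T) @ self.B.T * self.scale
        if self.proj is not None:
            delta = delta @ self.proj  # keep the update off the safety direction
        return self.base(x) + delta

layer = LoRALinear(16, 16, safety_dir=torch.randn(16))
print(layer(torch.randn(4, 16)).shape)  # torch.Size([4, 16])
```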