Fine-tuning aligned language models compromises safety, even when users do not intend to!
Optimizing large language models (LLMs) for downstream use cases often involves the
customization of pre-trained LLMs through further fine-tuning. Meta's open release of Llama …
Foundational challenges in assuring alignment and safety of large language models
This work identifies 18 foundational challenges in assuring the alignment and safety of large
language models (LLMs). These challenges are organized into three different categories …
Refusal in language models is mediated by a single direction
Conversational large language models are fine-tuned for both instruction-following and
safety, resulting in models that obey benign requests but refuse harmful ones. While this …
MUSE: Machine unlearning six-way evaluation for language models
Language models (LMs) are trained on vast amounts of text data, which may include private
and copyrighted content. Data owners may request the removal of their data from a trained …
Defending against unforeseen failure modes with latent adversarial training
Despite extensive diagnostics and debugging by developers, AI systems sometimes exhibit
harmful unintended behaviors. Finding and fixing these is challenging because the attack …
An adversarial perspective on machine unlearning for AI safety
Large language models are finetuned to refuse questions about hazardous knowledge, but
these protections can often be bypassed. Unlearning methods aim at completely removing …
Safety alignment should be made more than just a few tokens deep
The safety alignment of current Large Language Models (LLMs) is vulnerable. Relatively
simple attacks, or even benign fine-tuning, can jailbreak aligned models. We argue that …
Against The Achilles' Heel: A Survey on Red Teaming for Generative Models
Generative models are rapidly gaining popularity and being integrated into everyday
applications, raising concerns over their safe use as various vulnerabilities are exposed. In …
Evaluating copyright takedown methods for language models
Language models (LMs) derive their capabilities from extensive training on diverse data,
including potentially copyrighted material. These models can memorize and generate …
Decoding compressed trust: Scrutinizing the trustworthiness of efficient LLMs under compression
Compressing high-capability Large Language Models (LLMs) has emerged as a favored
strategy for resource-efficient inferences. While state-of-the-art (SoTA) compression methods …