Fine-tuning aligned language models compromises safety, even when users do not intend to!

X Qi, Y Zeng, T Xie, PY Chen, R Jia, P Mittal… - arXiv preprint arXiv …, 2023 - arxiv.org
Optimizing large language models (LLMs) for downstream use cases often involves the
customization of pre-trained LLMs through further fine-tuning. Meta's open release of Llama …

Foundational challenges in assuring alignment and safety of large language models

U Anwar, A Saparov, J Rando, D Paleka… - arXiv preprint arXiv …, 2024 - arxiv.org
This work identifies 18 foundational challenges in assuring the alignment and safety of large
language models (LLMs). These challenges are organized into three different categories …

Refusal in language models is mediated by a single direction

A Arditi, O Obeso, A Syed, D Paleka… - arXiv preprint arXiv …, 2024 - arxiv.org
Conversational large language models are fine-tuned for both instruction-following and
safety, resulting in models that obey benign requests but refuse harmful ones. While this …

Muse: Machine unlearning six-way evaluation for language models

W Shi, J Lee, Y Huang, S Malladi, J Zhao… - arXiv preprint arXiv …, 2024 - arxiv.org
Language models (LMs) are trained on vast amounts of text data, which may include private
and copyrighted content. Data owners may request the removal of their data from a trained …

Defending against unforeseen failure modes with latent adversarial training

S Casper, L Schulze, O Patel… - arXiv preprint arXiv …, 2024 - arxiv.org
Despite extensive diagnostics and debugging by developers, AI systems sometimes exhibit
harmful unintended behaviors. Finding and fixing these is challenging because the attack …

An adversarial perspective on machine unlearning for ai safety

J Łucki, B Wei, Y Huang, P Henderson… - arXiv preprint arXiv …, 2024 - arxiv.org
Large language models are finetuned to refuse questions about hazardous knowledge, but
these protections can often be bypassed. Unlearning methods aim at completely removing …

Safety alignment should be made more than just a few tokens deep

X Qi, A Panda, K Lyu, X Ma, S Roy, A Beirami… - arXiv preprint arXiv …, 2024 - arxiv.org
The safety alignment of current Large Language Models (LLMs) is vulnerable. Relatively
simple attacks, or even benign fine-tuning, can jailbreak aligned models. We argue that …

Against The Achilles' Heel: A Survey on Red Teaming for Generative Models

L Lin, H Mu, Z Zhai, M Wang, Y Wang, R Wang… - Journal of Artificial …, 2025 - jair.org
Generative models are rapidly gaining popularity and being integrated into everyday
applications, raising concerns over their safe use as various vulnerabilities are exposed. In …

Evaluating copyright takedown methods for language models

B Wei, W Shi, Y Huang, NA Smith, C Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
Language models (LMs) derive their capabilities from extensive training on diverse data,
including potentially copyrighted material. These models can memorize and generate …

Decoding compressed trust: Scrutinizing the trustworthiness of efficient llms under compression

J Hong, J Duan, C Zhang, Z Li, C Xie… - arXiv preprint arXiv …, 2024 - arxiv.org
Compressing high-capability Large Language Models (LLMs) has emerged as a favored
strategy for resource-efficient inference. While state-of-the-art (SoTA) compression methods …