ShieldLM: Empowering LLMs as aligned, customizable and explainable safety detectors

Z Zhang, Y Lu, J Ma, D Zhang, R Li, P Ke, H Sun… - arXiv preprint arXiv …, 2024 - arxiv.org
The safety of Large Language Models (LLMs) has gained increasing attention in recent
years, but there is still no comprehensive approach for detecting safety issues within …

Detoxifying large language models via knowledge editing

M Wang, N Zhang, Z Xu, Z Xi, S Deng, Y Yao… - arXiv preprint arXiv …, 2024 - arxiv.org
This paper investigates using knowledge editing techniques to detoxify Large Language
Models (LLMs). We construct a benchmark, SafeEdit, which covers nine unsafe categories …

High-dimension human value representation in large language models

S Cahyawijaya, D Chen, Y Bang, L Khalatbari… - arXiv preprint arXiv …, 2024 - arxiv.org
The widespread application of Large Language Models (LLMs) across various tasks and
fields has necessitated the alignment of these models with human values and preferences …

Human-Readable Adversarial Prompts: An Investigation into LLM Vulnerabilities Using Situational Context

N Das, E Raff, M Gaur - arXiv preprint arXiv:2412.16359, 2024 - arxiv.org
Previous research on LLM vulnerabilities often relied on nonsensical adversarial prompts,
which were easily detectable by automated methods. We address this gap by focusing on …

Human-Interpretable Adversarial Prompt Attack on Large Language Models with Situational Context

N Das, E Raff, M Gaur - arXiv preprint arXiv:2407.14644, 2024 - arxiv.org
Previous research on testing the vulnerabilities in Large Language Models (LLMs) using
adversarial attacks has primarily focused on nonsensical prompt injections, which are easily …

Detoxifying Large Language Models via Kahneman-Tversky Optimization

Q Li, W Du, J Liu - CCF International Conference on Natural Language …, 2024 - Springer
Currently, the application of Large Language Models (LLMs) faces significant security
threats. Harmful questions and adversarial attack prompts can induce LLMs to generate …