Detoxifying Large Language Models via Knowledge Editing
This paper investigates using knowledge editing techniques to detoxify Large Language
Models (LLMs). We construct a benchmark, SafeEdit, which covers nine unsafe categories …
High-Dimension Human Value Representation in Large Language Models
The widespread application of Large Language Models (LLMs) across various tasks and
fields has necessitated the alignment of these models with human values and preferences …
Human-Readable Adversarial Prompts: An Investigation into LLM Vulnerabilities Using Situational Context
Previous research on LLM vulnerabilities often relied on nonsensical adversarial prompts,
which were easily detectable by automated methods. We address this gap by focusing on …
Human-Interpretable Adversarial Prompt Attack on Large Language Models with Situational Context
Previous research on testing the vulnerabilities in Large Language Models (LLMs) using
adversarial attacks has primarily focused on nonsensical prompt injections, which are easily …
Detoxifying Large Language Models via Kahneman-Tversky Optimization
Q Li, W Du, J Liu - CCF International Conference on Natural Language …, 2024 - Springer
Currently, the application of Large Language Models (LLMs) faces significant security
threats. Harmful questions and adversarial attack prompts can induce the LLMs to generate …