Automatic pseudo-harmful prompt generation for evaluating false refusals in large language models

B An, S Zhu, R Zhang, MA Panaitescu-Liess… - arXiv preprint arXiv …, 2024 - arxiv.org
Safety-aligned large language models (LLMs) sometimes falsely refuse pseudo-harmful
prompts, like "how to kill a mosquito," which are actually harmless. Frequent false refusals …

Robust LLM safeguarding via refusal feature adversarial training

L Yu, V Do, K Hambardzumyan… - arXiv preprint arXiv …, 2024 - arxiv.org
Large language models (LLMs) are vulnerable to adversarial attacks that can elicit harmful
responses. Defending against such attacks remains challenging due to the opacity of …

Programming refusal with conditional activation steering

BW Lee, I Padhi, KN Ramamurthy, E Miehling… - arXiv preprint arXiv …, 2024 - arxiv.org
LLMs have shown remarkable capabilities, but precisely controlling their response behavior
remains challenging. Existing activation steering methods alter LLM behavior …

OneGen: Efficient one-pass unified generation and retrieval for LLMs

J Zhang, C Peng, M Sun, X Chen, L Liang… - Findings of the …, 2024 - aclanthology.org
Despite the recent advancements in Large Language Models (LLMs), which have
significantly enhanced the generative capabilities for various NLP tasks, LLMs still face …

Sycophancy in Large Language Models: Causes and Mitigations

L Malmqvist - arXiv preprint arXiv:2411.15287, 2024 - arxiv.org
Large language models (LLMs) have demonstrated remarkable capabilities across a wide
range of natural language processing tasks. However, their tendency to exhibit sycophantic …

Evaluating the Prompt Steerability of Large Language Models

E Miehling, M Desmond, KN Ramamurthy… - arXiv preprint arXiv …, 2024 - arxiv.org
Building pluralistic AI requires designing models that are able to be shaped to represent a
wide range of value systems and cultures. Achieving this requires first being able to evaluate …

A Unified Understanding and Evaluation of Steering Methods

S Im, Y Li - arXiv preprint arXiv:2502.02716, 2025 - arxiv.org
Steering methods provide a practical approach to controlling large language models by
applying steering vectors to intermediate activations, guiding outputs toward desired …

Representation Tuning

CM Ackerman - arXiv preprint arXiv:2409.06927, 2024 - arxiv.org
Activation engineering is becoming increasingly popular as a means of online control of
large language models (LLMs). In this work, I extend the idea of active steering with vectors …