Automatic pseudo-harmful prompt generation for evaluating false refusals in large language models
Safety-aligned large language models (LLMs) sometimes falsely refuse pseudo-harmful
prompts, like "how to kill a mosquito," which are actually harmless. Frequent false refusals …
Robust LLM safeguarding via refusal feature adversarial training
Large language models (LLMs) are vulnerable to adversarial attacks that can elicit harmful
responses. Defending against such attacks remains challenging due to the opacity of …
Programming refusal with conditional activation steering
LLMs have shown remarkable capabilities, but precisely controlling their response behavior
remains challenging. Existing activation steering methods alter LLM behavior …
OneGen: Efficient one-pass unified generation and retrieval for LLMs
Despite the recent advancements in Large Language Models (LLMs), which have
significantly enhanced the generative capabilities for various NLP tasks, LLMs still face …
Sycophancy in Large Language Models: Causes and Mitigations
L Malmqvist - arXiv preprint arXiv:2411.15287, 2024 - arxiv.org
Large language models (LLMs) have demonstrated remarkable capabilities across a wide
range of natural language processing tasks. However, their tendency to exhibit sycophantic …
Evaluating the Prompt Steerability of Large Language Models
Building pluralistic AI requires designing models that are able to be shaped to represent a
wide range of value systems and cultures. Achieving this requires first being able to evaluate …
A Unified Understanding and Evaluation of Steering Methods
S Im, Y Li - arXiv preprint arXiv:2502.02716, 2025 - arxiv.org
Steering methods provide a practical approach to controlling large language models by
applying steering vectors to intermediate activations, guiding outputs toward desired …
Representation Tuning
CM Ackerman - arXiv preprint arXiv:2409.06927, 2024 - arxiv.org
Activation engineering is becoming increasingly popular as a means of online control of
large language models (LLMs). In this work, I extend the idea of active steering with vectors …