- Academic Search

Automatic pseudo-harmful prompt generation for evaluating false refusals in large language models

B An, S Zhu, R Zhang, MA Panaitescu-Liess… - arXiv preprint arXiv …, 2024 - arxiv.org

Safety-aligned large language models (LLMs) sometimes falsely refuse pseudo-harmful
prompts, like "how to kill a mosquito," which are actually harmless. Frequent false refusals …

Cited by 8

Robust LLM safeguarding via refusal feature adversarial training

L Yu, V Do, K Hambardzumyan… - arXiv preprint arXiv …, 2024 - arxiv.org

Large language models (LLMs) are vulnerable to adversarial attacks that can elicit harmful
responses. Defending against such attacks remains challenging due to the opacity of …

Cited by 4

Programming refusal with conditional activation steering

BW Lee, I Padhi, KN Ramamurthy, E Miehling… - arXiv preprint arXiv …, 2024 - arxiv.org

LLMs have shown remarkable capabilities, but precisely controlling their response behavior
remains challenging. Existing activation steering methods alter LLM behavior …

Cited by 4

OneGen: Efficient one-pass unified generation and retrieval for LLMs

J Zhang, C Peng, M Sun, X Chen, L Liang… - Findings of the …, 2024 - aclanthology.org

Despite the recent advancements in Large Language Models (LLMs), which have
significantly enhanced the generative capabilities for various NLP tasks, LLMs still face …

Cited by 1

Sycophancy in Large Language Models: Causes and Mitigations

L Malmqvist - arXiv preprint arXiv:2411.15287, 2024 - arxiv.org

Large language models (LLMs) have demonstrated remarkable capabilities across a wide
range of natural language processing tasks. However, their tendency to exhibit sycophantic …

Cited by 1

Evaluating the Prompt Steerability of Large Language Models

E Miehling, M Desmond, KN Ramamurthy… - arXiv preprint arXiv …, 2024 - arxiv.org

Building pluralistic AI requires designing models that are able to be shaped to represent a
wide range of value systems and cultures. Achieving this requires first being able to evaluate …

A Unified Understanding and Evaluation of Steering Methods

S Im, Y Li - arXiv preprint arXiv:2502.02716, 2025 - arxiv.org

Steering methods provide a practical approach to controlling large language models by
applying steering vectors to intermediate activations, guiding outputs toward desired …

Representation Tuning

CM Ackerman - arXiv preprint arXiv:2409.06927, 2024 - arxiv.org

Activation engineering is becoming increasingly popular as a means of online control of
large language models (LLMs). In this work, I extend the idea of active steering with vectors …

Steering without side effects: Improving post-deployment control of language models

Automatic pseudo-harmful prompt generation for evaluating false refusals in large language models
