Mechanistic Interpretability for AI Safety--A Review

L Bereska, E Gavves - arXiv preprint arXiv:2404.14082, 2024 - arxiv.org
Understanding AI systems' inner workings is critical for ensuring value alignment and safety.
This review explores mechanistic interpretability: reverse engineering the computational …

Emergence of hidden capabilities: Exploring learning dynamics in concept space

CF Park, M Okawa, A Lee… - Advances in Neural …, 2025 - proceedings.neurips.cc
Modern generative models demonstrate impressive capabilities, likely stemming from an
ability to identify and manipulate abstract concepts underlying their training data. However …

A primer on the inner workings of transformer-based language models

J Ferrando, G Sarti, A Bisazza, M Costa-jussà - 2024 - research.rug.nl
The rapid progress of research aimed at interpreting the inner workings of advanced
language models has highlighted a need for contextualizing the insights gained from years …

Preference tuning for toxicity mitigation generalizes across languages

X Li, ZX Yong, SH Bach - arXiv preprint arXiv:2406.16235, 2024 - arxiv.org
Detoxifying multilingual Large Language Models (LLMs) has become crucial due to their
increasing global use. In this work, we explore zero-shot cross-lingual generalization of …

Automatic pseudo-harmful prompt generation for evaluating false refusals in large language models

B An, S Zhu, R Zhang, MA Panaitescu-Liess… - arXiv preprint arXiv …, 2024 - arxiv.org
Safety-aligned large language models (LLMs) sometimes falsely refuse pseudo-harmful
prompts, like "how to kill a mosquito," which are actually harmless. Frequent false refusals …

Latent adversarial training improves robustness to persistent harmful behaviors in LLMs

A Sheshadri, A Ewart, P Guo, A Lynch, C Wu… - arXiv preprint arXiv …, 2024 - arxiv.org
Large language models (LLMs) can often be made to behave in undesirable ways that they
are explicitly fine-tuned not to. For example, the LLM red-teaming literature has produced a …

What makes and breaks safety fine-tuning? A mechanistic study

S Jain, ES Lubana, K Oksuz, T Joy, PHS Torr… - arXiv preprint arXiv …, 2024 - arxiv.org
Safety fine-tuning helps align Large Language Models (LLMs) with human preferences for
their safe deployment. To better understand the underlying factors that make models safe via …

On evaluating the durability of safeguards for open-weight LLMs

X Qi, B Wei, N Carlini, Y Huang, T Xie, L He… - arXiv preprint arXiv …, 2024 - arxiv.org
Stakeholders--from model developers to policymakers--seek to minimize the dual-use risks
of large language models (LLMs). An open challenge to this goal is whether technical …

Robust LLM safeguarding via refusal feature adversarial training

L Yu, V Do, K Hambardzumyan… - arXiv preprint arXiv …, 2024 - arxiv.org
Large language models (LLMs) are vulnerable to adversarial attacks that can elicit harmful
responses. Defending against such attacks remains challenging due to the opacity of …

Understanding jailbreak success: A study of latent space dynamics in large language models

S Ball, F Kreuter, N Panickssery - arXiv preprint arXiv:2406.09289, 2024 - arxiv.org
Conversational large language models are trained to refuse to answer harmful questions.
However, emergent jailbreaking techniques can still elicit unsafe outputs, presenting an …