Large language models and causal inference in collaboration: A comprehensive survey

X Liu, P Xu, J Wu, J Yuan, Y Yang, Y Zhou, F Liu… - arXiv preprint arXiv …, 2024 - arxiv.org
Causal inference has shown potential in enhancing the predictive accuracy, fairness,
robustness, and explainability of Natural Language Processing (NLP) models by capturing …

Foundational challenges in assuring alignment and safety of large language models

U Anwar, A Saparov, J Rando, D Paleka… - arXiv preprint arXiv …, 2024 - arxiv.org
This work identifies 18 foundational challenges in assuring the alignment and safety of large
language models (LLMs). These challenges are organized into three different categories …

Interpretability at scale: Identifying causal mechanisms in alpaca

Z Wu, A Geiger, T Icard, C Potts… - Advances in Neural …, 2023 - proceedings.neurips.cc
Obtaining human-interpretable explanations of large, general-purpose language models is
an urgent goal for AI safety. However, it is just as important that our interpretability methods …
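
The analysis behind this paper tests whether a hypothesized causal variable is encoded in a learned subspace of the model's hidden states by swapping that subspace's value between two forward passes (an interchange intervention). The sketch below shows only that core swap on toy tensors, with a random orthogonal rotation standing in for the learned one; all shapes and names are illustrative assumptions rather than the paper's actual setup.

# Minimal sketch of a distributed interchange intervention. Toy tensors only;
# in practice the rotation is learned and the edited vector re-enters a
# language model's forward pass at one layer and token position.
import torch

def interchange_intervention(h_base, h_source, R, k):
    """Transplant the first k coordinates of a rotated basis from source to base."""
    z_base, z_source = R @ h_base, R @ h_source
    z_base[:k] = z_source[:k]          # swap the subspace value
    return R.T @ z_base                # rotate back to model coordinates

d, k = 16, 4
R, _ = torch.linalg.qr(torch.randn(d, d))     # random orthogonal rotation (stand-in for a learned one)
h_base, h_source = torch.randn(d), torch.randn(d)
h_patched = interchange_intervention(h_base, h_source, R, k)
# Continuing the forward pass from h_patched and checking the output tests
# whether the chosen subspace carries the hypothesized causal variable.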

Refusal in language models is mediated by a single direction

A Arditi, O Obeso, A Syed, D Paleka… - arXiv preprint arXiv …, 2024 - arxiv.org
Conversational large language models are fine-tuned for both instruction-following and
safety, resulting in models that obey benign requests but refuse harmful ones. While this …
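
The finding in the title can be stated compactly in code: a candidate refusal direction is the unit-normalized difference of mean activations on harmful versus harmless instructions, and removing the component along that direction is the kind of edit the paper uses to study (and bypass) refusal. The sketch below uses random placeholder activations; in the paper they come from the model's residual stream.

# Difference-of-means refusal direction and a projection that ablates it.
# The stacked activations here are placeholders, not real model activations.
import torch

def refusal_direction(acts_harmful, acts_harmless):
    """Unit-normalized difference of mean activations; inputs are (n, d) tensors."""
    r = acts_harmful.mean(dim=0) - acts_harmless.mean(dim=0)
    return r / r.norm()

def ablate_direction(h, r):
    """Remove the component of activation(s) h along direction r."""
    return h - (h @ r).unsqueeze(-1) * r

d = 64
acts_harmful, acts_harmless = torch.randn(100, d) + 0.5, torch.randn(100, d)
r = refusal_direction(acts_harmful, acts_harmless)
h = torch.randn(8, d)                 # activations to edit during generation
h_edited = ablate_direction(h, r)     # applied across layers/positions in the paper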

Learning transformer programs

D Friedman, A Wettig, D Chen - Advances in Neural …, 2024 - proceedings.neurips.cc
Recent research in mechanistic interpretability has attempted to reverse-engineer
Transformer models by carefully inspecting network weights and activations. However, these …
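
The programs extracted by this approach are ordinary Python built from a small set of discrete primitives, roughly a hard-attention "select" paired with an "aggregate". The toy program below is only meant to convey that flavor (it counts, for each position, how many earlier tokens match the current one); it is not output from the paper's pipeline and simplifies the primitives.

# Illustrative select/aggregate-style program, not generated by the method.
def select(keys, queries, predicate):
    """Boolean attention pattern: pattern[q][k] = predicate(keys[k], queries[q])."""
    return [[predicate(k, q) for k in keys] for q in queries]

def aggregate(pattern, values):
    """For each query, combine the selected values (here by summing; a simplification)."""
    return [sum(v for v, sel in zip(values, row) if sel) for row in pattern]

tokens = list("abacaba")
# attend from each position to strictly earlier positions holding the same token
same_and_earlier = select(
    keys=list(enumerate(tokens)),
    queries=list(enumerate(tokens)),
    predicate=lambda k, q: k[1] == q[1] and k[0] < q[0],
)
prior_count = aggregate(same_and_earlier, [1] * len(tokens))
print(prior_count)  # [0, 0, 1, 0, 2, 1, 3]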

Mechanistic Interpretability for AI Safety--A Review

L Bereska, E Gavves - arXiv preprint arXiv:2404.14082, 2024 - arxiv.org
Understanding AI systems' inner workings is critical for ensuring value alignment and safety.
This review explores mechanistic interpretability: reverse-engineering the computational …

Towards best practices of activation patching in language models: Metrics and methods

F Zhang, N Nanda - arXiv preprint arXiv:2309.16042, 2023 - arxiv.org
Mechanistic interpretability seeks to understand the internal mechanisms of machine
learning models, where localization--identifying the important model components--is a key …
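
Activation patching, the localization technique whose metrics and corruption choices this paper examines, replaces an internal activation in one run with the corresponding activation cached from another run and measures how the output changes. A minimal sketch with plain PyTorch forward hooks on GPT-2 follows; the prompts, the patched layer, the final-position-only patch, and the logit-difference metric are illustrative choices, not the paper's recommended settings.

# Patch the residual stream after one GPT-2 block (final position only) from a
# clean run into a corrupted run, then read off a logit difference.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

clean = tok("The Eiffel Tower is located in the city of", return_tensors="pt")
corrupt = tok("The Colosseum is located in the city of", return_tensors="pt")

LAYER = 6          # illustrative choice of block to patch
cache = {}

def save_hook(module, inputs, output):
    cache["resid"] = output[0].detach()            # block output is a tuple

def patch_hook(module, inputs, output):
    hidden = output[0].clone()
    hidden[:, -1, :] = cache["resid"][:, -1, :]    # patch the final position only
    return (hidden,) + output[1:]

with torch.no_grad():
    # 1) clean run: cache the residual stream after block LAYER
    handle = model.transformer.h[LAYER].register_forward_hook(save_hook)
    model(**clean)
    handle.remove()

    # 2) corrupted run with the clean activation patched back in
    handle = model.transformer.h[LAYER].register_forward_hook(patch_hook)
    patched_logits = model(**corrupt).logits[0, -1]
    handle.remove()

paris = tok(" Paris")["input_ids"][0]
rome = tok(" Rome")["input_ids"][0]
print("patched logit diff (Paris - Rome):",
      (patched_logits[paris] - patched_logits[rome]).item())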

Rigorously assessing natural language explanations of neurons

J Huang, A Geiger, K D'Oosterlinck, Z Wu… - arXiv preprint arXiv …, 2023 - arxiv.org
Natural language is an appealing medium for explaining how large language models
process and store information, but evaluating the faithfulness of such explanations is …
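
One way to make such an evaluation concrete is to treat the explanation as a predictor of when the neuron fires and score it observationally. The schematic sketch below does only that; neuron_activation is a hypothetical stand-in for reading one neuron's value out of a model, and the threshold-based precision/recall scoring is an illustrative simplification rather than the paper's exact metrics.

# Score an explanation by how well it predicts the neuron's firing behavior.
def evaluate_explanation(predicted_positive, predicted_negative,
                         neuron_activation, threshold=0.5):
    """Precision/recall of the explanation treated as a firing predictor."""
    tp = sum(neuron_activation(x) > threshold for x in predicted_positive)
    fp = sum(neuron_activation(x) > threshold for x in predicted_negative)
    fn = len(predicted_positive) - tp
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Dummy neuron that fires on digits, checked against the hypothetical
# explanation "activates on text containing a year".
fake_neuron = lambda s: 1.0 if any(c.isdigit() for c in s) else 0.0
print(evaluate_explanation(["born in 1987", "the year 2020"],
                           ["a sunny day", "born in Paris"], fake_neuron))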

ReFT: Representation finetuning for language models

Z Wu, A Arora, Z Wang, A Geiger, D Jurafsky… - arXiv preprint arXiv …, 2024 - arxiv.org
Parameter-efficient fine-tuning (PEFT) methods seek to adapt large models via updates to a
small number of weights. However, much prior interpretability work has shown that …
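
The intervention at the heart of this approach (LoReFT) edits hidden representations rather than weights: h' = h + R^T(Wh + b - Rh), with a low-rank R whose rows span the edited subspace. A minimal module implementing that formula is sketched below; the dimensions are toy values, R is only initialized (not constrained) to be orthonormal, and the wiring into a frozen language model's forward pass is omitted.

# Low-rank representation intervention applied to hidden states from a frozen model.
import torch
import torch.nn as nn

class LoReFTIntervention(nn.Module):
    def __init__(self, d_model: int, rank: int):
        super().__init__()
        self.R = nn.Parameter(torch.empty(rank, d_model))   # low-rank projection (orthonormal rows in the paper)
        nn.init.orthogonal_(self.R)
        self.W = nn.Linear(d_model, rank)                   # learned source of the edit (includes bias b)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # Edit h only inside the rank-dimensional subspace spanned by R's rows.
        return h + (self.W(h) - h @ self.R.T) @ self.R

d_model, rank = 768, 4
intervene = LoReFTIntervention(d_model, rank)
h = torch.randn(2, 10, d_model)       # (batch, positions, hidden) from a frozen model
h_edited = intervene(h)               # only these few parameters would be trained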

Localizing model behavior with path patching

N Goldowsky-Dill, C MacLeod, L Sato… - arXiv preprint arXiv …, 2023 - arxiv.org
Localizing behaviors of neural networks to a subset of the network's components or a subset
of interactions between components is a natural first step towards analyzing network …
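
Path patching refines activation patching by asking how one component affects a downstream quantity along a specific path: the output is recomputed with only the sender's direct contribution swapped in from a second run, everything else held at its original value. The toy additive model below is invented purely to illustrate that move; real path patching applies the same idea to attention heads and MLPs writing into a transformer's residual stream.

# Toy illustration: two components write additively into a residual stream;
# patch only component A's contribution along the direct path to the logits.
import torch

torch.manual_seed(0)
d = 8
W_a, W_b, W_out = torch.randn(d, d), torch.randn(d, d), torch.randn(d, 2)

def components(x):
    """Two 'heads' that each write additively into the residual stream."""
    return W_a @ x, W_b @ x

def readout(a_out, b_out, x):
    return W_out.T @ (x + a_out + b_out)          # logits from the residual stream

x_clean, x_corrupt = torch.randn(d), torch.randn(d)
a_clean, b_clean = components(x_clean)
a_corrupt, _ = components(x_corrupt)

logits_clean = readout(a_clean, b_clean, x_clean)
# Path patch A -> logits: only A's direct contribution comes from the corrupt run.
logits_path_patched = readout(a_corrupt, b_clean, x_clean)
print("effect of A via the direct path:", logits_path_patched - logits_clean)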