Large language models and causal inference in collaboration: A comprehensive survey

X Liu, P Xu, J Wu, J Yuan, Y Yang, Y Zhou, F Liu… - arXiv preprint arXiv …, 2024 - arxiv.org
Causal inference has shown potential in enhancing the predictive accuracy, fairness,
robustness, and explainability of Natural Language Processing (NLP) models by capturing …

Foundational challenges in assuring alignment and safety of large language models

U Anwar, A Saparov, J Rando, D Paleka… - arXiv preprint arXiv …, 2024 - arxiv.org
This work identifies 18 foundational challenges in assuring the alignment and safety of large
language models (LLMs). These challenges are organized into three different categories …

Interpretability at scale: Identifying causal mechanisms in alpaca

Z Wu, A Geiger, T Icard, C Potts… - Advances in Neural …, 2023 - proceedings.neurips.cc
Obtaining human-interpretable explanations of large, general-purpose language models is
an urgent goal for AI safety. However, it is just as important that our interpretability methods …
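
The analysis behind this paper tests whether a hypothesized causal variable is encoded in a learned subspace of the model's hidden states by swapping that subspace's value between two forward passes (an interchange intervention). The sketch below shows only that core swap on toy tensors, with a random orthogonal rotation standing in for the learned one; all shapes and names are illustrative assumptions rather than the paper's actual setup.

# Minimal sketch of a distributed interchange intervention. Toy tensors only;
# in practice the rotation is learned and the edited vector re-enters a
# language model's forward pass at one layer and token position.
import torch

def interchange_intervention(h_base, h_source, R, k):
    """Transplant the first k coordinates of a rotated basis from source to base."""
    z_base, z_source = R @ h_base, R @ h_source
    z_base[:k] = z_source[:k]          # swap the subspace value
    return R.T @ z_base                # rotate back to model coordinates

d, k = 16, 4
R, _ = torch.linalg.qr(torch.randn(d, d))     # random orthogonal rotation (stand-in for a learned one)
h_base, h_source = torch.randn(d), torch.randn(d)
h_patched = interchange_intervention(h_base, h_source, R, k)
# Continuing the forward pass from h_patched and checking the output tests
# whether the chosen subspace carries the hypothesized causal variable.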

Refusal in language models is mediated by a single direction

A Arditi, O Obeso, A Syed, D Paleka… - arXiv preprint arXiv …, 2024 - arxiv.org
Conversational large language models are fine-tuned for both instruction-following and
safety, resulting in models that obey benign requests but refuse harmful ones. While this …
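
The finding in the title can be stated compactly in code: a candidate refusal direction is the unit-normalized difference of mean activations on harmful versus harmless instructions, and removing the component along that direction is the kind of edit the paper uses to study (and bypass) refusal. The sketch below uses random placeholder activations; in the paper they come from the model's residual stream.

# Difference-of-means refusal direction and a projection that ablates it.
# The stacked activations here are placeholders, not real model activations.
import torch

def refusal_direction(acts_harmful, acts_harmless):
    """Unit-normalized difference of mean activations; inputs are (n, d) tensors."""
    r = acts_harmful.mean(dim=0) - acts_harmless.mean(dim=0)
    return r / r.norm()

def ablate_direction(h, r):
    """Remove the component of activation(s) h along direction r."""
    return h - (h @ r).unsqueeze(-1) * r

d = 64
acts_harmful, acts_harmless = torch.randn(100, d) + 0.5, torch.randn(100, d)
r = refusal_direction(acts_harmful, acts_harmless)
h = torch.randn(8, d)                 # activations to edit during generation
h_edited = ablate_direction(h, r)     # applied across layers/positions in the paper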

Learning transformer programs

D Friedman, A Wettig, D Chen - Advances in Neural …, 2024 - proceedings.neurips.cc
Recent research in mechanistic interpretability has attempted to reverse-engineer
Transformer models by carefully inspecting network weights and activations. However, these …
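
The programs extracted by this approach are ordinary Python built from a small set of discrete primitives, roughly a hard-attention "select" paired with an "aggregate". The toy program below is only meant to convey that flavor (it counts, for each position, how many earlier tokens match the current one); it is not output from the paper's pipeline and simplifies the primitives.

# Illustrative select/aggregate-style program, not generated by the method.
def select(keys, queries, predicate):
    """Boolean attention pattern: pattern[q][k] = predicate(keys[k], queries[q])."""
    return [[predicate(k, q) for k in keys] for q in queries]

def aggregate(pattern, values):
    """For each query, combine the selected values (here by summing; a simplification)."""
    return [sum(v for v, sel in zip(values, row) if sel) for row in pattern]

tokens = list("abacaba")
# attend from each position to strictly earlier positions holding the same token
same_and_earlier = select(
    keys=list(enumerate(tokens)),
    queries=list(enumerate(tokens)),
    predicate=lambda k, q: k[1] == q[1] and k[0] < q[0],
)
prior_count = aggregate(same_and_earlier, [1] * len(tokens))
print(prior_count)  # [0, 0, 1, 0, 2, 1, 3]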

Mechanistic Interpretability for AI Safety--A Review

L Bereska, E Gavves - arXiv preprint arXiv:2404.14082, 2024 - arxiv.org
Understanding AI systems' inner workings is critical for ensuring value alignment and safety.
This review explores mechanistic interpretability: reverse-engineering the computational …

Towards best practices of activation patching in language models: Metrics and methods

F Zhang, N Nanda - arXiv preprint arXiv:2309.16042, 2023 - arxiv.org
Mechanistic interpretability seeks to understand the internal mechanisms of machine
learning models, where localization--identifying the important model components--is a key …
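
Activation patching, the localization technique whose metrics and corruption choices this paper examines, replaces an internal activation in one run with the corresponding activation cached from another run and measures how the output changes. A minimal sketch with plain PyTorch forward hooks on GPT-2 follows; the prompts, the patched layer, the final-position-only patch, and the logit-difference metric are illustrative choices, not the paper's recommended settings.

# Patch the residual stream after one GPT-2 block (final position only) from a
# clean run into a corrupted run, then read off a logit difference.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

clean = tok("The Eiffel Tower is located in the city of", return_tensors="pt")
corrupt = tok("The Colosseum is located in the city of", return_tensors="pt")

LAYER = 6          # illustrative choice of block to patch
cache = {}

def save_hook(module, inputs, output):
    cache["resid"] = output[0].detach()            # block output is a tuple

def patch_hook(module, inputs, output):
    hidden = output[0].clone()
    hidden[:, -1, :] = cache["resid"][:, -1, :]    # patch the final position only
    return (hidden,) + output[1:]

with torch.no_grad():
    # 1) clean run: cache the residual stream after block LAYER
    handle = model.transformer.h[LAYER].register_forward_hook(save_hook)
    model(**clean)
    handle.remove()

    # 2) corrupted run with the clean activation patched back in
    handle = model.transformer.h[LAYER].register_forward_hook(patch_hook)
    patched_logits = model(**corrupt).logits[0, -1]
    handle.remove()

paris = tok(" Paris")["input_ids"][0]
rome = tok(" Rome")["input_ids"][0]
print("patched logit diff (Paris - Rome):",
      (patched_logits[paris] - patched_logits[rome]).item())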

Rigorously assessing natural language explanations of neurons

J Huang, A Geiger, K D'Oosterlinck, Z Wu… - arXiv preprint arXiv …, 2023 - arxiv.org
Natural language is an appealing medium for explaining how large language models
process and store information, but evaluating the faithfulness of such explanations is …
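
One way to make such an evaluation concrete is to treat the explanation as a predictor of when the neuron fires and score it observationally. The schematic sketch below does only that; neuron_activation is a hypothetical stand-in for reading one neuron's value out of a model, and the threshold-based precision/recall scoring is an illustrative simplification rather than the paper's exact metrics.

# Score an explanation by how well it predicts the neuron's firing behavior.
def evaluate_explanation(predicted_positive, predicted_negative,
                         neuron_activation, threshold=0.5):
    """Precision/recall of the explanation treated as a firing predictor."""
    tp = sum(neuron_activation(x) > threshold for x in predicted_positive)
    fp = sum(neuron_activation(x) > threshold for x in predicted_negative)
    fn = len(predicted_positive) - tp
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Dummy neuron that fires on digits, checked against the hypothetical
# explanation "activates on text containing a year".
fake_neuron = lambda s: 1.0 if any(c.isdigit() for c in s) else 0.0
print(evaluate_explanation(["born in 1987", "the year 2020"],
                           ["a sunny day", "born in Paris"], fake_neuron))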

ReFT: Representation finetuning for language models

Z Wu, A Arora, Z Wang, A Geiger, D Jurafsky… - arXiv preprint arXiv …, 2024 - arxiv.org
Parameter-efficient fine-tuning (PEFT) methods seek to adapt large models via updates to a
small number of weights. However, much prior interpretability work has shown that …
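
The intervention at the heart of this approach (LoReFT) edits hidden representations rather than weights: h' = h + R^T(Wh + b - Rh), with a low-rank R whose rows span the edited subspace. A minimal module implementing that formula is sketched below; the dimensions are toy values, R is only initialized (not constrained) to be orthonormal, and the wiring into a frozen language model's forward pass is omitted.

# Low-rank representation intervention applied to hidden states from a frozen model.
import torch
import torch.nn as nn

class LoReFTIntervention(nn.Module):
    def __init__(self, d_model: int, rank: int):
        super().__init__()
        self.R = nn.Parameter(torch.empty(rank, d_model))   # low-rank projection (orthonormal rows in the paper)
        nn.init.orthogonal_(self.R)
        self.W = nn.Linear(d_model, rank)                   # learned source of the edit (includes bias b)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # Edit h only inside the rank-dimensional subspace spanned by R's rows.
        return h + (self.W(h) - h @ self.R.T) @ self.R

d_model, rank = 768, 4
intervene = LoReFTIntervention(d_model, rank)
h = torch.randn(2, 10, d_model)       # (batch, positions, hidden) from a frozen model
h_edited = intervene(h)               # only these few parameters would be trained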

Localizing model behavior with path patching

N Goldowsky-Dill, C MacLeod, L Sato… - arXiv preprint arXiv …, 2023 - arxiv.org
Localizing behaviors of neural networks to a subset of the network's components or a subset
of interactions between components is a natural first step towards analyzing network …
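
Path patching refines activation patching by asking how one component affects a downstream quantity along a specific path: the output is recomputed with only the sender's direct contribution swapped in from a second run, everything else held at its original value. The toy additive model below is invented purely to illustrate that move; real path patching applies the same idea to attention heads and MLPs writing into a transformer's residual stream.

# Toy illustration: two components write additively into a residual stream;
# patch only component A's contribution along the direct path to the logits.
import torch

torch.manual_seed(0)
d = 8
W_a, W_b, W_out = torch.randn(d, d), torch.randn(d, d), torch.randn(d, 2)

def components(x):
    """Two 'heads' that each write additively into the residual stream."""
    return W_a @ x, W_b @ x

def readout(a_out, b_out, x):
    return W_out.T @ (x + a_out + b_out)          # logits from the residual stream

x_clean, x_corrupt = torch.randn(d), torch.randn(d)
a_clean, b_clean = components(x_clean)
a_corrupt, _ = components(x_corrupt)

logits_clean = readout(a_clean, b_clean, x_clean)
# Path patch A -> logits: only A's direct contribution comes from the corrupt run.
logits_path_patched = readout(a_corrupt, b_clean, x_clean)
print("effect of A via the direct path:", logits_path_patched - logits_clean)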