Mechanistic Interpretability for AI Safety--A Review

L Bereska, E Gavves - arXiv preprint arXiv:2404.14082, 2024 - arxiv.org
Understanding AI systems' inner workings is critical for ensuring value alignment and safety.
This review explores mechanistic interpretability: reverse engineering the computational …

A practical review of mechanistic interpretability for transformer-based language models

D Rai, Y Zhou, S Feng, A Saparov, Z Yao - arXiv preprint arXiv …, 2024 - arxiv.org
Mechanistic interpretability (MI) is an emerging sub-field of interpretability that seeks to
understand a neural network model by reverse-engineering its internal computations …

Faith and fate: Limits of transformers on compositionality

N Dziri, X Lu, M Sclar, XL Li, L Jiang… - Advances in …, 2023 - proceedings.neurips.cc
Transformer large language models (LLMs) have sparked admiration for their exceptional
performance on tasks that demand intricate multi-step reasoning. Yet, these models …

Towards best practices of activation patching in language models: Metrics and methods

F Zhang, N Nanda - arXiv preprint arXiv:2309.16042, 2023 - arxiv.org
Mechanistic interpretability seeks to understand the internal mechanisms of machine
learning models, where localization--identifying the important model components--is a key …
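
A minimal sketch of the general activation-patching recipe this paper evaluates: cache an activation from a clean run, splice it into a corrupted run, and measure how much of the clean behavior returns. The model, prompts, layer index, and log-prob metric below are illustrative assumptions, not the paper's prescribed settings (metric choice is in fact one of its central concerns).

```python
# Activation-patching sketch (illustrative; layer, prompts, and metric are
# arbitrary choices, not the paper's recommendations).
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
tok = GPT2TokenizerFast.from_pretrained("gpt2")

clean_ids = tok("The Eiffel Tower is located in the city of", return_tensors="pt").input_ids
corrupt_ids = tok("The Great Pyramid is located in the city of", return_tensors="pt").input_ids
answer_id = tok(" Paris").input_ids[0]  # answer token for the clean prompt

LAYER = 6  # arbitrary middle layer
cache = {}

def save_hook(module, inputs, output):
    # GPT-2 block outputs are tuples; output[0] is the residual-stream state.
    cache["h"] = output[0][:, -1, :].detach().clone()

def patch_hook(module, inputs, output):
    # Overwrite the final-position activation with the cached clean one.
    output[0][:, -1, :] = cache["h"]

with torch.no_grad():
    # 1) Clean run: cache the chosen layer's activation at the last position.
    handle = model.transformer.h[LAYER].register_forward_hook(save_hook)
    clean_logits = model(clean_ids).logits[0, -1]
    handle.remove()

    # 2) Corrupted run without patching, as a baseline.
    corrupt_logits = model(corrupt_ids).logits[0, -1]

    # 3) Corrupted run with the clean activation patched in.
    handle = model.transformer.h[LAYER].register_forward_hook(patch_hook)
    patched_logits = model(corrupt_ids).logits[0, -1]
    handle.remove()

for name, logits in [("clean", clean_logits), ("corrupt", corrupt_logits),
                     ("patched", patched_logits)]:
    print(name, torch.log_softmax(logits, -1)[answer_id].item())
```

If patching this layer restores the " Paris" log-prob toward the clean run's value, the patched component is implicated in the behavior; sweeping layers and positions yields the localization maps the paper studies.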

Grokked transformers are implicit reasoners: A mechanistic journey to the edge of generalization

B Wang, X Yue, Y Su, H Sun - arXiv preprint arXiv:2405.15071, 2024 - arxiv.org
We study whether transformers can learn to implicitly reason over parametric knowledge, a
skill that even the most capable language models struggle with. Focusing on two …

A primer on the inner workings of transformer-based language models

J Ferrando, G Sarti, A Bisazza, M Costa-jussà - 2024 - research.rug.nl
The rapid progress of research aimed at interpreting the inner workings of advanced
language models has highlighted a need for contextualizing the insights gained from years …

Benign overfitting and grokking in relu networks for xor cluster data

Z Xu, Y Wang, S Frei, G Vardi, W Hu - arXiv preprint arXiv:2310.02541, 2023 - arxiv.org
Neural networks trained by gradient descent (GD) have exhibited a number of surprising
generalization behaviors. First, they can achieve a perfect fit to noisy training data and still …
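
The setup is easy to simulate at toy scale: Gaussian clusters in an XOR arrangement, a fraction of training labels flipped, and a two-layer ReLU net trained on the logistic loss. This is only an illustration of the phenomenon, not the paper's regime (their theory concerns specific high-dimensional scalings); the cluster separation, width, learning rate, and step count below are arbitrary.

```python
# Toy XOR-cluster benign-overfitting sketch (scales are illustrative).
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
d, n, noise_rate = 50, 200, 0.1

mu1 = torch.zeros(d); mu1[0] = 5.0   # +/- mu1 clusters carry label +1
mu2 = torch.zeros(d); mu2[1] = 5.0   # +/- mu2 clusters carry label -1

def sample(m):
    centers = torch.stack([mu1, -mu1, mu2, -mu2])
    idx = torch.randint(0, 4, (m,))
    x = centers[idx] + torch.randn(m, d)
    y = torch.where(idx < 2, 1.0, -1.0)  # XOR labeling of the four clusters
    return x, y

Xtr, ytr = sample(n)
Xte, yte = sample(1000)
flip = torch.rand(n) < noise_rate
ytr_noisy = torch.where(flip, -ytr, ytr)  # label noise on the train set

net = nn.Sequential(nn.Linear(d, 100), nn.ReLU(), nn.Linear(100, 1))
opt = torch.optim.SGD(net.parameters(), lr=0.05)

for step in range(10000):  # full-batch GD on the logistic loss
    opt.zero_grad()
    margins = ytr_noisy * net(Xtr).squeeze(-1)
    F.softplus(-margins).mean().backward()
    opt.step()

with torch.no_grad():
    tr_acc = ((net(Xtr).squeeze(-1) * ytr_noisy) > 0).float().mean()
    te_acc = ((net(Xte).squeeze(-1) * yte) > 0).float().mean()
# Benign overfitting: train accuracy approaches 1 on the noisy labels
# while clean test accuracy stays high.
print(f"noisy-train acc {tr_acc:.2f}, clean-test acc {te_acc:.2f}")
```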

Dichotomy of early and late phase implicit biases can provably induce grokking

K Lyu, J Jin, Z Li, SS Du, JD Lee, W Hu - arXiv preprint arXiv:2311.18817, 2023 - arxiv.org
Recent work by Power et al. (2022) highlighted a surprising "grokking" phenomenon in
learning arithmetic tasks: a neural net first "memorizes" the training set, resulting in perfect …
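
The phenomenon is straightforward to reproduce at toy scale. The sketch below trains a small MLP on modular addition with heavy weight decay, in the spirit of the Power et al. setup (which used a transformer); the modulus, width, training fraction, and optimizer settings are illustrative assumptions, and seeing a sharp transition may require tuning them.

```python
# Toy grokking sketch: modular addition with an MLP (settings illustrative).
import torch
import torch.nn as nn

P = 97  # task: predict (a + b) mod P
pairs = torch.cartesian_prod(torch.arange(P), torch.arange(P))
labels = (pairs[:, 0] + pairs[:, 1]) % P
# One-hot encode the two operands and concatenate.
X = torch.cat([torch.eye(P)[pairs[:, 0]], torch.eye(P)[pairs[:, 1]]], dim=1)

# A small training fraction plus weight decay is the regime where grokking
# is typically reported.
perm = torch.randperm(len(X), generator=torch.Generator().manual_seed(0))
n_train = int(0.3 * len(X))
tr, te = perm[:n_train], perm[n_train:]

model = nn.Sequential(nn.Linear(2 * P, 256), nn.ReLU(), nn.Linear(256, P))
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1.0)
loss_fn = nn.CrossEntropyLoss()

for step in range(1, 20001):  # full-batch training
    opt.zero_grad()
    loss_fn(model(X[tr]), labels[tr]).backward()
    opt.step()
    if step % 1000 == 0:
        with torch.no_grad():
            tr_acc = (model(X[tr]).argmax(-1) == labels[tr]).float().mean()
            te_acc = (model(X[te]).argmax(-1) == labels[te]).float().mean()
        # Expected pattern: train accuracy saturates early while test
        # accuracy sits near chance, then jumps long afterward (grokking).
        print(f"step {step:6d}  train {tr_acc:.2f}  test {te_acc:.2f}")
```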

The heuristic core: Understanding subnetwork generalization in pretrained language models

A Bhaskar, D Friedman, D Chen - arXiv preprint arXiv:2403.03942, 2024 - arxiv.org
Prior work has found that pretrained language models (LMs) fine-tuned with different
random seeds can achieve similar in-domain performance but generalize differently on tests …

Grokking as the transition from lazy to rich training dynamics

T Kumar - 2024 - dash.harvard.edu
We study the recently discovered “grokking” phenomenon in deep learning [Power et al.,
2022], where neural networks generalize to unseen data abruptly, long after memorizing …