Mechanistic Interpretability for AI Safety -- A Review

L Bereska, E Gavves - arXiv preprint arXiv:2404.14082, 2024 - arxiv.org
Understanding AI systems' inner workings is critical for ensuring value alignment and safety.
This review explores mechanistic interpretability: reverse-engineering the computational …

Opening the black box of large language models: Two views on holistic interpretability

H Zhao, F Yang, H Lakkaraju, M Du - arXiv e-prints, 2024 - ui.adsabs.harvard.edu
As large language models (LLMs) grow more powerful, concerns around potential harms
like toxicity, unfairness, and hallucination threaten user trust. Ensuring beneficial alignment …

Beyond Interpretability: The Gains of Feature Monosemanticity on Model Robustness

Q Zhang, Y Wang, J Cui, X Pan, Q Lei… - arXiv preprint arXiv …, 2024 - arxiv.org
Deep learning models often suffer from a lack of interpretability due to polysemanticity,
where individual neurons are activated by multiple unrelated semantics, resulting in unclear …

Neurons Speak in Ranges: Breaking Free from Discrete Neuronal Attribution

MU Haider, H Rizwan, H Sajjad, P Ju… - arXiv preprint arXiv …, 2025 - arxiv.org
Interpreting and controlling the internal mechanisms of large language models (LLMs) is
crucial for improving their trustworthiness and utility. Recent efforts have primarily focused …

SAFR: Neuron Redistribution for Interpretability

R Chang, C Deng, H Chen - arXiv preprint arXiv:2501.16374, 2025 - arxiv.org
Superposition refers to encoding representations of multiple features within a single neuron,
which is common in transformers. This property allows neurons to combine and represent …

Wasserstein Distances, Neuronal Entanglement, and Sparsity

openreview.net
Disentangling polysemantic neurons is at the core of many current approaches to
interpretability of large language models. Here we attempt to study how disentanglement …