Mechanistic Interpretability for AI Safety--A Review
Understanding AI systems' inner workings is critical for ensuring value alignment and safety.
This review explores mechanistic interpretability: reverse-engineering the computational …
Opening the black box of large language models: Two views on holistic interpretability
As large language models (LLMs) grow more powerful, concerns around potential harms
like toxicity, unfairness, and hallucination threaten user trust. Ensuring beneficial alignment …
Beyond Interpretability: The Gains of Feature Monosemanticity on Model Robustness
Deep learning models often suffer from a lack of interpretability due to polysemanticity,
where individual neurons are activated by multiple unrelated semantics, resulting in unclear …
Neurons Speak in Ranges: Breaking Free from Discrete Neuronal Attribution
Interpreting and controlling the internal mechanisms of large language models (LLMs) is
crucial for improving their trustworthiness and utility. Recent efforts have primarily focused …
SAFR: Neuron Redistribution for Interpretability
Superposition refers to encoding representations of multiple features within a single neuron,
which is common in transformers. This property allows neurons to combine and represent …
NEURONAL ENTANGLEMENT, AND SPARSITY
W DISTANCES - openreview.net
Disentangling polysemantic neurons is at the core of many current approaches to
interpretability of large language models. Here we attempt to study how disentanglement …