From anecdotal evidence to quantitative evaluation methods: A systematic review on evaluating explainable AI
The rising popularity of explainable artificial intelligence (XAI) to understand high-performing
black boxes raised the question of how to evaluate explanations of machine learning (ML) …
Post-hoc interpretability for neural NLP: A survey
Neural networks for NLP are becoming increasingly complex and widespread, and there is a
growing concern whether these models are responsible to use. Explaining models helps to address …
Towards automated circuit discovery for mechanistic interpretability
Through considerable effort and intuition, several recent works have reverse-engineered
nontrivial behaviors of transformer models. This paper systematizes the mechanistic …
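For intuition, the core loop of such automated circuit discovery can be caricatured as greedy edge pruning over the model's computational graph. The sketch below is an illustrative assumption, not the paper's implementation: `metric(kept)` is presumed to run the model with every edge outside `kept` patched to a corrupted-run activation and return a divergence from the clean output.

```python
def discover_circuit(edges, metric, tau=0.01):
    """Greedily drop edges whose knockout barely changes the behavior metric."""
    kept = set(edges)
    for e in sorted(edges):                  # fixed order for reproducibility
        without = kept - {e}
        if abs(metric(kept) - metric(without)) < tau:
            kept = without                   # edge is unnecessary for the behavior
    return kept
```

Edges that survive the threshold `tau` (a hypothetical hyperparameter here) form the candidate circuit.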
Explainability for large language models: A survey
Large language models (LLMs) have demonstrated impressive capabilities in natural
language processing. However, their internal mechanisms are still unclear and this lack of …
Language in a bottle: Language model guided concept bottlenecks for interpretable image classification
Concept Bottleneck Models (CBM) are inherently interpretable models that factor
model decisions into human-readable concepts. They allow people to easily understand …
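As a concrete picture of the bottleneck structure the abstract refers to, here is a minimal toy sketch: inputs are mapped to named concept scores, and the final decision is linear in those scores, so each concept's contribution is inspectable. All sizes, weights, and concept names are illustrative assumptions; the paper's actual contribution, using a language model to generate the concept set, is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)
concepts = ["has wings", "has beak", "has fur", "has whiskers"]  # toy concept set

x = rng.normal(size=512)                           # stand-in image embedding
W_concept = rng.normal(size=(len(concepts), 512))  # concept predictor (random, toy)
W_class = rng.normal(size=(2, len(concepts)))      # classes: bird, cat (toy)

c = W_concept @ x        # the interpretable bottleneck: one score per concept
logits = W_class @ c     # the decision is linear in the concept scores

for name, score in zip(concepts, c):
    print(f"{name}: {score:+.2f}")   # every concept score can be inspected
print("prediction:", ["bird", "cat"][int(np.argmax(logits))])
```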
On the opportunities and risks of foundation models
AI is undergoing a paradigm shift with the rise of models (e.g., BERT, DALL-E, GPT-3) that are
trained on broad data at scale and are adaptable to a wide range of downstream tasks. We …
Interpretability in the wild: A circuit for indirect object identification in GPT-2 small
Research in mechanistic interpretability seeks to explain behaviors of machine learning
models in terms of their internal components. However, most previous work either focuses …
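For readers unfamiliar with the task, indirect object identification asks the model to complete a sentence with the name that is not repeated; the snippet below shows the prompt format (the exact sentence is illustrative):

```python
# The model should prefer the indirect object, i.e. the non-repeated name.
prompt = "When Mary and John went to the store, John gave a drink to"
expected_completion = " Mary"
```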
Does localization inform editing? Surprising differences in causality-based localization vs. knowledge editing in language models
Language models learn a great quantity of factual information during pretraining,
and recent work localizes this information to specific model weights like mid-layer MLP …
Transformer feed-forward layers are key-value memories
Feed-forward layers constitute two-thirds of a transformer model's parameters, yet their role
in the network remains under-explored. We show that feed-forward layers in transformer …
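The two-thirds figure follows from the standard transformer sizing d_ff = 4·d_model. A minimal sanity check of that arithmetic, under the simplifying assumption that only attention and feed-forward weight matrices are counted (biases, embeddings, and LayerNorms ignored):

```python
d_model = 768
d_ff = 4 * d_model            # the standard sizing assumed here

attn = 4 * d_model ** 2       # W_Q, W_K, W_V, W_O, each d_model x d_model
ffn = 2 * d_model * d_ff      # W_in and W_out, i.e. 8 * d_model**2

print(ffn / (attn + ffn))     # 8/12 = 0.666..., i.e. two-thirds
```

In the paper's key-value reading, the first projection's weight vectors act as keys matching input patterns, and the second projection's weight vectors as the values they retrieve.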
Toward transparent AI: A survey on interpreting the inner structures of deep neural networks
The last decade of machine learning has seen drastic increases in scale and capabilities.
Deep neural networks (DNNs) are increasingly being deployed in the real world. However …