From anecdotal evidence to quantitative evaluation methods: A systematic review on evaluating explainable AI

M Nauta, J Trienes, S Pathak, E Nguyen… - ACM Computing …, 2023 - dl.acm.org
The rising popularity of explainable artificial intelligence (XAI) to understand high-performing
black boxes raised the question of how to evaluate explanations of machine learning (ML) …

Post-hoc interpretability for neural NLP: A survey

A Madsen, S Reddy, S Chandar - ACM Computing Surveys, 2022 - dl.acm.org
Neural networks for NLP are becoming increasingly complex and widespread, and there is a
growing concern about whether these models are responsible to use. Explaining models helps to address …

Towards automated circuit discovery for mechanistic interpretability

A Conmy, A Mavor-Parker, A Lynch… - Advances in …, 2023 - proceedings.neurips.cc
Through considerable effort and intuition, several recent works have reverse-engineered
nontrivial behaviors of transformer models. This paper systematizes the mechanistic …

Explainability for large language models: A survey

H Zhao, H Chen, F Yang, N Liu, H Deng, H Cai… - ACM Transactions on …, 2024 - dl.acm.org
Large language models (LLMs) have demonstrated impressive capabilities in natural
language processing. However, their internal mechanisms are still unclear and this lack of …

Language in a bottle: Language model guided concept bottlenecks for interpretable image classification

Y Yang, A Panagopoulou, S Zhou… - Proceedings of the …, 2023 - openaccess.thecvf.com
Concept Bottleneck Models (CBM) are inherently interpretable models that factor
model decisions into human-readable concepts. They allow people to easily understand …

On the opportunities and risks of foundation models

R Bommasani, DA Hudson, E Adeli, R Altman… - arXiv preprint arXiv …, 2021 - arxiv.org
AI is undergoing a paradigm shift with the rise of models (e.g., BERT, DALL-E, GPT-3) that are
trained on broad data at scale and are adaptable to a wide range of downstream tasks. We …

Interpretability in the wild: a circuit for indirect object identification in GPT-2 small

K Wang, A Variengien, A Conmy, B Shlegeris… - arXiv preprint arXiv …, 2022 - arxiv.org
Research in mechanistic interpretability seeks to explain behaviors of machine learning
models in terms of their internal components. However, most previous work either focuses …

Does localization inform editing? Surprising differences in causality-based localization vs. knowledge editing in language models

P Hase, M Bansal, B Kim… - Advances in Neural …, 2024 - proceedings.neurips.cc
Language models learn a great quantity of factual information during pretraining,
and recent work localizes this information to specific model weights like mid-layer MLP …

Transformer feed-forward layers are key-value memories

M Geva, R Schuster, J Berant, O Levy - arXiv preprint arXiv:2012.14913, 2020 - arxiv.org
Feed-forward layers constitute two-thirds of a transformer model's parameters, yet their role
in the network remains under-explored. We show that feed-forward layers in transformer …

Toward transparent AI: A survey on interpreting the inner structures of deep neural networks

T Räuker, A Ho, S Casper… - 2023 IEEE Conference …, 2023 - ieeexplore.ieee.org
The last decade of machine learning has seen drastic increases in scale and capabilities.
Deep neural networks (DNNs) are increasingly being deployed in the real world. However …