AI alignment: A comprehensive survey

J Ji, T Qiu, B Chen, B Zhang, H Lou, K Wang… - arXiv preprint arXiv …, 2023 - arxiv.org
AI alignment aims to make AI systems behave in line with human intentions and values. As
AI systems grow more capable, so do risks from misalignment. To provide a comprehensive …

Mechanistic Interpretability for AI Safety--A Review

L Bereska, E Gavves - arXiv preprint arXiv:2404.14082, 2024 - arxiv.org
Understanding AI systems' inner workings is critical for ensuring value alignment and safety.
This review explores mechanistic interpretability: reverse engineering the computational …

KAN: Kolmogorov-Arnold networks

Z Liu, Y Wang, S Vaidya, F Ruehle, J Halverson… - arXiv preprint arXiv …, 2024 - arxiv.org
Inspired by the Kolmogorov-Arnold representation theorem, we propose Kolmogorov-Arnold
Networks (KANs) as promising alternatives to Multi-Layer Perceptrons (MLPs). While MLPs …

Towards automated circuit discovery for mechanistic interpretability

A Conmy, A Mavor-Parker, A Lynch… - Advances in …, 2023 - proceedings.neurips.cc
Through considerable effort and intuition, several recent works have reverse-engineered
nontrivial behaviors of transformer models. This paper systematizes the mechanistic …

Language models represent space and time

W Gurnee, M Tegmark - arXiv preprint arXiv:2310.02207, 2023 - arxiv.org
The capabilities of large language models (LLMs) have sparked debate over whether such
systems just learn an enormous collection of superficial statistics or a set of more coherent …

Scaling and evaluating sparse autoencoders

L Gao, TD la Tour, H Tillman, G Goh, R Troll… - arXiv preprint arXiv …, 2024 - arxiv.org
Sparse autoencoders provide a promising unsupervised approach for extracting
interpretable features from a language model by reconstructing activations from a sparse …

Eliciting latent predictions from transformers with the tuned lens

N Belrose, Z Furman, L Smith, D Halawi… - arXiv preprint arXiv …, 2023 - arxiv.org
We analyze transformers from the perspective of iterative inference, seeking to understand
how model predictions are refined layer by layer. To do so, we train an affine probe for each …

Toward transparent AI: A survey on interpreting the inner structures of deep neural networks

T Räuker, A Ho, S Casper… - 2023 IEEE conference …, 2023 - ieeexplore.ieee.org
The last decade of machine learning has seen drastic increases in scale and capabilities.
Deep neural networks (DNNs) are increasingly being deployed in the real world. However …

Birth of a transformer: A memory viewpoint

A Bietti, V Cabannes, D Bouchacourt… - Advances in …, 2023 - proceedings.neurips.cc
Large language models based on transformers have achieved great empirical successes.
However, as they are deployed more widely, there is a growing need to better understand …

The clock and the pizza: Two stories in mechanistic explanation of neural networks

Z Zhong, Z Liu, M Tegmark… - Advances in neural …, 2023 - proceedings.neurips.cc
Do neural networks, trained on well-understood algorithmic tasks, reliably rediscover known
algorithms? Several recent studies, on tasks ranging from group operations to in-context …