Uniform manifold approximation and projection

J Healy, L McInnes - Nature Reviews Methods Primers, 2024 - nature.com
Uniform manifold approximation and projection is a nonlinear dimension reduction method
often used for visualizing data and as pre-processing for further machine-learning tasks …
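
As a rough sketch of what the method looks like in practice, the following uses the umap-learn package to embed synthetic 50-dimensional points into two dimensions; the parameter values are illustrative defaults, not settings recommended in the primer.

```python
# Minimal UMAP sketch (assumes `pip install umap-learn`): reduce
# high-dimensional vectors to 2-D coordinates for visualization.
import numpy as np
import umap

X = np.random.rand(1000, 50)            # 1000 points in 50 dimensions
reducer = umap.UMAP(n_neighbors=15,     # size of the local neighbourhood
                    min_dist=0.1,       # how tightly points may be packed
                    n_components=2,     # target dimensionality
                    random_state=42)
X_2d = reducer.fit_transform(X)         # shape (1000, 2), ready to plot
```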

Mechanistic Interpretability for AI Safety--A Review

L Bereska, E Gavves - arXiv preprint arXiv:2404.14082, 2024 - arxiv.org
Understanding AI systems' inner workings is critical for ensuring value alignment and safety.
This review explores mechanistic interpretability: reverse engineering the computational …

Refusal in language models is mediated by a single direction

A Arditi, O Obeso, A Syed, D Paleka… - Advances in …, 2025 - proceedings.neurips.cc
Conversational large language models are fine-tuned for both instruction-following and
safety, resulting in models that obey benign requests but refuse harmful ones. While this …
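
The intervention the abstract alludes to can be sketched as projecting a single direction out of a model's activations; the function name, shapes, and random tensors below are illustrative assumptions, not the authors' code (the direction itself would normally be estimated from contrasting harmful and harmless prompts).

```python
# Sketch: remove the component of hidden states along one direction.
import torch

def ablate_direction(hidden: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    """Project out the (unit-normalised) direction from every hidden vector."""
    d = direction / direction.norm()
    return hidden - (hidden @ d).unsqueeze(-1) * d

hidden = torch.randn(2, 8, 4096)        # (batch, seq, d_model) activations
refusal_dir = torch.randn(4096)         # candidate "refusal direction"
edited = ablate_direction(hidden, refusal_dir)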

Scaling and evaluating sparse autoencoders

L Gao, TD la Tour, H Tillman, G Goh, R Troll… - arXiv preprint arXiv …, 2024 - arxiv.org
Sparse autoencoders provide a promising unsupervised approach for extracting
interpretable features from a language model by reconstructing activations from a sparse …
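
A minimal sketch of the kind of sparse autoencoder this line of work studies follows (here a top-k variant in PyTorch; the dimensions, k, and the plain reconstruction objective are illustrative assumptions, not the paper's training recipe).

```python
# Top-k sparse autoencoder sketch: encode an activation vector, keep only
# the k largest latents, and reconstruct the activation from them.
import torch
import torch.nn as nn

class TopKSAE(nn.Module):
    def __init__(self, d_model: int, d_latent: int, k: int):
        super().__init__()
        self.enc = nn.Linear(d_model, d_latent)
        self.dec = nn.Linear(d_latent, d_model)
        self.k = k

    def forward(self, x: torch.Tensor):
        z = torch.relu(self.enc(x))                    # non-negative latents
        topk = torch.topk(z, self.k, dim=-1)           # keep k largest per example
        z_sparse = torch.zeros_like(z).scatter_(-1, topk.indices, topk.values)
        return self.dec(z_sparse), z_sparse            # reconstruction + sparse code

sae = TopKSAE(d_model=768, d_latent=768 * 16, k=32)
acts = torch.randn(64, 768)                            # batch of model activations
recon, codes = sae(acts)
loss = torch.mean((recon - acts) ** 2)                 # reconstruction objective
```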

Gemma Scope: Open sparse autoencoders everywhere all at once on Gemma 2

T Lieberum, S Rajamanoharan, A Conmy… - arXiv preprint arXiv …, 2024 - arxiv.org
Sparse autoencoders (SAEs) are an unsupervised method for learning a sparse
decomposition of a neural network's latent representations into seemingly interpretable …
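
The Gemma Scope SAEs use a JumpReLU activation; a hedged sketch of that encoder step is below, with placeholder shapes and threshold values rather than the released weights (loading the published SAEs is not shown).

```python
# JumpReLU sketch: pre-activations at or below a learned per-latent
# threshold are zeroed; those above pass through unchanged.
import torch

def jumprelu(pre_acts: torch.Tensor, threshold: torch.Tensor) -> torch.Tensor:
    return pre_acts * (pre_acts > threshold)

pre = torch.randn(4, 16384)             # encoder pre-activations for 4 tokens
theta = torch.full((16384,), 0.05)      # per-latent thresholds (placeholder values)
latents = jumprelu(pre, theta)          # sparse feature activations
```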

[PDF] Meta-rewarding language models: Self-improving alignment with LLM-as-a-Meta-Judge

T Wu, W Yuan, O Golovneva, J Xu, Y Tian, J Jiao… - arXiv preprint arXiv …, 2024 - rivista.ai
Large Language Models (LLMs) are rapidly surpassing human knowledge in
many domains. While improving these models traditionally relies on costly human data …

[HTML] When LLMs meet cybersecurity: A systematic literature review

J Zhang, H Bu, H Wen, Y Liu, H Fei… - …, 2025 - cybersecurity.springeropen.com
The rapid development of large language models (LLMs) has opened new avenues across
various fields, including cybersecurity, which faces an evolving threat landscape and …

Towards principled evaluations of sparse autoencoders for interpretability and control

A Makelov, G Lange, N Nanda - arXiv preprint arXiv:2405.08366, 2024 - arxiv.org
Disentangling model activations into meaningful features is a central problem in
interpretability. However, the absence of ground-truth for these features in realistic scenarios …

Transcoders find interpretable LLM feature circuits

J Dunefsky, P Chlenski… - Advances in Neural …, 2025 - proceedings.neurips.cc
A key goal in mechanistic interpretability is circuit analysis: finding sparse subgraphs of
models corresponding to specific behaviors or capabilities. However, MLP sublayers make …
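
The transcoder idea can be sketched as a sparse module trained to imitate an MLP sublayer's input-to-output map, so the MLP's contribution can be read off a small number of active features; the dimensions, the ReLU-plus-L1 sparsity penalty, and all names below are illustrative assumptions rather than the paper's implementation.

```python
# Transcoder sketch: predict the MLP sublayer's output from its input
# through a wide, sparsely activating feature layer.
import torch
import torch.nn as nn

class Transcoder(nn.Module):
    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.enc = nn.Linear(d_model, d_features)    # reads the MLP's input
        self.dec = nn.Linear(d_features, d_model)    # predicts the MLP's output

    def forward(self, mlp_in: torch.Tensor):
        f = torch.relu(self.enc(mlp_in))             # sparse feature activations
        return self.dec(f), f

tc = Transcoder(d_model=768, d_features=768 * 8)
mlp_in = torch.randn(32, 768)                        # MLP sublayer inputs
mlp_out = torch.randn(32, 768)                       # true MLP sublayer outputs
pred, feats = tc(mlp_in)
loss = ((pred - mlp_out) ** 2).mean() + 1e-3 * feats.abs().mean()   # fit + sparsity
```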

Large language models can be used to estimate the latent positions of politicians

PY Wu, J Nagler, JA Tucker, S Messing - arXiv preprint arXiv:2303.12057, 2023 - arxiv.org
Existing approaches to estimating politicians' latent positions along specific dimensions
often fail when relevant data is limited. We leverage the embedded knowledge in generative …