A primer on the inner workings of transformer-based language models

J Ferrando, G Sarti, A Bisazza, M Costa-jussà - 2024 - research.rug.nl
The rapid progress of research aimed at interpreting the inner workings of advanced
language models has highlighted a need for contextualizing the insights gained from years …

A survey on uncertainty quantification of large language models: Taxonomy, open research challenges, and future directions

O Shorinwa, Z Mei, J Lidard, AZ Ren… - arXiv preprint arXiv …, 2024 - arxiv.org
The remarkable performance of large language models (LLMs) in content generation,
coding, and common-sense reasoning has spurred widespread integration into many facets …

Are you still on track!? Catching LLM Task Drift with Activations

S Abdelnabi, A Fay, G Cherubin, A Salem… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Language Models are commonly used in retrieval-augmented applications to execute
user instructions based on data from external sources. For example, modern search engines …
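
The title points at the mechanism: compare the model's activations before and after it reads external data, and flag cases where the delta looks like an injected instruction. Below is a heavily hedged sketch, not the paper's method; the linear probe and the threshold are assumptions.

```python
import torch
import torch.nn as nn

# Hypothetical detector: a linear probe on the difference between activations
# captured before and after the model ingests external text.
probe = nn.Linear(768, 1)  # assumed trained on labeled drift/no-drift examples

def task_drift_score(acts_before: torch.Tensor, acts_after: torch.Tensor) -> torch.Tensor:
    """Higher score = larger shift in the task representation."""
    delta = acts_after - acts_before
    return torch.sigmoid(probe(delta))

score = task_drift_score(torch.randn(1, 768), torch.randn(1, 768))
# Flag a drift if the score exceeds a calibrated threshold (value hypothetical).
```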

Unpacking SDXL Turbo: Interpreting text-to-image models with sparse autoencoders

V Surkov, C Wendler, M Terekhov… - arXiv preprint arXiv …, 2024 - arxiv.org
Sparse autoencoders (SAEs) have become a core ingredient in the reverse engineering of
large language models (LLMs). For LLMs, they have been shown to decompose …

Evaluating open-source sparse autoencoders on disentangling factual knowledge in GPT-2 small

M Chaudhary, A Geiger - arXiv preprint arXiv:2409.04478, 2024 - arxiv.org
A popular new method in mechanistic interpretability is to train high-dimensional sparse
autoencoders (SAEs) on neuron activations and use SAE features as the atomic units of …
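
The recipe this snippet names is compact enough to sketch. Below is a minimal, illustrative SAE in PyTorch, not any of these papers' released code: activations are encoded into an overcomplete, non-negative feature vector and decoded back, trained with a reconstruction loss plus an L1 sparsity penalty. All dimensions and hyperparameters are placeholders.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal SAE: map d_model activations to an overcomplete,
    non-negative feature space and back. Illustrative only."""
    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, acts: torch.Tensor):
        feats = torch.relu(self.encoder(acts))  # sparse feature activations
        recon = self.decoder(feats)
        return recon, feats

# Hypothetical training step on cached activations (random stand-ins here):
sae = SparseAutoencoder(d_model=768, d_features=16384)
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
acts = torch.randn(64, 768)
recon, feats = sae(acts)
loss = (recon - acts).pow(2).mean() + 1e-3 * feats.abs().mean()  # MSE + L1 sparsity
opt.zero_grad()
loss.backward()
opt.step()
```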

Sparse autoencoders reveal universal feature spaces across large language models

M Lan, P Torr, A Meek, A Khakzar, D Krueger… - arXiv preprint arXiv …, 2024 - arxiv.org
We investigate feature universality in large language models (LLMs), a research field that
aims to understand how different models similarly represent concepts in the latent spaces of …
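
One simple way to make "similarly represent concepts" concrete is to compare SAE feature directions across two models, e.g. via pairwise cosine similarity. The sketch below assumes equal activation widths purely for illustration; it is one possible analysis, not necessarily the paper's.

```python
import torch
import torch.nn.functional as F

# Hypothetical: rows are SAE feature directions (decoder columns) from SAEs
# trained on two different models; equal width is assumed for illustration.
feats_a = F.normalize(torch.randn(1024, 768), dim=1)
feats_b = F.normalize(torch.randn(1024, 768), dim=1)

# Pairwise cosine similarity; a high best match per row hints that the
# other model has a corresponding feature.
sim = feats_a @ feats_b.T
best_match = sim.max(dim=1).values
print(f"mean best-match similarity: {best_match.mean().item():.3f}")
```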

What makes your model a low-empathy or warmth person: Exploring the origins of personality in LLMs

S Yang, S Zhu, R Bao, L Liu, Y Cheng, L Hu… - arXiv preprint arXiv …, 2024 - arxiv.org
Large language models (LLMs) have demonstrated remarkable capabilities in generating
human-like text and exhibiting personality traits similar to those in humans. However, the …

Applying sparse autoencoders to unlearn knowledge in language models

E Farrell, YT Lau, A Conmy - arXiv preprint arXiv:2410.19278, 2024 - arxiv.org
We investigate whether sparse autoencoders (SAEs) can be used to remove knowledge
from language models. We use the biology subset of the Weapons of Mass Destruction …
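
The snippet doesn't show the procedure, but the generic SAE-ablation mechanism it builds on can be sketched: clamp the coefficients of targeted features before decoding, so the reconstructed activation no longer carries those concepts. Which indices encode the targeted knowledge is a placeholder here.

```python
import torch
import torch.nn as nn

# Hypothetical SAE decoder and a batch of sparse feature activations.
decoder = nn.Linear(16384, 768)
feats = torch.relu(torch.randn(4, 16384))

def ablate_features(feats: torch.Tensor, decoder: nn.Linear,
                    indices: list[int], clamp_value: float = 0.0) -> torch.Tensor:
    """Clamp the selected SAE features before decoding, so the reconstructed
    activation no longer carries the concepts they encode."""
    edited = feats.clone()
    edited[:, indices] = clamp_value
    return decoder(edited)

# Indices standing in for features that fired on the targeted knowledge.
edited_acts = ablate_features(feats, decoder, indices=[12, 345])
```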

Llama Scope: Extracting millions of features from Llama-3.1-8B with sparse autoencoders

Z He, W Shu, X Ge, L Chen, J Wang, Y Zhou… - arXiv preprint arXiv …, 2024 - arxiv.org
Sparse Autoencoders (SAEs) have emerged as a powerful unsupervised method for
extracting sparse representations from language models, yet scalable training remains a …

Improving steering vectors by targeting sparse autoencoder features

S Chalnev, M Siu, A Conmy - arXiv preprint arXiv:2411.02193, 2024 - arxiv.org
To control the behavior of language models, steering methods attempt to ensure that outputs
of the model satisfy specific pre-defined properties. Adding steering vectors to the model is a …
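
The mechanism the snippet names, adding a fixed vector to a hidden state during the forward pass, can be sketched with a PyTorch forward hook; the layer, scale, and vector below are all illustrative assumptions, not the paper's setup.

```python
import torch

# Hypothetical steering setup: a fixed direction added to one layer's
# residual-stream output during the forward pass.
steering_vector = torch.randn(768)  # stand-in for a derived direction
scale = 4.0

def steering_hook(module, inputs, output):
    # Many transformer blocks return tuples; handle both cases.
    if isinstance(output, tuple):
        return (output[0] + scale * steering_vector,) + output[1:]
    return output + scale * steering_vector

# Usage (HuggingFace-style model; the layer choice is hypothetical):
# handle = model.transformer.h[10].register_forward_hook(steering_hook)
# ...generate...
# handle.remove()
```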