Not all language model features are linear

J Engels, EJ Michaud, I Liao, W Gurnee… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent work has proposed that language models perform computation by manipulating one-
dimensional representations of concepts ("features") in activation space. In contrast, we …
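
The "one-dimensional representations" this snippet refers to are usually operationalized as directions in activation space: a concept is read off by projecting a hidden state onto a single vector. A minimal sketch of that reading (not from the paper; the difference-of-means probe and all names here are illustrative assumptions):

```python
import numpy as np

def concept_direction(pos_acts: np.ndarray, neg_acts: np.ndarray) -> np.ndarray:
    """Difference-of-means direction for a concept, computed from activations
    (n_examples x d_model) with and without the concept present."""
    direction = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
    return direction / np.linalg.norm(direction)

def feature_score(activation: np.ndarray, direction: np.ndarray) -> float:
    """Scalar projection of one activation onto the concept direction:
    the 'one-dimensional' reading of the feature."""
    return float(activation @ direction)

# Toy usage with random stand-in activations (d_model = 8).
rng = np.random.default_rng(0)
pos = rng.normal(loc=1.0, size=(32, 8))  # activations where the concept is present
neg = rng.normal(loc=0.0, size=(32, 8))  # activations where it is absent
d = concept_direction(pos, neg)
print(feature_score(pos[0], d), feature_score(neg[0], d))
```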

Do LLMs dream of elephants (when told not to)? Latent concept association and associative memory in transformers

Y Jiang, G Rajendran, P Ravikumar… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Language Models (LLMs) have the capacity to store and recall facts. Through
experimentation with open-source models, we observe that this ability to retrieve facts can …

All or none: Identifiable linear properties of next-token predictors in language modeling

E Marconato, S Lachapelle, S Weichwald… - arXiv preprint arXiv …, 2024 - arxiv.org
We analyze identifiability as a possible explanation for the ubiquity of linear properties
across language models, such as the vector difference between the representations of …
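
The "linear property" alluded to in this snippet is the familiar observation that difference vectors between representations of analogous pairs tend to be near-parallel. A minimal sketch of how one might test it (not from the paper; the embedding table and word pairs are illustrative assumptions):

```python
import numpy as np

def parallelism(pair_a: tuple[str, str], pair_b: tuple[str, str],
                embeddings: dict[str, np.ndarray]) -> float:
    """Cosine similarity between two difference vectors; values near 1
    indicate the linear property holds for these pairs."""
    da = embeddings[pair_a[0]] - embeddings[pair_a[1]]
    db = embeddings[pair_b[0]] - embeddings[pair_b[1]]
    return float(da @ db / (np.linalg.norm(da) * np.linalg.norm(db)))

# Toy usage with random stand-in vectors; with real model representations one
# would compare pairs such as ("easy", "easiest") and ("lucky", "luckiest").
rng = np.random.default_rng(0)
emb = {w: rng.normal(size=16) for w in ["easy", "easiest", "lucky", "luckiest"]}
print(parallelism(("easy", "easiest"), ("lucky", "luckiest"), emb))
```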

Intrinsic self-correction for enhanced morality: An analysis of internal mechanisms and the superficial hypothesis

G Liu, H Mao, J Tang, KM Johnson - arXiv preprint arXiv:2407.15286, 2024 - arxiv.org
Large Language Models (LLMs) are capable of producing content that perpetuates
stereotypes, discrimination, and toxicity. The recently proposed moral self-correction is a …

On the universal truthfulness hyperplane inside LLMs

J Liu, S Chen, Y Cheng, J He - arXiv preprint arXiv:2407.08582, 2024 - arxiv.org
While large language models (LLMs) have demonstrated remarkable abilities across
various fields, hallucination remains a significant challenge. Recent studies have explored …

Causal language modeling can elicit search and reasoning capabilities on logic puzzles

K Shah, N Dikkala, X Wang, R Panigrahy - arXiv preprint arXiv …, 2024 - arxiv.org
Causal language modeling using the Transformer architecture has yielded remarkable
capabilities in Large Language Models (LLMs) over the last few years. However, the extent …

PaCE: Parsimonious Concept Engineering for Large Language Models

J Luo, T Ding, KHR Chan, D Thaker… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Language Models (LLMs) are being used for a wide variety of tasks. While they are
capable of generating human-like responses, they can also produce undesirable output …

Provably Transformers Harness Multi-Concept Word Semantics for Efficient In-Context Learning

D Bu, W Huang, A Han, A Nitanda, T Suzuki… - arXiv preprint arXiv …, 2024 - arxiv.org
Transformer-based large language models (LLMs) have displayed remarkable creative
prowess and emergence capabilities. Existing empirical studies have revealed a strong …

The Geometry of Categorical and Hierarchical Concepts in Large Language Models

K Park, YJ Choe, Y Jiang, V Veitch - arXiv preprint arXiv:2406.01506, 2024 - arxiv.org
Understanding how semantic meaning is encoded in the representation spaces of large
language models is a fundamental problem in interpretability. In this paper, we study the two …

ResiDual Transformer Alignment with Spectral Decomposition

L Basile, V Maiorca, L Bortolussi, E Rodolà… - arXiv preprint arXiv …, 2024 - arxiv.org
When examined through the lens of their residual streams, a puzzling property emerges in
transformer networks: residual contributions (e.g., attention heads) sometimes specialize in …