Not all language model features are linear
Recent work has proposed that language models perform computation by manipulating one-dimensional representations of concepts ("features") in activation space. In contrast, we …
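For orientation, here is a minimal sketch of what a one-dimensional ("linear") feature means operationally: a single direction in activation space along which a thresholded projection separates a concept. The synthetic activations, planted direction, and difference-of-means estimator are illustrative assumptions, not the paper's method.

```python
# Illustrative sketch only: a "linear feature" as a single direction w such
# that projecting activations onto w separates a concept. The data below is
# a synthetic stand-in for hidden states collected from a model.
import numpy as np

def concept_direction(acts: np.ndarray, labels: np.ndarray) -> np.ndarray:
    """Unit-norm difference-of-means direction for a binary concept."""
    d = acts[labels].mean(axis=0) - acts[~labels].mean(axis=0)
    return d / np.linalg.norm(d)

def one_d_probe_accuracy(acts: np.ndarray, labels: np.ndarray) -> float:
    """Accuracy of a threshold classifier along the concept direction."""
    proj = acts @ concept_direction(acts, labels)
    thresh = (proj[labels].mean() + proj[~labels].mean()) / 2
    return float(((proj > thresh) == labels).mean())

rng = np.random.default_rng(0)
acts = rng.standard_normal((400, 64))
labels = rng.random(400) < 0.5
acts[labels] += 3.0 * (np.arange(64) == 0)   # plant the concept along one axis
print(one_d_probe_accuracy(acts, labels))    # well above chance (~0.93)
```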
Do LLMs dream of elephants (when told not to)? Latent concept association and associative memory in transformers
Large Language Models (LLMs) have the capacity to store and recall facts. Through experimentation with open-source models, we observe that this ability to retrieve facts can …
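For intuition about the associative-memory framing, here is a textbook linear associative memory in which facts are stored as summed key-value outer products and recalled with one matrix multiply. This toy is standard material, not the paper's model; the random keys and values are stand-ins for learned representations.

```python
# Toy linear associative memory: store (key, value) pairs as outer products,
# recall a value with a single matmul. Works because random unit keys in
# high dimension are nearly orthogonal, so crosstalk stays small.
import numpy as np

rng = np.random.default_rng(0)
d, n_facts = 256, 20
keys = rng.standard_normal((n_facts, d))
keys /= np.linalg.norm(keys, axis=1, keepdims=True)
values = rng.standard_normal((n_facts, d))

W = values.T @ keys                         # d x d memory matrix

recalled = W @ keys[3]                      # query with a stored key
print(int(np.argmax(values @ recalled)))    # recovers index 3
```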
All or none: Identifiable linear properties of next-token predictors in language modeling
We analyze identifiability as a possible explanation for the ubiquity of linear properties across language models, such as the vector difference between the representations of …
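One concrete instance of such a linear property is that representation differences encoding the same relation are (near-)parallel across word pairs. A hedged sketch of that check follows; the `emb` mapping and the example word pairs are hypothetical, not taken from the paper.

```python
# Sketch of a parallelism check for a relational "difference vector".
import numpy as np

def relation_cosine(emb: dict, pair_a: tuple, pair_b: tuple) -> float:
    """Cosine similarity between two representation-difference vectors."""
    da = emb[pair_a[0]] - emb[pair_a[1]]
    db = emb[pair_b[0]] - emb[pair_b[1]]
    return float(da @ db / (np.linalg.norm(da) * np.linalg.norm(db)))

# Hypothetical usage with embeddings collected from a real model:
#   relation_cosine(emb, ("big", "biggest"), ("small", "smallest"))
# A value near 1 means the relation is encoded as a shared direction.
```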
Intrinsic self-correction for enhanced morality: An analysis of internal mechanisms and the superficial hypothesis
Large Language Models (LLMs) are capable of producing content that perpetuates stereotypes, discrimination, and toxicity. The recently proposed moral self-correction is a …
On the universal truthfulness hyperplane inside LLMs
While large language models (LLMs) have demonstrated remarkable abilities across various fields, hallucination remains a significant challenge. Recent studies have explored …
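A truthfulness hyperplane of this kind can be searched for with a generic linear probe on hidden states. A minimal sketch, assuming hidden states `X` for statements and binary truth labels `y` (synthetic placeholders below); this is not the paper's exact training recipe.

```python
# Generic linear probe: fit a separating hyperplane over hidden states.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Placeholder data; in practice X holds LLM hidden states for statements
# and y the corresponding truth labels.
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 64))
y = (X @ rng.standard_normal(64) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("held-out accuracy:", probe.score(X_te, y_te))
# probe.coef_ is the hyperplane normal; probe.intercept_ its offset.
```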
Causal language modeling can elicit search and reasoning capabilities on logic puzzles
Causal language modeling using the Transformer architecture has yielded remarkable capabilities in Large Language Models (LLMs) over the last few years. However, the extent …
PaCE: Parsimonious Concept Engineering for Large Language Models
Large Language Models (LLMs) are being used for a wide variety of tasks. While they are capable of generating human-like responses, they can also produce undesirable output …
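The decompose-then-remove idea behind activation-space concept engineering can be sketched as sparse coding over a concept dictionary followed by subtracting flagged components. The random dictionary, activation vector, and `undesired` indices below are stand-ins (PaCE itself builds a large curated concept dictionary), so treat this as a shape of the approach, not the method.

```python
# Sketch: sparse-code an activation over a concept dictionary, then subtract
# the contributions of flagged concepts.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
d, n_concepts = 128, 32
D = rng.standard_normal((d, n_concepts))   # concept dictionary, one column per concept
D /= np.linalg.norm(D, axis=0)
h = rng.standard_normal(d)                 # one activation vector

code = Lasso(alpha=0.05, fit_intercept=False).fit(D, h).coef_   # sparse code for h
undesired = [3, 17]                        # hypothetical indices of concepts to remove
h_clean = h - D[:, undesired] @ code[undesired]
```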
Provably Transformers Harness Multi-Concept Word Semantics for Efficient In-Context Learning
Transformer-based large language models (LLMs) have displayed remarkable creative prowess and emergence capabilities. Existing empirical studies have revealed a strong …
The Geometry of Categorical and Hierarchical Concepts in Large Language Models
Understanding how semantic meaning is encoded in the representation spaces of large language models is a fundamental problem in interpretability. In this paper, we study the two …
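One geometric check in this line of work is whether hierarchical relations correspond to orthogonal directions. A hedged sketch follows; the concept directions named in the usage comment (an "animal" direction and "dog"/"cat" directions) are hypothetical and would have to be estimated from the model's representation space.

```python
# Sketch of an orthogonality check between a parent-category direction and
# a within-category difference direction.
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical usage with directions estimated from a model:
#   cosine(animal_dir, dog_dir - cat_dir)
# Near-zero cosine is what an orthogonal hierarchy predicts; note that
# random high-dimensional vectors are also near-orthogonal, so a real
# test needs a calibrated baseline.
```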
ResiDual Transformer Alignment with Spectral Decomposition
When examined through the lens of their residual streams, a puzzling property emerges in transformer networks: residual contributions (e.g., attention heads) sometimes specialize in …
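A spectral look at a residual contribution can be as simple as an SVD over one attention head's output vectors cached across many tokens. A minimal sketch with a synthetic stand-in for the cached outputs; the paper's actual analysis is more involved than this.

```python
# Sketch: singular value spectrum of one head's residual-stream writes.
import numpy as np

rng = np.random.default_rng(0)
head_out = rng.standard_normal((1000, 256))   # stand-in for one head's cached outputs

head_out = head_out - head_out.mean(axis=0)   # center before the SVD
_, s, _ = np.linalg.svd(head_out, full_matrices=False)
var = s**2 / np.sum(s**2)
print("variance in top 5 components:", var[:5].sum())
# A sharply concentrated spectrum indicates the head writes into a
# low-rank subspace of the residual stream.
```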