AI alignment: A comprehensive survey
AI alignment aims to make AI systems behave in line with human intentions and values. As
AI systems grow more capable, the potential large-scale risks associated with misaligned AI …
Language model behavior: A comprehensive survey
Transformer language models have received widespread public attention, yet their
generated text is often surprising even to NLP researchers. In this survey, we discuss over …
Mass-editing memory in a transformer
Recent work has shown exciting promise in updating large language models with new
memories, so as to replace obsolete information or add specialized knowledge. However …
Interpretability in the wild: A circuit for indirect object identification in GPT-2 small
Research in mechanistic interpretability seeks to explain behaviors of machine learning
models in terms of their internal components. However, most previous work either focuses …
Dissecting recall of factual associations in auto-regressive language models
Transformer-based language models (LMs) are known to capture factual knowledge in their
parameters. While previous work looked into where factual associations are stored, only little …
Eliciting latent predictions from transformers with the tuned lens
We analyze transformers from the perspective of iterative inference, seeking to understand
how model predictions are refined layer by layer. To do so, we train an affine probe for each …
Birth of a transformer: A memory viewpoint
Large language models based on transformers have achieved great empirical successes.
However, as they are deployed more widely, there is a growing need to better understand …
Finding neurons in a haystack: Case studies with sparse probing
Despite rapid adoption and deployment of large language models (LLMs), the internal
computations of these models remain opaque and poorly understood. In this work, we seek …
Does circuit analysis interpretability scale? Evidence from multiple-choice capabilities in Chinchilla
Circuit analysis is a promising technique for understanding the internal mechanisms
of language models. However, existing analyses are done in small models far from the state …
A review on large language models: Architectures, applications, taxonomies, open issues and challenges
Large Language Models (LLMs) have recently demonstrated extraordinary capability in various
natural language processing (NLP) tasks, including language translation, text generation …