Large language models and causal inference in collaboration: A comprehensive survey
Causal inference has shown potential in enhancing the predictive accuracy, fairness,
robustness, and explainability of Natural Language Processing (NLP) models by capturing …
Foundational challenges in assuring alignment and safety of large language models
This work identifies 18 foundational challenges in assuring the alignment and safety of large
language models (LLMs). These challenges are organized into three different categories …
Interpretability at scale: Identifying causal mechanisms in Alpaca
Obtaining human-interpretable explanations of large, general-purpose language models is
an urgent goal for AI safety. However, it is just as important that our interpretability methods …
Refusal in language models is mediated by a single direction
Conversational large language models are fine-tuned for both instruction-following and
safety, resulting in models that obey benign requests but refuse harmful ones. While this …
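A common way to probe a finding like the one above is directional ablation: estimate a candidate "refusal direction" as the difference in mean activations between harmful and harmless prompts, then project it out of the residual stream at inference time. Below is a minimal sketch of that idea using PyTorch forward hooks; the model structure, layer path, and prompt sets are placeholders, not the paper's released code.

import torch

def difference_in_means_direction(harmful_acts, harmless_acts):
    # Candidate refusal direction: normalized difference of mean activations.
    # Each input tensor is [num_prompts, hidden_dim] from a chosen layer/position.
    direction = harmful_acts.mean(dim=0) - harmless_acts.mean(dim=0)
    return direction / direction.norm()

def make_ablation_hook(direction):
    # Forward hook that removes the refusal component from a layer's output:
    # x <- x - (x . r_hat) r_hat
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        ablated = hidden - (hidden @ direction).unsqueeze(-1) * direction
        return (ablated, *output[1:]) if isinstance(output, tuple) else ablated
    return hook

# Hypothetical usage on a Hugging Face-style causal LM: register the hook on
# every decoder layer, then generate as usual.
# handles = [layer.register_forward_hook(make_ablation_hook(direction))
#            for layer in model.model.layers]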
Learning transformer programs
Recent research in mechanistic interpretability has attempted to reverse-engineer
Transformer models by carefully inspecting network weights and activations. However, these …
Mechanistic Interpretability for AI Safety--A Review
Understanding AI systems' inner workings is critical for ensuring value alignment and safety.
This review explores mechanistic interpretability: reverse-engineering the computational …
Towards best practices of activation patching in language models: Metrics and methods
Mechanistic interpretability seeks to understand the internal mechanisms of machine
learning models, where localization--identifying the important model components--is a key …
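Activation patching, whose metrics and methodology the entry above examines, runs the model on a clean and a corrupted input and overwrites one component's activation in the corrupted run with its clean counterpart, measuring how much of the clean behavior is restored. A minimal sketch under assumed names (the model, target layer, and metric are placeholders; a real experiment would typically patch only specific token positions):

import torch

@torch.no_grad()
def patch_activation(model, layer, clean_ids, corrupt_ids, metric):
    # Cache `layer`'s output on the clean input, then rerun the corrupted
    # input with that activation patched in; return (baseline, patched) metric values.
    cache = {}

    def save_hook(module, inputs, output):
        cache["clean"] = output[0] if isinstance(output, tuple) else output

    def patch_hook(module, inputs, output):
        if isinstance(output, tuple):
            return (cache["clean"], *output[1:])
        return cache["clean"]

    handle = layer.register_forward_hook(save_hook)
    model(clean_ids)                           # clean run: record the activation
    handle.remove()

    baseline = metric(model(corrupt_ids))      # corrupted run, no intervention

    handle = layer.register_forward_hook(patch_hook)
    patched = metric(model(corrupt_ids))       # corrupted run with the clean activation
    handle.remove()
    return baseline, patched

# A common metric choice is the logit difference between a correct and an
# incorrect answer token at the final position.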
Rigorously assessing natural language explanations of neurons
Natural language is an appealing medium for explaining how large language models
process and store information, but evaluating the faithfulness of such explanations is …
ReFT: Representation finetuning for language models
Parameter-efficient fine-tuning (PEFT) methods seek to adapt large models via updates to a
small number of weights. However, much prior interpretability work has shown that …
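ReFT, summarized above, adapts a frozen model by learning interventions on its hidden representations rather than updating weights. A minimal illustrative module in that spirit, applying a rank-r edit h <- h + R^T(W h + b - R h) to hidden states, is sketched below; dimensions, initialization, and training details are assumptions, not the paper's exact implementation.

import torch
import torch.nn as nn

class LowRankInterventionSketch(nn.Module):
    # Learns a low-rank edit of hidden states at chosen layers/positions while
    # the base model's weights stay frozen.
    def __init__(self, hidden_dim: int, rank: int):
        super().__init__()
        self.R = nn.Parameter(torch.empty(rank, hidden_dim))
        nn.init.orthogonal_(self.R)              # orthonormal rows at initialization
        self.proj = nn.Linear(hidden_dim, rank)  # computes W h + b

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: [..., hidden_dim]
        delta = self.proj(hidden) - hidden @ self.R.T   # (W h + b) - R h
        return hidden + delta @ self.R                  # h + R^T (...)

A fuller implementation would keep R orthonormal during training (e.g. via an orthogonal parametrization) and apply the module only at selected token positions.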
Localizing model behavior with path patching
Localizing behaviors of neural networks to a subset of the network's components or a subset
of interactions between components is a natural first step towards analyzing network …