Mechanistic Interpretability for AI Safety--A Review
Understanding AI systems' inner workings is critical for ensuring value alignment and safety.
This review explores mechanistic interpretability: reverse engineering the computational …
A practical review of mechanistic interpretability for transformer-based language models
Mechanistic interpretability (MI) is an emerging sub-field of interpretability that seeks to
understand a neural network model by reverse-engineering its internal computations …
ContextCite: Attributing model generation to context
How do language models use information provided as context when generating a
response? Can we infer whether a particular generated statement is actually grounded in …
Attribution patching outperforms automated circuit discovery
Automated interpretability research has recently attracted attention as a potential research
direction that could scale explanations of neural network behavior to large models. Existing …
Information flow routes: Automatically interpreting language models at scale
Information flows by routes inside the network via mechanisms implemented in the model.
These routes can be represented as graphs where nodes correspond to token …
Decomposing and editing predictions by modeling model computation
How does the internal computation of a machine learning model transform inputs into
predictions? In this paper, we introduce a task called component modeling that aims to …
AI and the Problem of Knowledge Collapse
AJ Peterson - AI & SOCIETY, 2025 - Springer
While artificial intelligence has the potential to process vast amounts of data, generate new
insights, and unlock greater productivity, its widespread adoption may entail unforeseen …
Improving sparse decomposition of language model activations with gated sparse autoencoders
Recent work has found that sparse autoencoders (SAEs) are an effective technique for
unsupervised discovery of interpretable features in language models' (LMs) activations, by …
LLM circuit analyses are consistent across training and scale
Most currently deployed large language models (LLMs) undergo continuous training or
additional finetuning. By contrast, most research into LLMs' internal mechanisms focuses on …
Answer, assemble, ace: Understanding how transformers answer multiple choice questions
Multiple-choice question answering (MCQA) is a key competence of performant transformer
language models that is tested by mainstream benchmarks. However, recent evidence …