Mechanistic Interpretability for AI Safety--A Review
Understanding AI systems' inner workings is critical for ensuring value alignment and safety.
This review explores mechanistic interpretability: reverse engineering the computational …
Gemma Scope: Open sparse autoencoders everywhere all at once on Gemma 2
Sparse autoencoders (SAEs) are an unsupervised method for learning a sparse
decomposition of a neural network's latent representations into seemingly interpretable …
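As context for this and the other SAE-related entries below, here is a minimal sketch of a sparse autoencoder in the sense used here: an unsupervised model trained to reconstruct a network's activations through an overcomplete, L1-penalized hidden layer. The layer sizes, penalty weight, and single training step are illustrative assumptions, not the Gemma Scope configuration.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Decomposes activations into a sparse, overcomplete feature basis."""
    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)  # activation -> feature coefficients
        self.decoder = nn.Linear(d_features, d_model)  # feature coefficients -> reconstruction

    def forward(self, acts: torch.Tensor):
        feats = torch.relu(self.encoder(acts))         # non-negative feature activations
        recon = self.decoder(feats)
        return recon, feats

# Illustrative training step: reconstruction error plus an L1 sparsity penalty.
sae = SparseAutoencoder(d_model=768, d_features=8 * 768)
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
acts = torch.randn(256, 768)                           # stand-in for cached model activations
opt.zero_grad()
recon, feats = sae(acts)
loss = ((recon - acts) ** 2).mean() + 1e-3 * feats.abs().mean()
loss.backward()
opt.step()
```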
A survey on uncertainty quantification of large language models: Taxonomy, open research challenges, and future directions
The remarkable performance of large language models (LLMs) in content generation,
coding, and common-sense reasoning has spurred widespread integration into many facets …
Mechanistic?
N Saphra, S Wiegreffe - arXiv preprint arXiv:2410.09087, 2024 - arxiv.org
The rise of the term "mechanistic interpretability" has accompanied increasing interest in
understanding neural models--particularly language models. However, this jargon has also …
Open Problems in Mechanistic Interpretability
Mechanistic interpretability aims to understand the computational mechanisms underlying
neural networks' capabilities in order to accomplish concrete scientific and engineering …
Partially Rewriting a Transformer in Natural Language
The greatest ambition of mechanistic interpretability is to completely rewrite deep neural
networks in a format that is more amenable to human understanding, while preserving their …
Perspectives for Direct Interpretability in Multi-Agent Deep Reinforcement Learning
Multi-Agent Deep Reinforcement Learning (MADRL) has proven effective in solving complex
problems in robotics or games, yet most of the trained models are hard to interpret. While …
Sparse Autoencoders Do Not Find Canonical Units of Analysis
A common goal of mechanistic interpretability is to decompose the activations of neural
networks into features: interpretable properties of the input computed by the model. Sparse …
Universal Response and Emergence of Induction in LLMs
N Luick - arXiv preprint arXiv:2411.07071, 2024 - arxiv.org
While induction is considered a key mechanism for in-context learning in LLMs,
understanding its precise circuit decomposition beyond toy models remains elusive. Here …
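One common way to operationalize "induction" in mechanistic work is the prefix-matching pattern: on a repeated random-token sequence, an induction head attends from a token's second occurrence to the token that followed its first occurrence. Below is a hedged sketch of that diagnostic, assuming access to per-head attention weights; the scoring rule is the standard heuristic, not necessarily the analysis used in this paper.

```python
import torch

def induction_score(attn: torch.Tensor, block_len: int) -> torch.Tensor:
    """attn: [heads, seq, seq] attention weights on a sequence consisting of a
    random token block of length `block_len` repeated twice (seq == 2 * block_len).
    For a query position i in the second block, the induction target is the token
    that followed the same token in the first block, i.e. position i - block_len + 1.
    Returns the mean attention mass each head places on its induction targets."""
    seq = attn.shape[-1]
    assert seq == 2 * block_len
    queries = torch.arange(block_len, seq)          # positions in the repeated block
    targets = queries - block_len + 1               # token after the first occurrence
    return attn[:, queries, targets].mean(dim=-1)   # one score per head
```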
Analyze Feature Flow to Enhance Interpretation and Steering in Language Models
D Laptev, N Balagansky, Y Aksenov… - arXiv preprint arXiv …, 2025 - arxiv.org
We introduce a new approach to systematically map features discovered by sparse
autoencoders across consecutive layers of large language models, extending earlier work …
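The cross-layer mapping described in this last entry can be approximated by comparing SAE decoder directions between adjacent layers and linking each feature to its most similar successor. A rough sketch under that assumption follows; the cosine-similarity matching and the threshold are illustrative, not necessarily the authors' exact procedure.

```python
import torch
import torch.nn.functional as F

def match_features(decoder_a: torch.Tensor, decoder_b: torch.Tensor, min_sim: float = 0.5):
    """decoder_a, decoder_b: [n_features, d_model] decoder directions of SAEs
    trained on two consecutive layers. Returns, for each feature in layer A,
    the index of the most similar feature in layer B (-1 if below min_sim),
    plus the corresponding similarity scores."""
    a = F.normalize(decoder_a, dim=-1)
    b = F.normalize(decoder_b, dim=-1)
    sims = a @ b.T                                  # [n_a, n_b] cosine similarities
    best_sim, best_idx = sims.max(dim=-1)
    best_idx[best_sim < min_sim] = -1               # drop weak matches
    return best_idx, best_sim
```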