Mechanistic Interpretability for AI Safety -- A Review
Understanding AI systems' inner workings is critical for ensuring value alignment and safety.
This review explores mechanistic interpretability: reverse engineering the computational …
Scaling and evaluating sparse autoencoders
Sparse autoencoders provide a promising unsupervised approach for extracting
interpretable features from a language model by reconstructing activations from a sparse …
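The snippet above describes the core mechanism: activations are passed through a sparse bottleneck and reconstructed, so individual bottleneck units can be read as candidate features. A minimal sketch of that idea, assuming a plain ReLU autoencoder trained with an L1 sparsity penalty (the sizes, coefficient, and training step below are illustrative, not taken from the paper):

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal ReLU sparse autoencoder over model activations (illustrative)."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)   # activations -> sparse feature activations
        self.decoder = nn.Linear(d_hidden, d_model)   # sparse features -> reconstructed activations

    def forward(self, x: torch.Tensor):
        features = torch.relu(self.encoder(x))        # non-negative, encouraged to be sparse
        reconstruction = self.decoder(features)
        return reconstruction, features

# One toy training step: reconstruction loss plus an L1 penalty that encourages sparsity.
d_model, d_hidden, l1_coeff = 512, 4096, 1e-3         # hypothetical sizes and coefficient
sae = SparseAutoencoder(d_model, d_hidden)
optimizer = torch.optim.Adam(sae.parameters(), lr=1e-4)

activations = torch.randn(64, d_model)                # stand-in for captured LM activations
reconstruction, features = sae(activations)
loss = ((reconstruction - activations) ** 2).mean() + l1_coeff * features.abs().mean()
loss.backward()
optimizer.step()
```

The hidden layer is deliberately wider than the activation dimension, giving an overcomplete dictionary in which each unit can, ideally, specialize to a single interpretable concept.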
Gemma Scope: Open sparse autoencoders everywhere all at once on Gemma 2
Sparse autoencoders (SAEs) are an unsupervised method for learning a sparse
decomposition of a neural network's latent representations into seemingly interpretable …
Towards principled evaluations of sparse autoencoders for interpretability and control
Disentangling model activations into meaningful features is a central problem in
interpretability. However, the absence of ground-truth for these features in realistic scenarios …
A primer on the inner workings of transformer-based language models
The rapid progress of research aimed at interpreting the inner workings of advanced
language models has highlighted a need for contextualizing the insights gained from years …
Relational composition in neural networks: A survey and call to action
Many neural nets appear to represent data as linear combinations of "feature vectors."
Algorithms for discovering these vectors have seen impressive recent success. However, we …
Mathematical models of computation in superposition
Superposition--when a neural network represents more "features" than it has dimensions--
seems to pose a serious challenge to mechanistically interpreting current AI systems …
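For intuition about the phenomenon described above, the toy sketch below packs more sparse "features" than dimensions into a single vector using random, nearly orthogonal directions; it is a generic illustration, not the formal model developed in the paper.

```python
import numpy as np

# Toy superposition demo: embed n_features sparse features into d dimensions (n_features > d)
# along random, nearly orthogonal directions, then read them back with a dot product.
rng = np.random.default_rng(0)
d, n_features = 128, 1024                              # more "features" than dimensions
directions = rng.standard_normal((n_features, d))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)

# A sparse input: only a handful of features are active at once.
active = rng.choice(n_features, size=4, replace=False)
x = directions[active].sum(axis=0)                     # superposed representation in R^d

# Linear readout: active features score near 1, inactive near 0, up to interference noise.
scores = directions @ x
recovered = np.argsort(scores)[-4:]
print(sorted(active), sorted(recovered))               # typically match while features stay sparse
```

The readout only works because few features are active at a time; as more features fire simultaneously, interference between the non-orthogonal directions grows and recovery degrades.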
Sparse autoencoders reveal universal feature spaces across large language models
We investigate feature universality in large language models (LLMs), a research field that
aims to understand how different models similarly represent concepts in the latent spaces of …
Disentangling dense embeddings with sparse autoencoders
Sparse autoencoders (SAEs) have shown promise in extracting interpretable features from
complex neural networks. We present one of the first applications of SAEs to dense text …
Explaining AI through mechanistic interpretability
Recent work in explainable artificial intelligence (XAI) attempts to render opaque AI systems
understandable through a divide-and-conquer strategy. However, this fails to illuminate how …