Mechanistic Interpretability for AI Safety--A Review
Understanding AI systems' inner workings is critical for ensuring value alignment and safety.
This review explores mechanistic interpretability: reverse engineering the computational …
A practical review of mechanistic interpretability for transformer-based language models
Mechanistic interpretability (MI) is an emerging sub-field of interpretability that seeks to
understand a neural network model by reverse-engineering its internal computations …
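As a concrete illustration of what "reverse-engineering internal computations" involves in practice, here is a minimal sketch (not code from either review) that captures a transformer's intermediate activations with a PyTorch forward hook; the GPT-2 checkpoint and the choice of block 6 are illustrative assumptions.

```python
# Minimal sketch: capture a transformer's residual-stream activations with a
# forward hook, the basic first step of most mechanistic interpretability work.
# Model and layer choice are assumptions for illustration.
import torch
from transformers import GPT2Model, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2").eval()

captured = {}

def save_hidden(name):
    def hook(module, inputs, output):
        # GPT-2 blocks return a tuple; the hidden states are the first element.
        captured[name] = output[0].detach()
    return hook

# Register a hook on one block (layer 6 chosen arbitrarily).
handle = model.h[6].register_forward_hook(save_hidden("block_6"))

tokens = tokenizer("Mechanistic interpretability reverse-engineers models", return_tensors="pt")
with torch.no_grad():
    model(**tokens)
handle.remove()

print(captured["block_6"].shape)  # (batch, seq_len, hidden_dim) = (1, n, 768)
```

Once cached, activations like these are what downstream MI techniques (probing, patching, circuit analysis) operate on.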
Faith and fate: Limits of transformers on compositionality
Transformer large language models (LLMs) have sparked admiration for their exceptional
performance on tasks that demand intricate multi-step reasoning. Yet, these models …
Towards best practices of activation patching in language models: Metrics and methods
Mechanistic interpretability seeks to understand the internal mechanisms of machine
learning models, where localization--identifying the important model components--is a key …
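To make the localization step concrete, below is a hedged sketch of single-site activation patching (illustrative, not this paper's code): cache a hidden state from a "clean" run, splice it into a "corrupted" run at the same layer and position, and check how far the clean prediction is restored. The prompt pair, layer 8, and the raw-logit metric are all assumptions.

```python
# Sketch of activation patching: overwrite one position of a corrupted run's
# hidden states with the cached clean ones and measure the effect on the
# clean answer's logit. Prompts, layer, and metric are illustrative choices.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

clean = tok("The Eiffel Tower is in the city of", return_tensors="pt")
corrupt = tok("The Colosseum is in the city of", return_tensors="pt")
paris_id = tok(" Paris")["input_ids"][0]  # the clean answer token

LAYER = 8
cache = {}

def save(module, inputs, output):
    cache["clean_h"] = output[0].detach()

def patch(module, inputs, output):
    hs = output[0].clone()
    hs[:, -1] = cache["clean_h"][:, -1]  # splice clean state into final position
    return (hs,) + output[1:]

block = model.transformer.h[LAYER]
with torch.no_grad():
    handle = block.register_forward_hook(save)
    model(**clean)          # clean run: cache the hidden state
    handle.remove()

    handle = block.register_forward_hook(patch)
    logits = model(**corrupt).logits[0, -1]  # corrupted run with the patch
    handle.remove()

print("' Paris' logit after patching:", logits[paris_id].item())
```

In practice the patched logit is commonly normalized between the corrupted and clean baselines, so that 0 means the patch had no effect and 1 means full restoration; the choice of such metrics is exactly what this paper examines.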
Grokked transformers are implicit reasoners: A mechanistic journey to the edge of generalization
We study whether transformers can learn to implicitly reason over parametric knowledge, a
skill that even the most capable language models struggle with. Focusing on two …
A primer on the inner workings of transformer-based language models
The rapid progress of research aimed at interpreting the inner workings of advanced
language models has highlighted a need for contextualizing the insights gained from years …
Benign overfitting and grokking in ReLU networks for XOR cluster data
Neural networks trained by gradient descent (GD) have exhibited a number of surprising
generalization behaviors. First, they can achieve a perfect fit to noisy training data and still …
Dichotomy of early and late phase implicit biases can provably induce grokking
Recent work by Power et al. (2022) highlighted a surprising "grokking" phenomenon in
learning arithmetic tasks: a neural net first "memorizes" the training set, resulting in perfect …
The heuristic core: Understanding subnetwork generalization in pretrained language models
Prior work has found that pretrained language models (LMs) fine-tuned with different
random seeds can achieve similar in-domain performance but generalize differently on tests …
random seeds can achieve similar in-domain performance but generalize differently on tests …
Grokking as the transition from lazy to rich training dynamics
T Kumar - 2024 - dash.harvard.edu
We study the recently discovered “grokking” phenomenon in deep learning [Power et al.,
2022], where neural networks generalize to unseen data abruptly, long after memorizing …
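For readers who want to reproduce the basic phenomenon, the following is an illustrative sketch (architecture and hyperparameters are assumptions, not values from any of the works above): a small MLP trained on modular addition with strong weight decay, logging train versus test accuracy so the delayed jump in generalization becomes visible.

```python
# Illustrative grokking setup: modular addition with heavy weight decay.
# Watch for train accuracy reaching 1.0 long before test accuracy moves.
import torch
import torch.nn as nn
import torch.nn.functional as F

P = 97  # modulus
pairs = torch.cartesian_prod(torch.arange(P), torch.arange(P))
labels = (pairs[:, 0] + pairs[:, 1]) % P
x = torch.cat([F.one_hot(pairs[:, 0], P), F.one_hot(pairs[:, 1], P)], dim=1).float()

perm = torch.randperm(len(x))
split = len(x) // 2  # train on half the pairs, hold out the rest
tr, te = perm[:split], perm[split:]

model = nn.Sequential(nn.Linear(2 * P, 256), nn.ReLU(), nn.Linear(256, P))
# Strong weight decay is the ingredient most often tied to grokking.
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1.0)

for step in range(20000):
    opt.zero_grad()
    loss = F.cross_entropy(model(x[tr]), labels[tr])
    loss.backward()
    opt.step()
    if step % 1000 == 0:
        with torch.no_grad():
            tr_acc = (model(x[tr]).argmax(-1) == labels[tr]).float().mean()
            te_acc = (model(x[te]).argmax(-1) == labels[te]).float().mean()
        print(f"step {step:6d}  train {tr_acc:.2f}  test {te_acc:.2f}")
```

A large gap between perfect train accuracy and near-chance test accuracy, followed much later by an abrupt rise in test accuracy, is the signature these grokking papers analyze.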