Mechanistic Interpretability for AI Safety--A Review

L Bereska, E Gavves - arXiv preprint arXiv:2404.14082, 2024 - arxiv.org
Understanding AI systems' inner workings is critical for ensuring value alignment and safety.
This review explores mechanistic interpretability: reverse engineering the computational …

A practical review of mechanistic interpretability for transformer-based language models

D Rai, Y Zhou, S Feng, A Saparov, Z Yao - arXiv preprint arXiv …, 2024 - arxiv.org
Mechanistic interpretability (MI) is an emerging sub-field of interpretability that seeks to
understand a neural network model by reverse-engineering its internal computations …

Faith and fate: Limits of transformers on compositionality

N Dziri, X Lu, M Sclar, XL Li, L Jiang… - Advances in …, 2023 - proceedings.neurips.cc
Transformer large language models (LLMs) have sparked admiration for their exceptional
performance on tasks that demand intricate multi-step reasoning. Yet, these models …

Towards best practices of activation patching in language models: Metrics and methods

F Zhang, N Nanda - arXiv preprint arXiv:2309.16042, 2023 - arxiv.org
Mechanistic interpretability seeks to understand the internal mechanisms of machine
learning models, where localization--identifying the important model components--is a key …
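
A minimal sketch of the general activation-patching recipe this paper evaluates: cache an activation from a clean run, splice it into a corrupted run, and measure how much of the clean behavior returns. The model, prompts, layer index, and log-prob metric below are illustrative assumptions, not the paper's prescribed settings (metric choice is in fact one of its central concerns).

```python
# Activation-patching sketch (illustrative; layer, prompts, and metric are
# arbitrary choices, not the paper's recommendations).
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
tok = GPT2TokenizerFast.from_pretrained("gpt2")

clean_ids = tok("The Eiffel Tower is located in the city of", return_tensors="pt").input_ids
corrupt_ids = tok("The Great Pyramid is located in the city of", return_tensors="pt").input_ids
answer_id = tok(" Paris").input_ids[0]  # answer token for the clean prompt

LAYER = 6  # arbitrary middle layer
cache = {}

def save_hook(module, inputs, output):
    # GPT-2 block outputs are tuples; output[0] is the residual-stream state.
    cache["h"] = output[0][:, -1, :].detach().clone()

def patch_hook(module, inputs, output):
    # Overwrite the final-position activation with the cached clean one.
    output[0][:, -1, :] = cache["h"]

with torch.no_grad():
    # 1) Clean run: cache the chosen layer's activation at the last position.
    handle = model.transformer.h[LAYER].register_forward_hook(save_hook)
    clean_logits = model(clean_ids).logits[0, -1]
    handle.remove()

    # 2) Corrupted run without patching, as a baseline.
    corrupt_logits = model(corrupt_ids).logits[0, -1]

    # 3) Corrupted run with the clean activation patched in.
    handle = model.transformer.h[LAYER].register_forward_hook(patch_hook)
    patched_logits = model(corrupt_ids).logits[0, -1]
    handle.remove()

for name, logits in [("clean", clean_logits), ("corrupt", corrupt_logits),
                     ("patched", patched_logits)]:
    print(name, torch.log_softmax(logits, -1)[answer_id].item())
```

If patching this layer restores the " Paris" log-prob toward the clean run's value, the patched component is implicated in the behavior; sweeping layers and positions yields the localization maps the paper studies.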

Grokked transformers are implicit reasoners: A mechanistic journey to the edge of generalization

B Wang, X Yue, Y Su, H Sun - arXiv preprint arXiv:2405.15071, 2024 - arxiv.org
We study whether transformers can learn to implicitly reason over parametric knowledge, a
skill that even the most capable language models struggle with. Focusing on two …

A primer on the inner workings of transformer-based language models

J Ferrando, G Sarti, A Bisazza, M Costa-jussà - 2024 - research.rug.nl
The rapid progress of research aimed at interpreting the inner workings of advanced
language models has highlighted a need for contextualizing the insights gained from years …

Benign overfitting and grokking in relu networks for xor cluster data

Z Xu, Y Wang, S Frei, G Vardi, W Hu - arXiv preprint arXiv:2310.02541, 2023 - arxiv.org
Neural networks trained by gradient descent (GD) have exhibited a number of surprising
generalization behaviors. First, they can achieve a perfect fit to noisy training data and still …
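
The setup is easy to simulate at toy scale: Gaussian clusters in an XOR arrangement, a fraction of training labels flipped, and a two-layer ReLU net trained on the logistic loss. This is only an illustration of the phenomenon, not the paper's regime (their theory concerns specific high-dimensional scalings); the cluster separation, width, learning rate, and step count below are arbitrary.

```python
# Toy XOR-cluster benign-overfitting sketch (scales are illustrative).
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
d, n, noise_rate = 50, 200, 0.1

mu1 = torch.zeros(d); mu1[0] = 5.0   # +/- mu1 clusters carry label +1
mu2 = torch.zeros(d); mu2[1] = 5.0   # +/- mu2 clusters carry label -1

def sample(m):
    centers = torch.stack([mu1, -mu1, mu2, -mu2])
    idx = torch.randint(0, 4, (m,))
    x = centers[idx] + torch.randn(m, d)
    y = torch.where(idx < 2, 1.0, -1.0)  # XOR labeling of the four clusters
    return x, y

Xtr, ytr = sample(n)
Xte, yte = sample(1000)
flip = torch.rand(n) < noise_rate
ytr_noisy = torch.where(flip, -ytr, ytr)  # label noise on the train set

net = nn.Sequential(nn.Linear(d, 100), nn.ReLU(), nn.Linear(100, 1))
opt = torch.optim.SGD(net.parameters(), lr=0.05)

for step in range(10000):  # full-batch GD on the logistic loss
    opt.zero_grad()
    margins = ytr_noisy * net(Xtr).squeeze(-1)
    F.softplus(-margins).mean().backward()
    opt.step()

with torch.no_grad():
    tr_acc = ((net(Xtr).squeeze(-1) * ytr_noisy) > 0).float().mean()
    te_acc = ((net(Xte).squeeze(-1) * yte) > 0).float().mean()
# Benign overfitting: train accuracy approaches 1 on the noisy labels
# while clean test accuracy stays high.
print(f"noisy-train acc {tr_acc:.2f}, clean-test acc {te_acc:.2f}")
```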

Dichotomy of early and late phase implicit biases can provably induce grokking

K Lyu, J Jin, Z Li, SS Du, JD Lee, W Hu - arXiv preprint arXiv:2311.18817, 2023 - arxiv.org
Recent work by Power et al. (2022) highlighted a surprising "grokking" phenomenon in
learning arithmetic tasks: a neural net first "memorizes" the training set, resulting in perfect …
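
The phenomenon is straightforward to reproduce at toy scale. The sketch below trains a small MLP on modular addition with heavy weight decay, in the spirit of the Power et al. setup (which used a transformer); the modulus, width, training fraction, and optimizer settings are illustrative assumptions, and seeing a sharp transition may require tuning them.

```python
# Toy grokking sketch: modular addition with an MLP (settings illustrative).
import torch
import torch.nn as nn

P = 97  # task: predict (a + b) mod P
pairs = torch.cartesian_prod(torch.arange(P), torch.arange(P))
labels = (pairs[:, 0] + pairs[:, 1]) % P
# One-hot encode the two operands and concatenate.
X = torch.cat([torch.eye(P)[pairs[:, 0]], torch.eye(P)[pairs[:, 1]]], dim=1)

# A small training fraction plus weight decay is the regime where grokking
# is typically reported.
perm = torch.randperm(len(X), generator=torch.Generator().manual_seed(0))
n_train = int(0.3 * len(X))
tr, te = perm[:n_train], perm[n_train:]

model = nn.Sequential(nn.Linear(2 * P, 256), nn.ReLU(), nn.Linear(256, P))
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1.0)
loss_fn = nn.CrossEntropyLoss()

for step in range(1, 20001):  # full-batch training
    opt.zero_grad()
    loss_fn(model(X[tr]), labels[tr]).backward()
    opt.step()
    if step % 1000 == 0:
        with torch.no_grad():
            tr_acc = (model(X[tr]).argmax(-1) == labels[tr]).float().mean()
            te_acc = (model(X[te]).argmax(-1) == labels[te]).float().mean()
        # Expected pattern: train accuracy saturates early while test
        # accuracy sits near chance, then jumps long afterward (grokking).
        print(f"step {step:6d}  train {tr_acc:.2f}  test {te_acc:.2f}")
```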

The heuristic core: Understanding subnetwork generalization in pretrained language models

A Bhaskar, D Friedman, D Chen - arXiv preprint arXiv:2403.03942, 2024 - arxiv.org
Prior work has found that pretrained language models (LMs) fine-tuned with different
random seeds can achieve similar in-domain performance but generalize differently on tests …

Grokking as the transition from lazy to rich training dynamics

T Kumar - 2024 - dash.harvard.edu
We study the recently discovered “grokking” phenomenon in deep learning [Power et al.,
2022], where neural networks generalize to unseen data abruptly, long after memorizing …