Going beyond linear mode connectivity: The layerwise linear feature connectivity

Z Zhou, Y Yang, X Yang, J Yan… - Advances in Neural …, 2023 - proceedings.neurips.cc
Recent work has revealed many intriguing empirical phenomena in neural network training,
despite the poorly understood and highly complex loss landscapes and training dynamics …
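
The layerwise linear feature connectivity (LLFC) property concerns whether, at every layer, the features of a weight-interpolated network track the linear interpolation of the two endpoint networks' features. Below is a minimal sketch of how one might probe this on two small MLPs; the architecture, the cosine-similarity check, and the assumption that both nets are already trained are illustrative, not the authors' protocol.

```python
# Minimal sketch: probing layerwise linear feature connectivity (LLFC)
# between two trained MLPs. Hypothetical setup, not the authors' exact protocol.
import torch
import torch.nn as nn

def mlp():
    return nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 64), nn.ReLU())

net_a, net_b = mlp(), mlp()   # assume both have already been trained (omitted here)
alpha = 0.5
x = torch.randn(128, 32)

# Build the weight-interpolated network: theta = (1 - alpha) * theta_a + alpha * theta_b
net_mix = mlp()
with torch.no_grad():
    for p_mix, p_a, p_b in zip(net_mix.parameters(), net_a.parameters(), net_b.parameters()):
        p_mix.copy_((1 - alpha) * p_a + alpha * p_b)

def features(net, x):
    """Collect post-activation features layer by layer."""
    feats, h = [], x
    for layer in net:
        h = layer(h)
        if isinstance(layer, nn.ReLU):
            feats.append(h)
    return feats

# LLFC predicts: at every layer, the interpolated net's features stay close
# (up to scaling) to the interpolation of the two endpoint nets' features.
with torch.no_grad():
    for f_mix, f_a, f_b in zip(features(net_mix, x), features(net_a, x), features(net_b, x)):
        f_interp = (1 - alpha) * f_a + alpha * f_b
        cos = torch.nn.functional.cosine_similarity(f_mix.flatten(), f_interp.flatten(), dim=0)
        print(f"layerwise cosine similarity: {cos.item():.3f}")
```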

Dichotomy of early and late phase implicit biases can provably induce grokking

K Lyu, J Jin, Z Li, SS Du, JD Lee, W Hu - arxiv preprint arxiv:2311.18817, 2023 - arxiv.org
Recent work by Power et al. (2022) highlighted a surprising "grokking" phenomenon in
learning arithmetic tasks: a neural net first "memorizes" the training set, resulting in perfect …
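
The grokking setup referenced here (Power et al., 2022) trains a small model on a modular-arithmetic task with weight decay and watches test accuracy jump long after the training set is memorized. Below is a minimal sketch of that recipe; the architecture and hyperparameters are illustrative placeholders, not values from this paper.

```python
# Minimal sketch of the standard grokking setup on modular addition (a + b mod p):
# small model, strong weight decay, long training. Hyperparameters are illustrative.
import torch
import torch.nn as nn

p = 97
pairs = torch.cartesian_prod(torch.arange(p), torch.arange(p))
labels = (pairs[:, 0] + pairs[:, 1]) % p
perm = torch.randperm(len(pairs))
n_train = len(pairs) // 2                      # 50% train / 50% test split
train_idx, test_idx = perm[:n_train], perm[n_train:]

model = nn.Sequential(nn.Embedding(p, 128), nn.Flatten(), nn.Linear(256, 256),
                      nn.ReLU(), nn.Linear(256, p))
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1.0)
loss_fn = nn.CrossEntropyLoss()

def accuracy(idx):
    with torch.no_grad():
        return (model(pairs[idx]).argmax(-1) == labels[idx]).float().mean().item()

for step in range(50_000):                     # grokking typically needs many steps
    opt.zero_grad()
    loss_fn(model(pairs[train_idx]), labels[train_idx]).backward()
    opt.step()
    if step % 1000 == 0:
        # Train accuracy saturates early; test accuracy jumps much later ("grokking").
        print(step, accuracy(train_idx), accuracy(test_idx))
```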

Fourier circuits in neural networks: Unlocking the potential of large language models in mathematical reasoning and modular arithmetic

J Gu, C Li, Y Liang, Z Shi, Z Song… - arxiv preprint arxiv …, 2024 - openreview.net
In the evolving landscape of machine learning, a pivotal challenge lies in deciphering the
internal representations harnessed by neural networks and Transformers. Building on recent …
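
The "Fourier circuit" view is that networks solve modular addition by representing inputs as sinusoids at a few frequencies and combining them with trigonometric identities. The toy numpy sketch below illustrates the underlying identity; the frequency choices are arbitrary and the construction is not the paper's exact circuit.

```python
# Toy numpy sketch of the Fourier view of modular addition: scoring each candidate
# answer c with cos(2*pi*k*(a + b - c)/p) peaks exactly at c = (a + b) mod p.
# Frequencies and sizes here are arbitrary illustrations.
import numpy as np

p = 61
ks = [3, 7, 11]                      # a few frequencies (arbitrary choice)
a, b = 24, 50
target = (a + b) % p

logits = np.zeros(p)
for k in ks:
    wa, wb = 2 * np.pi * k * a / p, 2 * np.pi * k * b / p
    for c in range(p):
        wc = 2 * np.pi * k * c / p
        # cos(wa + wb - wc) equals 1 exactly when c = (a + b) mod p, and it
        # expands into products of cos/sin terms that a hidden layer can compute.
        logits[c] += np.cos(wa + wb - wc)

print("argmax:", logits.argmax(), "true (a + b) mod p:", target)
```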

Deep networks always grok and here is why

AI Humayun, R Balestriero, R Baraniuk - arxiv preprint arxiv:2402.15555, 2024 - arxiv.org
Grokking, or delayed generalization, is a phenomenon where generalization in a deep
neural network (DNN) occurs long after achieving near zero training error. Previous studies …

Unveil benign overfitting for transformer in vision: Training dynamics, convergence, and generalization

J Jiang, W Huang, M Zhang, T Suzuki, L Nie - arxiv preprint arxiv …, 2024 - arxiv.org
Transformers have demonstrated great power in the recent development of large
foundational models. In particular, the Vision Transformer (ViT) has brought revolutionary …

Why do you grok? a theoretical analysis of grokking modular addition

MA Mohamadi, Z Li, L Wu, DJ Sutherland - arxiv preprint arxiv …, 2024 - arxiv.org
We present a theoretical explanation of the "grokking" phenomenon, where a model
generalizes long after overfitting, for the originally studied problem of modular addition. First …

Benign overfitting in single-head attention

R Magen, S Shang, Z Xu, S Frei, W Hu… - arxiv preprint arxiv …, 2024 - arxiv.org
The phenomenon of benign overfitting, where a trained neural network perfectly fits noisy
training data but still achieves near-optimal test performance, has been extensively studied …
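
Benign-overfitting experiments of this kind typically corrupt a fraction of training labels, train an over-parameterized model to zero training error, and then measure accuracy on clean test data. A toy sketch of that measurement follows; the linear ground truth, noise rate, and MLP (rather than an attention model) are stand-in assumptions, not the paper's setting.

```python
# Toy sketch of the measurement behind "benign overfitting": fit an over-parameterized
# model on partially noise-corrupted labels, then evaluate on a clean test set.
import torch
import torch.nn as nn

torch.manual_seed(0)
d, n_train, n_test, noise_rate = 20, 200, 2000, 0.15
w_star = torch.randn(d)                                  # ground-truth linear rule
X_train, X_test = torch.randn(n_train, d), torch.randn(n_test, d)
y_train = (X_train @ w_star > 0).long()
y_test = (X_test @ w_star > 0).long()
flip = torch.rand(n_train) < noise_rate                  # corrupt some training labels
y_train[flip] = 1 - y_train[flip]

model = nn.Sequential(nn.Linear(d, 512), nn.ReLU(), nn.Linear(512, 2))  # over-parameterized
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for step in range(5000):
    opt.zero_grad()
    loss_fn(model(X_train), y_train).backward()
    opt.step()

with torch.no_grad():
    train_acc = (model(X_train).argmax(-1) == y_train).float().mean()  # fit to noisy labels
    test_acc = (model(X_test).argmax(-1) == y_test).float().mean()     # clean-label test accuracy
print(f"train acc (noisy labels): {train_acc:.3f}, test acc (clean labels): {test_acc:.3f}")
```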

Interpreting grokked transformers in complex modular arithmetic

H Furuta, G Minegishi, Y Iwasawa, Y Matsuo - arxiv preprint arxiv:2402.16726, 2024 - arxiv.org
Grokking has been actively explored to reveal the mystery of delayed generalization.
Identifying interpretable algorithms inside the grokked models is a suggestive hint to …

Approaching deep learning through the spectral dynamics of weights

D Yunis, KK Patel, S Wheeler, P Savarese… - arxiv preprint arxiv …, 2024 - arxiv.org
We propose an empirical approach centered on the spectral dynamics of weights--the
behavior of singular values and vectors during optimization--to unify and clarify several …
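
Studying spectral dynamics amounts to logging the singular values (and, in the paper, singular vectors) of each weight matrix at checkpoints during training. A minimal sketch of such logging follows; the toy task and architecture are placeholders rather than the paper's experimental setup.

```python
# Minimal sketch of tracking spectral dynamics: record the singular values of each
# weight matrix at checkpoints during training. Toy task and model are placeholders.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))
opt = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.randn(256, 20), torch.randint(0, 2, (256,))
loss_fn = nn.CrossEntropyLoss()

spectra = []                                      # one snapshot per checkpoint
for step in range(2001):
    opt.zero_grad()
    loss_fn(model(x), y).backward()
    opt.step()
    if step % 500 == 0:
        snapshot = {}
        with torch.no_grad():
            for name, param in model.named_parameters():
                if param.ndim == 2:               # weight matrices only
                    snapshot[name] = torch.linalg.svdvals(param)
        spectra.append(snapshot)
        top = snapshot["0.weight"][:3]
        print(f"step {step}: top singular values of first layer = {top.tolist()}")
```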

Benign or not-benign overfitting in token selection of attention mechanism

K Sakamoto, I Sato - arxiv preprint arxiv:2409.17625, 2024 - arxiv.org
Modern over-parameterized neural networks can be trained to fit the training data perfectly
while still maintaining high generalization performance. This "benign overfitting" …