Going beyond linear mode connectivity: The layerwise linear feature connectivity
Recent work has revealed many intriguing empirical phenomena in neural network training,
despite the poorly understood and highly complex loss landscapes and training dynamics …
Dichotomy of early and late phase implicit biases can provably induce grokking
Recent work by Power et al. (2022) highlighted a surprising "grokking" phenomenon in
learning arithmetic tasks: a neural net first "memorizes" the training set, resulting in perfect …
Fourier circuits in neural networks: Unlocking the potential of large language models in mathematical reasoning and modular arithmetic
In the evolving landscape of machine learning, a pivotal challenge lies in deciphering the
internal representations harnessed by neural networks and Transformers. Building on recent …
Deep networks always grok and here is why
Grokking, or delayed generalization, is a phenomenon where generalization in a deep
neural network (DNN) occurs long after achieving near zero training error. Previous studies …
Unveil benign overfitting for transformer in vision: Training dynamics, convergence, and generalization
Transformers have demonstrated great power in the recent development of large
foundational models. In particular, the Vision Transformer (ViT) has brought revolutionary …
Why do you grok? a theoretical analysis of grokking modular addition
We present a theoretical explanation of the "grokking" phenomenon, where a model
generalizes long after overfitting, for the originally-studied problem of modular addition. First …
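For orientation, the modular-addition task referenced by several of these grokking papers can be sketched in a few lines. The snippet below is a minimal, illustrative setup only (the choice of p = 97, the train fraction, and the function name are assumptions, not taken from any of the cited works):

```python
# Minimal sketch of the modular-addition dataset studied in the grokking literature.
# p = 97 and train_frac = 0.3 are illustrative choices, not values from the cited papers.
import itertools
import random

def make_modular_addition_split(p=97, train_frac=0.3, seed=0):
    # Enumerate all p^2 input pairs (a, b) with label (a + b) mod p.
    pairs = [((a, b), (a + b) % p) for a, b in itertools.product(range(p), repeat=2)]
    random.Random(seed).shuffle(pairs)
    cut = int(train_frac * len(pairs))
    return pairs[:cut], pairs[cut:]  # (train split, test split)

train, test = make_modular_addition_split()
print(len(train), len(test))  # 2822 train / 6587 test examples for p = 97
```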
Benign overfitting in single-head attention
The phenomenon of benign overfitting, where a trained neural network perfectly fits noisy
training data but still achieves near-optimal test performance, has been extensively studied …
Interpreting grokked transformers in complex modular arithmetic
Grokking has been actively explored to reveal the mystery of delayed generalization.
Identifying interpretable algorithms inside the grokked models is a suggestive hint to …
Approaching deep learning through the spectral dynamics of weights
We propose an empirical approach centered on the spectral dynamics of weights--the
behavior of singular values and vectors during optimization--to unify and clarify several …
Benign or not-benign overfitting in token selection of attention mechanism
K. Sakamoto, I. Sato. arXiv preprint arXiv:2409.17625, 2024.
Modern over-parameterized neural networks can be trained to fit the training data perfectly
while still maintaining high generalization performance. This "benign overfitting" …