Average gradient outer product as a mechanism for deep neural collapse
Deep Neural Collapse (DNC) refers to the surprisingly rigid structure of the data
representations in the final layers of Deep Neural Networks (DNNs). Though the …
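The central quantity here, the average gradient outer product (AGOP), is directly computable. A minimal numpy sketch, with an arbitrary untrained two-layer model standing in for a trained DNN (the architecture and sizes are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy two-layer net f(x) = W2 @ tanh(W1 @ x); weights arbitrary for the demo.
d, h, k = 10, 32, 3
W1 = rng.normal(size=(h, d)) / np.sqrt(d)
W2 = rng.normal(size=(k, h)) / np.sqrt(h)

def jacobian(x):
    """Exact df/dx for this model: J = W2 @ diag(tanh'(W1 x)) @ W1, shape (k, d)."""
    s = 1.0 - np.tanh(W1 @ x) ** 2
    return (W2 * s) @ W1

# AGOP = (1/n) * sum_i J(x_i)^T J(x_i): a d x d PSD matrix whose top
# eigenvectors are the input directions the model is most sensitive to.
X = rng.normal(size=(100, d))
agop = sum(jacobian(x).T @ jacobian(x) for x in X) / len(X)
print("top AGOP eigenvalues:", np.linalg.eigvalsh(agop)[-3:])
```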
Benign Oscillation of Stochastic Gradient Descent with Large Learning Rates
In this work, we theoretically investigate the generalization properties of neural networks
(NNs) trained by stochastic gradient descent (SGD) with large learning rates. Under …
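The oscillation in question is visible even in one dimension: gradient descent on a quadratic with step size between 1/lam and 2/lam flips sign every iterate yet still contracts. A minimal sketch of that mechanism (the scalar quadratic is a stand-in for the paper's NN setting, not its actual model):

```python
lam = 2.0                  # curvature of L(w) = lam / 2 * w**2
eta = 0.9 * (2.0 / lam)    # "large" step size: inside (1/lam, 2/lam)
w = 1.0
for t in range(8):
    w -= eta * lam * w     # GD step; contraction factor 1 - eta*lam = -0.8
    print(f"t={t}  w={w:+.4f}  loss={0.5 * lam * w**2:.4f}")
# w flips sign every step (oscillation) while |w| shrinks geometrically.
```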
Emergence in non-neural models: grokking modular arithmetic via average gradient outer product
Neural networks trained to solve modular arithmetic tasks exhibit grokking, a phenomenon
where the test accuracy starts improving long after the model achieves 100% training …
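The underlying task is standard in the grokking literature: learn (a + b) mod p from one-hot encoded operand pairs, training on a fraction of all p^2 pairs and tracking held-out accuracy long after train accuracy saturates. A minimal sketch of that data setup (the encoding and split fraction are common conventions, assumed rather than taken from the paper):

```python
import numpy as np

p = 17  # modulus; small for illustration
pairs = [(a, b) for a in range(p) for b in range(p)]

def encode(a, b):
    """Concatenate one-hot codes of a and b into a length-2p input vector."""
    x = np.zeros(2 * p)
    x[a] = 1.0
    x[p + b] = 1.0
    return x

X = np.stack([encode(a, b) for a, b in pairs])
y = np.array([(a + b) % p for a, b in pairs])

# Grokking setups train on a fraction of all p**2 pairs and monitor
# accuracy on the held-out remainder long after train accuracy saturates.
rng = np.random.default_rng(0)
idx = rng.permutation(len(pairs))
n_train = len(pairs) // 2
train_idx, test_idx = idx[:n_train], idx[n_train:]
print("train:", X[train_idx].shape, " test:", X[test_idx].shape)
```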
From stability to chaos: Analyzing gradient descent dynamics in quadratic regression
We conduct a comprehensive investigation into the dynamics of gradient descent using
large constant step sizes in the context of quadratic regression models. Within this …
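The stability-to-chaos transition can be reproduced with a scalar loss that is quadratic in the model, mirroring the paper's setting in one dimension. A minimal sketch (the particular loss L(theta) = (theta^2 - 1)^2 / 4 is an illustrative assumption):

```python
import numpy as np

def run_gd(eta, theta0=0.5, steps=2000, keep=20):
    """GD on L(theta) = (theta**2 - 1)**2 / 4; return the trailing iterates."""
    theta = theta0
    tail = []
    for t in range(steps):
        theta -= eta * (theta**3 - theta)   # exact gradient of L
        if t >= steps - keep:
            tail.append(theta)
    return np.array(tail)

for eta in [0.5, 1.2, 1.8, 1.95]:
    tail = run_gd(eta)
    print(f"eta={eta:4}: {len(np.unique(tail.round(6)))} distinct limiting values")
# Small eta: a single fixed point. Increasing eta: period-2, period-4, ...
# orbits, and eventually chaotic trajectories that never settle.
```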
From Lazy to Rich: Exact Learning Dynamics in Deep Linear Networks
Biological and artificial neural networks develop internal representations that enable them to
perform complex tasks. In artificial networks, the effectiveness of these models relies on their …
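The exact solvability in the title is easiest to appreciate in the scalar deep linear case, where the end-to-end map w2*w1 follows closed-form sigmoidal dynamics under gradient flow. A minimal sketch contrasting a small (rich) and a larger (lazier) initialization scale (the scalar reduction is an illustrative assumption, not the paper's general model):

```python
import numpy as np

def train(init_scale, target=2.0, lr=1e-3, steps=6000):
    """Two-layer scalar linear net f(x) = w2 * w1 * x, fit to y = target * x."""
    w1 = w2 = init_scale
    traj = []
    for t in range(steps):
        err = w1 * w2 - target                      # end-to-end residual
        w1, w2 = w1 - lr * err * w2, w2 - lr * err * w1
        if t % 1000 == 0:
            traj.append(w1 * w2)
    return np.round(traj, 3)

print("small init  (rich, plateau then sigmoidal):", train(0.01))
print("larger init (fast, no plateau):            ", train(1.0))
```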
Gradient descent induces alignment between weights and the empirical NTK for deep non-linear networks
Understanding the mechanisms through which neural networks extract statistics from input-
label pairs is one of the most important unsolved problems in supervised learning. Prior …
Feature learning as alignment: a structural property of gradient descent in non-linear neural networks
Understanding the mechanisms by which neural networks extract statistics from input-
label pairs through feature learning is one of the most important unsolved problems in …
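One form of the alignment studied in the two entries above (sometimes called the neural feature ansatz) is checkable numerically: after gradient descent, the first-layer matrix W1^T W1 correlates with the network's input AGOP. A minimal numpy sketch on synthetic single-index data (architecture, data model, and hyperparameters are all illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
d, h, n = 8, 64, 512
X = rng.normal(size=(n, d))
y = np.tanh(X[:, 0])                            # single-index target

W1 = rng.normal(size=(h, d)) * 0.5 / np.sqrt(d)
a = rng.normal(size=h) / np.sqrt(h)

def alignment():
    """Pearson correlation between W1^T W1 and the network's input AGOP."""
    pre = X @ W1.T
    G = ((1 - np.tanh(pre) ** 2) * a) @ W1      # per-sample df/dx, shape (n, d)
    agop, nfm = G.T @ G / n, W1.T @ W1
    u = agop.ravel() - agop.mean()
    v = nfm.ravel() - nfm.mean()
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

print(f"alignment at init:  {alignment():.3f}")
for _ in range(3000):                           # plain full-batch GD
    pre = X @ W1.T
    err = np.tanh(pre) @ a - y                  # residuals, shape (n,)
    dh = np.outer(err, a) * (1 - np.tanh(pre) ** 2)
    W1 -= 0.05 / n * dh.T @ X
    a -= 0.05 / n * np.tanh(pre).T @ err
print(f"alignment after GD: {alignment():.3f}")
```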
Get rich quick: exact solutions reveal how unbalanced initializations promote rapid feature learning
While the impressive performance of modern neural networks is often attributed to their
capacity to efficiently extract task-relevant features from data, the mechanisms underlying …
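A variant of the scalar deep-linear sketch above isolates balance rather than scale: hold the initial end-to-end map w1*w2 fixed and compare a balanced split with an unbalanced one. A minimal sketch (again an illustrative scalar reduction, not the paper's exact setting):

```python
import numpy as np

def train(w1, w2, target=2.0, lr=1e-3, steps=4000):
    """f(x) = w2 * w1 * x fit to y = target * x by gradient descent."""
    traj = []
    for t in range(steps):
        err = w1 * w2 - target
        w1, w2 = w1 - lr * err * w2, w2 - lr * err * w1
        if t % 800 == 0:
            traj.append(w1 * w2)
    return np.round(traj, 3)

# Both start with the same tiny end-to-end map w1 * w2 = 1e-4:
print("balanced   (0.01, 0.01):", train(0.01, 0.01))   # long plateau
print("unbalanced (1.00, 1e-4):", train(1.0, 1e-4))    # escapes immediately
```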
Gradient Descent on Logistic Regression with Non-Separable Data and Large Step Sizes
We study gradient descent (GD) dynamics on logistic regression problems with large,
constant step sizes. For linearly separable data, it is known that GD converges to the …
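The regimes the abstract alludes to can be probed directly in one dimension: non-separable labels give the logistic loss a finite minimizer, and the behavior of GD around it depends on the step size. A minimal sketch comparing a small and a deliberately large constant step (the data model and step sizes are illustrative assumptions; the paper characterizes the regimes precisely):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
x = rng.normal(size=n)
# Non-separable 1-D data: 20% of labels are flipped.
y = np.where(rng.random(n) < 0.2, -np.sign(x), np.sign(x))

def loss_and_grad(w):
    m = y * x * w                               # per-sample margins
    return np.mean(np.log1p(np.exp(-m))), np.mean(-y * x / (1 + np.exp(m)))

for eta in [1.0, 25.0]:
    w = 0.0
    for _ in range(200):
        _, g = loss_and_grad(w)
        w -= eta * g
    loss, _ = loss_and_grad(w)
    print(f"eta={eta:>4}: final w={w:+.3f}, final loss={loss:.4f}")
# Non-separability makes the minimizer finite; with a large step the
# iterates can keep bouncing around it rather than settling on it.
```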
Faster Adaptive Optimization via Expected Gradient Outer Product Reparameterization
A. DePavia, V. Charisopoulos, R. Willett - arXiv preprint arXiv:2502.01594, 2025 - arxiv.org
Adaptive optimization algorithms, such as Adagrad, Adam, and their variants, have found
widespread use in machine learning, signal processing and many other settings. Several …
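The reparameterization idea named in the title can be sketched on a toy quadratic: estimate the expected gradient outer product (EGOP) from sampled gradients, rotate into its eigenbasis, and run a diagonal adaptive method there, where the geometry is axis-aligned. This is a minimal illustration of the idea, not the authors' algorithm; the probe distribution, Adagrad variant, and all hyperparameters are assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
d = 5
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))        # hidden rotation
H = Q @ np.diag([10.0, 5.0, 1.0, 0.5, 0.1]) @ Q.T   # ill-conditioned Hessian

def grad(w):
    return H @ w                                # gradient of 0.5 * w^T H w

# Estimate the EGOP, E[grad grad^T], from gradients at random probe points.
W = rng.normal(size=(64, d))
egop = (W @ H).T @ (W @ H) / len(W)             # here this approximates H @ H
_, V = np.linalg.eigh(egop)                     # EGOP eigenbasis

# Reparameterize w = V z and run diagonal Adagrad on z. In the EGOP
# eigenbasis the quadratic is (nearly) axis-aligned, which is exactly the
# geometry a diagonal preconditioner handles well.
w0 = rng.normal(size=d)
print("initial loss:", round(0.5 * w0 @ H @ w0, 4))
z, acc = V.T @ w0, np.zeros(d)
for _ in range(200):
    g = V.T @ grad(V @ z)                       # gradient in z-coordinates
    acc += g**2
    z -= 0.5 * g / (np.sqrt(acc) + 1e-8)
print("final loss:  ", round(0.5 * (V @ z) @ H @ (V @ z), 4))
```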