Average gradient outer product as a mechanism for deep neural collapse

D Beaglehole, P Súkeník, M Mondelli… - arXiv preprint arXiv …, 2024 - arxiv.org
Deep Neural Collapse (DNC) refers to the surprisingly rigid structure of the data
representations in the final layers of Deep Neural Networks (DNNs). Though the …
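
The central object here is the average gradient outer product (AGOP). A minimal sketch of how it is computed for a toy two-layer network is below; the architecture, sizes, and data are illustrative placeholders, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy two-layer network f(x) = w2 . tanh(W1 x); weights are arbitrary placeholders.
d, h, n = 10, 32, 200
W1 = rng.normal(size=(h, d)) / np.sqrt(d)
w2 = rng.normal(size=h) / np.sqrt(h)
X = rng.normal(size=(n, d))

def grad_f(x):
    """Gradient of the scalar output f(x) with respect to the input x."""
    pre = W1 @ x
    return W1.T @ (w2 * (1.0 - np.tanh(pre) ** 2))  # chain rule through tanh

# AGOP: the input-gradient outer products averaged over the data.
agop = sum(np.outer(g, g) for g in map(grad_f, X)) / n

# The top eigendirections indicate the input directions the model is most sensitive to.
print("top AGOP eigenvalues:", np.round(np.linalg.eigvalsh(agop)[::-1][:5], 4))
```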

Benign Oscillation of Stochastic Gradient Descent with Large Learning Rates

M Lu, B Wu, X Yang, D Zou - arXiv preprint arXiv:2310.17074, 2023 - arxiv.org
In this work, we theoretically investigate the generalization properties of neural networks
(NNs) trained by the stochastic gradient descent (SGD) algorithm with large learning rates. Under …
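
The oscillation regime is easiest to see on a one-dimensional quadratic, a deliberately simple stand-in for the paper's neural-network analysis: for step sizes between 1/sharpness and 2/sharpness the iterate flips sign every step yet still converges.

```python
# GD on the scalar quadratic L(w) = lam * w**2 / 2 with sharpness lam.
# For eta in (1/lam, 2/lam) the iterate oscillates in sign but contracts
# by the factor |1 - eta * lam| < 1 per step; beyond 2/lam it diverges.
lam, w0 = 2.0, 1.0
for eta in (0.2, 0.8, 1.05):          # stable, oscillating-yet-convergent, divergent
    w, traj = w0, []
    for _ in range(8):
        w *= 1.0 - eta * lam          # GD step: w <- w - eta * lam * w
        traj.append(round(w, 3))
    print(f"eta={eta:4.2f}: {traj}")
```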

Emergence in non-neural models: grokking modular arithmetic via average gradient outer product

N Mallinar, D Beaglehole, L Zhu… - arXiv preprint arXiv …, 2024 - arxiv.org
Neural networks trained to solve modular arithmetic tasks exhibit grokking, a phenomenon
where the test accuracy starts improving long after the model achieves 100% training …
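
The non-neural mechanism can be sketched as a recursive-feature-machine-style loop: fit a kernel ridge model, replace the kernel's metric with the AGOP of the fit, and repeat. Everything below (a Gaussian rather than Laplace kernel, a toy regression task instead of modular arithmetic, all hyperparameters) is an illustrative simplification.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy task: only the first two of d coordinates matter.
n, d, sigma, ridge = 300, 10, 2.0, 1e-3
X = rng.normal(size=(n, d))
y = np.sin(X[:, 0]) * X[:, 1]

def kernel(A, B, M):
    """Gaussian kernel exp(-||a - b||_M^2 / (2 sigma^2)) with metric M."""
    sq = ((A @ M) * A).sum(1)[:, None] + ((B @ M) * B).sum(1)[None, :] \
         - 2 * (A @ M) @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2 * sigma**2))

M = np.eye(d)
for it in range(5):
    K = kernel(X, X, M)
    alpha = np.linalg.solve(K + ridge * np.eye(n), y)   # kernel ridge fit
    # AGOP of f(x) = sum_i alpha_i k(x, x_i):
    #   grad f(x_j) = -(1 / sigma^2) M @ sum_i alpha_i K[j, i] (x_j - x_i)
    G = np.zeros((d, d))
    for j in range(n):
        g = -(M @ ((K[j] * alpha) @ (X[j] - X))) / sigma**2
        G += np.outer(g, g)
    M = d * G / np.trace(G)        # adopt the AGOP as the new metric (trace-normalized)
    frac = np.trace(M[:2, :2]) / np.trace(M)
    print(f"iteration {it}: metric mass on the 2 relevant coords = {frac:.3f}")
```

The metric concentrating on the relevant coordinates is the AGOP-driven feature learning that the paper credits for grokking in non-neural models.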

From stability to chaos: Analyzing gradient descent dynamics in quadratic regression

X Chen, K Balasubramanian, P Ghosal… - arXiv preprint arXiv …, 2023 - arxiv.org
We conduct a comprehensive investigation into the dynamics of gradient descent using
large-order constant step-sizes in the context of quadratic regression models. Within this …
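
A one-dimensional caricature of this stability-to-chaos transition, assuming nothing from the paper beyond the loss family: GD on L(w) = (w^2 - 1)^2 / 4 moves from convergence through a period-2 cycle to bounded chaotic wandering as the constant step size grows.

```python
# GD on the scalar quartic L(w) = (w**2 - 1)**2 / 4, the simplest quadratic
# regression loss; the update is w <- w - eta * w * (w**2 - 1).
def tail(eta, w=1.3, burn=500, keep=6):
    for _ in range(burn):                 # discard the transient
        w -= eta * w * (w**2 - 1.0)
    out = []
    for _ in range(keep):
        w -= eta * w * (w**2 - 1.0)
        out.append(round(w, 4))
    return out

for eta in (0.5, 1.1, 2.0):               # convergent / period-2 / chaotic regimes
    print(f"eta={eta}:", tail(eta))
```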

From Lazy to Rich: Exact Learning Dynamics in Deep Linear Networks

CCJ Dominé, N Anguita, AM Proca, L Braun… - arXiv preprint arXiv …, 2024 - arxiv.org
Biological and artificial neural networks develop internal representations that enable them to
perform complex tasks. In artificial networks, the effectiveness of these models relies on their …
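
The lazy-versus-rich distinction already shows up in the smallest deep linear network. The sketch below, with made-up sizes and targets, contrasts the sigmoidal plateau of a small (rich) initialization with the immediate decay of an order-one (lazy-like) one.

```python
# Two-layer linear model f(x) = w2 * w1 * x fit to y = 2x by gradient descent.
# Small balanced initializations produce a long plateau followed by a sharp
# sigmoidal drop (rich regime); order-one initializations decay right away.
def ten_percent_time(scale, target=2.0, eta=0.05, steps=2000):
    w1 = w2 = scale
    first_loss = 0.5 * (w1 * w2 - target) ** 2
    for t in range(steps):
        e = w1 * w2 - target
        if 0.5 * e * e < 0.1 * first_loss:
            return t
        w1, w2 = w1 - eta * e * w2, w2 - eta * e * w1
    return steps

for scale in (0.01, 1.0):
    print(f"init scale {scale}: loss reaches 10% of start at step {ten_percent_time(scale)}")
```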

Gradient descent induces alignment between weights and the empirical NTK for deep non-linear networks

D Beaglehole, I Mitliagkas, A Agarwala - arXiv preprint arXiv:2402.05271, 2024 - arxiv.org
Understanding the mechanisms through which neural networks extract statistics from input-
label pairs is one of the most important unsolved problems in supervised learning. Prior …

Feature learning as alignment: a structural property of gradient descent in non-linear neural networks

D Beaglehole, I Mitliagkas… - Transactions on Machine …, 2024 - openreview.net
Understanding the mechanisms through which neural networks extract statistics from input-
label pairs through feature learning is one of the most important unsolved problems in …
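
One way to see the alignment property studied in this paper and its preprint version above is to train a small network and measure the cosine similarity between W1^T W1 and the network's AGOP. The task, sizes, and training budget below are placeholder choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Single-index task y = tanh(2 u.x): one relevant input direction u.
d, h, n, eta = 20, 64, 512, 0.5
u = np.zeros(d)
u[0] = 1.0
X = rng.normal(size=(n, d))
y = np.tanh(2.0 * X @ u)

W1 = rng.normal(size=(h, d)) * 0.1            # small two-layer tanh network
w2 = rng.normal(size=h) * 0.1

for _ in range(2000):                         # full-batch GD on mean squared error
    act = np.tanh(X @ W1.T)                   # (n, h) activations
    r = (act @ w2 - y) / n                    # scaled residuals
    W1 -= eta * ((r[:, None] * (1.0 - act**2)) * w2).T @ X
    w2 -= eta * act.T @ r

# AGOP of the trained network, via the analytic input gradient.
G = np.zeros((d, d))
for xi in X:
    g = W1.T @ (w2 * (1.0 - np.tanh(W1 @ xi) ** 2))
    G += np.outer(g, g)
G /= n

NFM = W1.T @ W1                               # first-layer neural feature matrix
cos = (NFM * G).sum() / (np.linalg.norm(NFM) * np.linalg.norm(G))
print(f"cosine similarity between W1^T W1 and the AGOP: {cos:.3f}")
```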

Get rich quick: exact solutions reveal how unbalanced initializations promote rapid feature learning

D Kunin, A Raventós, C Dominé, F Chen… - arXiv preprint arXiv …, 2024 - arxiv.org
While the impressive performance of modern neural networks is often attributed to their
capacity to efficiently extract task-relevant features from data, the mechanisms underlying …
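
The role of imbalance can be checked in a single-neuron linear model: gradient descent approximately conserves w2**2 - w1**2, and an unbalanced initialization with the same tiny end-to-end weight escapes the small-weight plateau far sooner. All numbers below are illustrative.

```python
# f(x) = w2 * w1 * x fit to y = 2x by GD. Both initializations below have the
# same end-to-end weight w1 * w2 = 1e-4, but the unbalanced one learns faster.
def steps_to_fit(w1, w2, target=2.0, eta=0.01, tol=1e-2, max_steps=100000):
    for t in range(max_steps):
        e = w1 * w2 - target
        if abs(e) < tol:
            return t
        w1, w2 = w1 - eta * e * w2, w2 - eta * e * w1
    return max_steps

print("balanced   w1 = w2 = 1e-2:   ", steps_to_fit(1e-2, 1e-2))
print("unbalanced w1 = 1e-4, w2 = 1:", steps_to_fit(1e-4, 1.0))
```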

Gradient Descent on Logistic Regression with Non-Separable Data and Large Step Sizes

SY Meng, A Orvieto, DY Cao, C De Sa - arXiv preprint arXiv:2406.05033, 2024 - arxiv.org
We study gradient descent (GD) dynamics on logistic regression problems with large,
constant step sizes. For linearly separable data, it is known that GD converges to the …
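
A quick numerical illustration on synthetic non-separable data (20% label noise; all constants made up): with a small step the loss decreases monotonically, while with a very large constant step it spikes and oscillates before settling, which is the regime the paper analyzes.

```python
import numpy as np

rng = np.random.default_rng(0)

# One-dimensional logistic regression; label noise makes the data non-separable.
n = 200
x = rng.normal(size=n)
y = np.sign(x)
y[rng.random(n) < 0.2] *= -1

def loss_and_grad(w):
    m = y * x * w                          # margins
    loss = np.logaddexp(0.0, -m).mean()    # numerically stable logistic loss
    p = 0.5 * (1.0 - np.tanh(m / 2.0))     # sigmoid(-m), stable form
    return loss, -(p * y * x).mean()

for eta in (0.5, 20.0):                    # small vs large constant step size
    w, hist = 0.0, []
    for _ in range(300):
        l, g = loss_and_grad(w)
        hist.append(l)
        w -= eta * g
    print(f"eta={eta:5.1f}: initial {hist[0]:.3f}, peak {max(hist):.3f}, "
          f"mean of last 20 {np.mean(hist[-20:]):.3f}")
```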

Faster Adaptive Optimization via Expected Gradient Outer Product Reparameterization

A DePavia, V Charisopoulos, R Willett - arXiv preprint arXiv:2502.01594, 2025 - arxiv.org
Adaptive optimization algorithms, such as Adagrad, Adam, and their variants, have found
widespread use in machine learning, signal processing, and many other settings. Several …
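
A toy version of the reparameterization idea, with a synthetic ill-conditioned quadratic standing in for a real objective: estimate the expected gradient outer product (EGOP) from sampled gradients, rotate into its eigenbasis, and run diagonal Adagrad there. The problem, sampling scheme, and step size are assumptions for illustration, not the paper's algorithm or guarantees.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic ill-conditioned quadratic with a hidden rotation Q.
d = 20
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))
H = Q @ np.diag(np.logspace(0, 2, d)) @ Q.T

def grad(w):
    return H @ w

# Estimate the EGOP from gradients at random reference points.
G = np.zeros((d, d))
for _ in range(100):
    g = grad(rng.normal(size=d))
    G += np.outer(g, g)
U = np.linalg.eigh(G / 100)[1]                # EGOP eigenbasis

def adagrad(w0, basis, eta=1.0, steps=300, eps=1e-8):
    """Diagonal Adagrad run in the coordinates defined by `basis`."""
    w, acc = w0.copy(), np.zeros(d)
    for _ in range(steps):
        g = basis.T @ grad(w)                 # gradient in the chosen coordinates
        acc += g**2
        w -= basis @ (eta * g / (np.sqrt(acc) + eps))
    return 0.5 * w @ H @ w                    # final loss

w0 = rng.normal(size=d)
print(f"Adagrad, original coords: {adagrad(w0, np.eye(d)):.6f}")
print(f"Adagrad, EGOP eigenbasis: {adagrad(w0, U):.6f}")
```

For this quadratic the EGOP eigenbasis diagonalizes the Hessian, so a diagonal preconditioner in the rotated coordinates acts like a full one; this is the intuition behind the reparameterization.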