Average gradient outer product as a mechanism for deep neural collapse
Deep Neural Collapse (DNC) refers to the surprisingly rigid structure of the data
representations in the final layers of Deep Neural Networks (DNNs). Though the …
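The central quantity here, the average gradient outer product (AGOP), is directly computable. A minimal numpy sketch, with an arbitrary untrained two-layer model standing in for a trained DNN (the architecture and sizes are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy two-layer net f(x) = W2 @ tanh(W1 @ x); weights arbitrary for the demo.
d, h, k = 10, 32, 3
W1 = rng.normal(size=(h, d)) / np.sqrt(d)
W2 = rng.normal(size=(k, h)) / np.sqrt(h)

def jacobian(x):
    """Exact df/dx for this model: J = W2 @ diag(tanh'(W1 x)) @ W1, shape (k, d)."""
    s = 1.0 - np.tanh(W1 @ x) ** 2
    return (W2 * s) @ W1

# AGOP = (1/n) * sum_i J(x_i)^T J(x_i): a d x d PSD matrix whose top
# eigenvectors are the input directions the model is most sensitive to.
X = rng.normal(size=(100, d))
agop = sum(jacobian(x).T @ jacobian(x) for x in X) / len(X)
print("top AGOP eigenvalues:", np.linalg.eigvalsh(agop)[-3:])
```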
Benign Oscillation of Stochastic Gradient Descent with Large Learning Rates
In this work, we theoretically investigate the generalization properties of neural networks
(NNs) trained by stochastic gradient descent (SGD) with large learning rates. Under …
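The oscillation in question is visible even in one dimension: gradient descent on a quadratic with step size between 1/lam and 2/lam flips sign every iterate yet still contracts. A minimal sketch of that mechanism (the scalar quadratic is a stand-in for the paper's NN setting, not its actual model):

```python
lam = 2.0                  # curvature of L(w) = lam / 2 * w**2
eta = 0.9 * (2.0 / lam)    # "large" step size: inside (1/lam, 2/lam)
w = 1.0
for t in range(8):
    w -= eta * lam * w     # GD step; contraction factor 1 - eta*lam = -0.8
    print(f"t={t}  w={w:+.4f}  loss={0.5 * lam * w**2:.4f}")
# w flips sign every step (oscillation) while |w| shrinks geometrically.
```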
Emergence in non-neural models: grokking modular arithmetic via average gradient outer product
Neural networks trained to solve modular arithmetic tasks exhibit grokking, a phenomenon
where the test accuracy starts improving long after the model achieves 100% training …
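The underlying task is standard in the grokking literature: learn (a + b) mod p from one-hot encoded operand pairs, training on a fraction of all p^2 pairs and tracking held-out accuracy long after train accuracy saturates. A minimal sketch of that data setup (the encoding and split fraction are common conventions, assumed rather than taken from the paper):

```python
import numpy as np

p = 17  # modulus; small for illustration
pairs = [(a, b) for a in range(p) for b in range(p)]

def encode(a, b):
    """Concatenate one-hot codes of a and b into a length-2p input vector."""
    x = np.zeros(2 * p)
    x[a] = 1.0
    x[p + b] = 1.0
    return x

X = np.stack([encode(a, b) for a, b in pairs])
y = np.array([(a + b) % p for a, b in pairs])

# Grokking setups train on a fraction of all p**2 pairs and monitor
# accuracy on the held-out remainder long after train accuracy saturates.
rng = np.random.default_rng(0)
idx = rng.permutation(len(pairs))
n_train = len(pairs) // 2
train_idx, test_idx = idx[:n_train], idx[n_train:]
print("train:", X[train_idx].shape, " test:", X[test_idx].shape)
```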
From stability to chaos: Analyzing gradient descent dynamics in quadratic regression
We conduct a comprehensive investigation into the dynamics of gradient descent using
large constant step sizes in the context of quadratic regression models. Within this …
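The stability-to-chaos transition can be reproduced with a scalar loss that is quadratic in the model, mirroring the paper's setting in one dimension. A minimal sketch (the particular loss L(theta) = (theta^2 - 1)^2 / 4 is an illustrative assumption):

```python
import numpy as np

def run_gd(eta, theta0=0.5, steps=2000, keep=20):
    """GD on L(theta) = (theta**2 - 1)**2 / 4; return the trailing iterates."""
    theta = theta0
    tail = []
    for t in range(steps):
        theta -= eta * (theta**3 - theta)   # exact gradient of L
        if t >= steps - keep:
            tail.append(theta)
    return np.array(tail)

for eta in [0.5, 1.2, 1.8, 1.95]:
    tail = run_gd(eta)
    print(f"eta={eta:4}: {len(np.unique(tail.round(6)))} distinct limiting values")
# Small eta: a single fixed point. Increasing eta: period-2, period-4, ...
# orbits, and eventually chaotic trajectories that never settle.
```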
From Lazy to Rich: Exact Learning Dynamics in Deep Linear Networks
Biological and artificial neural networks develop internal representations that enable them to
perform complex tasks. In artificial networks, the effectiveness of these models relies on their …
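The exact solvability in the title is easiest to appreciate in the scalar deep linear case, where the end-to-end map w2*w1 follows closed-form sigmoidal dynamics under gradient flow. A minimal sketch contrasting a small (rich) and a larger (lazier) initialization scale (the scalar reduction is an illustrative assumption, not the paper's general model):

```python
import numpy as np

def train(init_scale, target=2.0, lr=1e-3, steps=6000):
    """Two-layer scalar linear net f(x) = w2 * w1 * x, fit to y = target * x."""
    w1 = w2 = init_scale
    traj = []
    for t in range(steps):
        err = w1 * w2 - target                      # end-to-end residual
        w1, w2 = w1 - lr * err * w2, w2 - lr * err * w1
        if t % 1000 == 0:
            traj.append(w1 * w2)
    return np.round(traj, 3)

print("small init  (rich, plateau then sigmoidal):", train(0.01))
print("larger init (fast, no plateau):            ", train(1.0))
```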
Gradient descent induces alignment between weights and the empirical NTK for deep non-linear networks
Understanding the mechanisms through which neural networks extract statistics from input-
label pairs is one of the most important unsolved problems in supervised learning. Prior …
Feature learning as alignment: a structural property of gradient descent in non-linear neural networks
Understanding the mechanisms by which neural networks extract statistics from input-
label pairs through feature learning is one of the most important unsolved problems in …
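One form of the alignment studied in the two entries above (sometimes called the neural feature ansatz) is checkable numerically: after gradient descent, the first-layer matrix W1^T W1 correlates with the network's input AGOP. A minimal numpy sketch on synthetic single-index data (architecture, data model, and hyperparameters are all illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
d, h, n = 8, 64, 512
X = rng.normal(size=(n, d))
y = np.tanh(X[:, 0])                            # single-index target

W1 = rng.normal(size=(h, d)) * 0.5 / np.sqrt(d)
a = rng.normal(size=h) / np.sqrt(h)

def alignment():
    """Pearson correlation between W1^T W1 and the network's input AGOP."""
    pre = X @ W1.T
    G = ((1 - np.tanh(pre) ** 2) * a) @ W1      # per-sample df/dx, shape (n, d)
    agop, nfm = G.T @ G / n, W1.T @ W1
    u = agop.ravel() - agop.mean()
    v = nfm.ravel() - nfm.mean()
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

print(f"alignment at init:  {alignment():.3f}")
for _ in range(3000):                           # plain full-batch GD
    pre = X @ W1.T
    err = np.tanh(pre) @ a - y                  # residuals, shape (n,)
    dh = np.outer(err, a) * (1 - np.tanh(pre) ** 2)
    W1 -= 0.05 / n * dh.T @ X
    a -= 0.05 / n * np.tanh(pre).T @ err
print(f"alignment after GD: {alignment():.3f}")
```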
Get rich quick: exact solutions reveal how unbalanced initializations promote rapid feature learning
While the impressive performance of modern neural networks is often attributed to their
capacity to efficiently extract task-relevant features from data, the mechanisms underlying …
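A variant of the scalar deep-linear sketch above isolates balance rather than scale: hold the initial end-to-end map w1*w2 fixed and compare a balanced split with an unbalanced one. A minimal sketch (again an illustrative scalar reduction, not the paper's exact setting):

```python
import numpy as np

def train(w1, w2, target=2.0, lr=1e-3, steps=4000):
    """f(x) = w2 * w1 * x fit to y = target * x by gradient descent."""
    traj = []
    for t in range(steps):
        err = w1 * w2 - target
        w1, w2 = w1 - lr * err * w2, w2 - lr * err * w1
        if t % 800 == 0:
            traj.append(w1 * w2)
    return np.round(traj, 3)

# Both start with the same tiny end-to-end map w1 * w2 = 1e-4:
print("balanced   (0.01, 0.01):", train(0.01, 0.01))   # long plateau
print("unbalanced (1.00, 1e-4):", train(1.0, 1e-4))    # escapes immediately
```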
Gradient Descent on Logistic Regression with Non-Separable Data and Large Step Sizes
We study gradient descent (GD) dynamics on logistic regression problems with large,
constant step sizes. For linearly separable data, it is known that GD converges to the …
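The regimes the abstract alludes to can be probed directly in one dimension: non-separable labels give the logistic loss a finite minimizer, and the behavior of GD around it depends on the step size. A minimal sketch comparing a small and a deliberately large constant step (the data model and step sizes are illustrative assumptions; the paper characterizes the regimes precisely):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
x = rng.normal(size=n)
# Non-separable 1-D data: 20% of labels are flipped.
y = np.where(rng.random(n) < 0.2, -np.sign(x), np.sign(x))

def loss_and_grad(w):
    m = y * x * w                               # per-sample margins
    return np.mean(np.log1p(np.exp(-m))), np.mean(-y * x / (1 + np.exp(m)))

for eta in [1.0, 25.0]:
    w = 0.0
    for _ in range(200):
        _, g = loss_and_grad(w)
        w -= eta * g
    loss, _ = loss_and_grad(w)
    print(f"eta={eta:>4}: final w={w:+.3f}, final loss={loss:.4f}")
# Non-separability makes the minimizer finite; with a large step the
# iterates can keep bouncing around it rather than settling on it.
```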
Faster Adaptive Optimization via Expected Gradient Outer Product Reparameterization
A. DePavia, V. Charisopoulos, R. Willett - arXiv preprint arXiv:2502.01594, 2025 - arxiv.org
Adaptive optimization algorithms, such as Adagrad, Adam, and their variants, have found
widespread use in machine learning, signal processing and many other settings. Several …
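The reparameterization idea named in the title can be sketched on a toy quadratic: estimate the expected gradient outer product (EGOP) from sampled gradients, rotate into its eigenbasis, and run a diagonal adaptive method there, where the geometry is axis-aligned. This is a minimal illustration of the idea, not the authors' algorithm; the probe distribution, Adagrad variant, and all hyperparameters are assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
d = 5
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))        # hidden rotation
H = Q @ np.diag([10.0, 5.0, 1.0, 0.5, 0.1]) @ Q.T   # ill-conditioned Hessian

def grad(w):
    return H @ w                                # gradient of 0.5 * w^T H w

# Estimate the EGOP, E[grad grad^T], from gradients at random probe points.
W = rng.normal(size=(64, d))
egop = (W @ H).T @ (W @ H) / len(W)             # here this approximates H @ H
_, V = np.linalg.eigh(egop)                     # EGOP eigenbasis

# Reparameterize w = V z and run diagonal Adagrad on z. In the EGOP
# eigenbasis the quadratic is (nearly) axis-aligned, which is exactly the
# geometry a diagonal preconditioner handles well.
w0 = rng.normal(size=d)
print("initial loss:", round(0.5 * w0 @ H @ w0, 4))
z, acc = V.T @ w0, np.zeros(d)
for _ in range(200):
    g = V.T @ grad(V @ z)                       # gradient in z-coordinates
    acc += g**2
    z -= 0.5 * g / (np.sqrt(acc) + 1e-8)
print("final loss:  ", round(0.5 * (V @ z) @ H @ (V @ z), 4))
```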