Analyzing & Reducing the Need for Learning Rate Warmup in GPT Training

A Kosson, B Messmer, M Jaggi - Advances in Neural …, 2025 - proceedings.neurips.cc
Learning Rate Warmup is a popular heuristic for training neural networks,
especially at larger batch sizes, despite limited understanding of its benefits. Warmup …

When and why momentum accelerates SGD: An empirical study

J Fu, B Wang, H Zhang, Z Zhang, W Chen… - arXiv preprint arXiv …, 2023 - arxiv.org
Momentum has become a crucial component in deep learning optimizers, necessitating a
comprehensive understanding of when and why it accelerates stochastic gradient descent …

AdaMeM: Memory efficient momentum for Adafactor

N Vyas, D Morwani, SM Kakade - 2nd Workshop on Advancing …, 2024 - openreview.net
Adafactor is a memory-efficient algorithm that does not maintain momentum and has
near-zero memory overhead compared to gradient descent. However, it performs worse than …

Risk bounds of accelerated SGD for overparameterized linear regression

X Li, Y Deng, J Wu, D Zhou, Q Gu - arXiv preprint arXiv:2311.14222, 2023 - arxiv.org
Accelerated stochastic gradient descent (ASGD) is a workhorse in deep learning and often
achieves better generalization performance than SGD. However, existing optimization …

A quadratic synchronization rule for distributed deep learning

X Gu, K Lyu, S Arora, J Zhang, L Huang - arXiv preprint arXiv:2310.14423, 2023 - arxiv.org
In distributed deep learning with data parallelism, synchronizing gradients at each training
step can cause a huge communication overhead, especially when many nodes work …

Accelerated convergence of stochastic heavy ball method under anisotropic gradient noise

R Pan, Y Liu, X Wang, T Zhang - arXiv preprint arXiv:2312.14567, 2023 - arxiv.org
Heavy-ball momentum with decaying learning rates is widely used with SGD for optimizing
deep learning models. Despite its empirical popularity, the understanding of its …
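
Several entries on this page study the stochastic heavy-ball (SHB) update without writing it out; for orientation, here is a minimal sketch of one Polyak heavy-ball step in Python. The function name shb_step and the default lr/beta values are illustrative assumptions, not taken from any of the cited papers.

    import numpy as np

    def shb_step(x, v, g, lr=0.1, beta=0.9):
        # Heavy-ball momentum: v <- beta * v - lr * g;  x <- x + v
        v = beta * v - lr * g
        return x + v, v

    # Usage on f(x) = 0.5 * x^2, whose gradient at x is x:
    x, v = np.array([1.0]), np.array([0.0])
    for _ in range(50):
        x, v = shb_step(x, v, g=x)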

On the Performance Analysis of Momentum Method: A Frequency Domain Perspective

X Li, J Luo, Z Zheng, H Wang, L Luo, L Wen… - arXiv preprint arXiv …, 2024 - arxiv.org
Momentum-based optimizers are widely adopted for training neural networks. However, the
optimal selection of momentum coefficients remains elusive. This uncertainty impedes a …

Analyzing & Eliminating Learning Rate Warmup in GPT Pre-Training

A Kosson, B Messmer, M Jaggi - High-dimensional Learning …, 2024 - openreview.net
Learning Rate Warmup is a popular heuristic for training neural networks that downscales
early updates relative to later ones. This aids training, suggesting that the initial updates are …
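
This snippet (like the NeurIPS version listed at the top of the page) describes warmup as downscaling early updates relative to later ones. A common concrete instance is linear warmup over a fixed number of steps; a minimal sketch follows, where base_lr and warmup_steps are illustrative assumptions rather than values from the paper.

    def warmup_lr(step, base_lr=3e-4, warmup_steps=2000):
        # Scale the learning rate linearly from ~0 up to base_lr, then hold it flat.
        if step < warmup_steps:
            return base_lr * (step + 1) / warmup_steps
        return base_lr

    # Usage: warmup_lr(0) is tiny; warmup_lr(1999) and warmup_lr(5000) both return 3e-4.
    print(warmup_lr(0), warmup_lr(1999), warmup_lr(5000))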

(Accelerated) Noise-adaptive Stochastic Heavy-Ball Momentum

A Dang, R Babanezhad, S Vaswani - arXiv preprint arXiv:2401.06738, 2024 - arxiv.org
Stochastic heavy ball momentum (SHB) is commonly used to train machine learning models,
and often provides empirical improvements over stochastic gradient descent. By primarily …

Algorithm Dynamics in Modern Statistical Learning: Asymptotics, Universality, and Implicit Regularization

T Wang - 2024 - search.proquest.com
Understanding the dynamics of algorithms is crucial for characterizing the behavior of
trained models in modern statistical learning. This thesis presents a few recent results on …