Analyzing & Reducing the Need for Learning Rate Warmup in GPT Training

A Kosson, B Messmer, M Jaggi - Advances in Neural …, 2025 - proceedings.neurips.cc
Learning Rate Warmup is a popular heuristic for training neural networks,
especially at larger batch sizes, despite limited understanding of its benefits. Warmup …

When and why momentum accelerates SGD: An empirical study

J Fu, B Wang, H Zhang, Z Zhang, W Chen… - arXiv preprint arXiv …, 2023 - arxiv.org
Momentum has become a crucial component in deep learning optimizers, necessitating a
comprehensive understanding of when and why it accelerates stochastic gradient descent …

AdaMeM: Memory efficient momentum for Adafactor

N Vyas, D Morwani, SM Kakade - 2nd Workshop on Advancing …, 2024 - openreview.net
Adafactor is a memory-efficient algorithm that does not maintain momentum and has
near-zero memory overhead compared to gradient descent. However, it performs worse than …

Risk bounds of accelerated SGD for overparameterized linear regression

X Li, Y Deng, J Wu, D Zhou, Q Gu - arXiv preprint arXiv:2311.14222, 2023 - arxiv.org
Accelerated stochastic gradient descent (ASGD) is a workhorse in deep learning and often
achieves better generalization performance than SGD. However, existing optimization …

A quadratic synchronization rule for distributed deep learning

X Gu, K Lyu, S Arora, J Zhang, L Huang - arXiv preprint arXiv:2310.14423, 2023 - arxiv.org
In distributed deep learning with data parallelism, synchronizing gradients at each training
step can cause a huge communication overhead, especially when many nodes work …

Accelerated convergence of stochastic heavy ball method under anisotropic gradient noise

R Pan, Y Liu, X Wang, T Zhang - arXiv preprint arXiv:2312.14567, 2023 - arxiv.org
Heavy-ball momentum with decaying learning rates is widely used with SGD for optimizing
deep learning models. Despite its empirical popularity, the understanding of its …
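
Several entries on this page study the stochastic heavy-ball (SHB) update without writing it out; for orientation, here is a minimal sketch of one Polyak heavy-ball step in Python. The function name shb_step and the default lr/beta values are illustrative assumptions, not taken from any of the cited papers.

    import numpy as np

    def shb_step(x, v, g, lr=0.1, beta=0.9):
        # Heavy-ball momentum: v <- beta * v - lr * g;  x <- x + v
        v = beta * v - lr * g
        return x + v, v

    # Usage on f(x) = 0.5 * x^2, whose gradient at x is x:
    x, v = np.array([1.0]), np.array([0.0])
    for _ in range(50):
        x, v = shb_step(x, v, g=x)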

On the Performance Analysis of Momentum Method: A Frequency Domain Perspective

X Li, J Luo, Z Zheng, H Wang, L Luo, L Wen… - arXiv preprint arXiv …, 2024 - arxiv.org
Momentum-based optimizers are widely adopted for training neural networks. However, the
optimal selection of momentum coefficients remains elusive. This uncertainty impedes a …

Analyzing & Eliminating Learning Rate Warmup in GPT Pre-Training

A Kosson, B Messmer, M Jaggi - High-dimensional Learning …, 2024 - openreview.net
Learning Rate Warmup is a popular heuristic for training neural networks that downscales
early updates relative to later ones. This aids training, suggesting that the initial updates are …
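
This snippet (like the NeurIPS version listed at the top of the page) describes warmup as downscaling early updates relative to later ones. A common concrete instance is linear warmup over a fixed number of steps; a minimal sketch follows, where base_lr and warmup_steps are illustrative assumptions rather than values from the paper.

    def warmup_lr(step, base_lr=3e-4, warmup_steps=2000):
        # Scale the learning rate linearly from ~0 up to base_lr, then hold it flat.
        if step < warmup_steps:
            return base_lr * (step + 1) / warmup_steps
        return base_lr

    # Usage: warmup_lr(0) is tiny; warmup_lr(1999) and warmup_lr(5000) both return 3e-4.
    print(warmup_lr(0), warmup_lr(1999), warmup_lr(5000))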

(Accelerated) Noise-adaptive Stochastic Heavy-Ball Momentum

A Dang, R Babanezhad, S Vaswani - arXiv preprint arXiv:2401.06738, 2024 - arxiv.org
Stochastic heavy ball momentum (SHB) is commonly used to train machine learning models,
and often provides empirical improvements over stochastic gradient descent. By primarily …

Algorithm Dynamics in Modern Statistical Learning: Asymptotics, Universality, and Implicit Regularization

T Wang - 2024 - search.proquest.com
Understanding the dynamics of algorithms is crucial for characterizing the behavior of
trained models in modern statistical learning. This thesis presents a few recent results on …