Why transformers need Adam: A Hessian perspective

Y Zhang, C Chen, T Ding, Z Li… - Advances in Neural …, 2025 - proceedings.neurips.cc
SGD performs worse than Adam by a significant margin on Transformers, but the reason
remains unclear. In this work, we provide an explanation through the lens of Hessian: (i) …
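
For orientation, here is a minimal NumPy sketch contrasting the two update rules the snippet compares; the toy quadratic loss, its curvature values, and all hyperparameters are illustrative assumptions and are not taken from the paper.

```python
# Minimal sketch (not from the paper): SGD vs. Adam updates on a toy quadratic,
# illustrating the per-coordinate adaptive scaling that Adam applies and SGD does not.
import numpy as np

def grad(w):
    # Toy ill-conditioned quadratic loss 0.5 * w^T diag(h) w; h mimics a
    # heterogeneous curvature spectrum -- an assumption for illustration only.
    h = np.array([100.0, 1.0, 0.01])
    return h * w

def sgd_step(w, lr=1e-2):
    return w - lr * grad(w)

def adam_step(w, m, v, t, lr=1e-2, b1=0.9, b2=0.999, eps=1e-8):
    g = grad(w)
    m = b1 * m + (1 - b1) * g            # first-moment (mean) estimate
    v = b2 * v + (1 - b2) * g * g        # second-moment (uncentered variance) estimate
    m_hat = m / (1 - b1 ** t)            # bias correction
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

w_sgd = w_adam = np.array([1.0, 1.0, 1.0])
m = v = np.zeros(3)
for t in range(1, 201):
    w_sgd = sgd_step(w_sgd)
    w_adam, m, v = adam_step(w_adam, m, v, t)
print("SGD :", w_sgd)
print("Adam:", w_adam)
```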

Acceleration by stepsize hedging: Multi-step descent and the silver stepsize schedule

J Altschuler, P Parrilo - Journal of the ACM, 2023 - dl.acm.org
Can we accelerate the convergence of gradient descent without changing the algorithm—
just by judiciously choosing stepsizes? Surprisingly, we show that the answer is yes. Our …
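
As a rough illustration of the idea (gradient descent whose only change is a prescribed, nonconstant stepsize schedule), here is a hedged sketch on a least-squares objective; the placeholder schedule below is an assumption for illustration and is not the silver stepsize schedule constructed in the paper.

```python
# Illustrative sketch (not the paper's construction): gradient descent with a
# prescribed nonconstant stepsize schedule on an L-smooth convex quadratic.
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 20))
b = rng.standard_normal(50)
L = np.linalg.eigvalsh(A.T @ A).max()        # smoothness constant of 0.5*||Ax-b||^2

def gd_with_schedule(schedule, x0):
    x = x0.copy()
    for h in schedule:                       # h is the stepsize in units of 1/L
        x = x - (h / L) * (A.T @ (A @ x - b))
    return 0.5 * np.linalg.norm(A @ x - b) ** 2

x0 = np.zeros(20)
n = 15
constant = [1.0] * n                         # classical h = 1/L baseline
# Placeholder nonconstant schedule: mostly short steps with occasional longer
# ones -- an assumption for illustration only, NOT the silver stepsize schedule.
hedged = [1.0, 1.0, 3.0] * (n // 3)

print("constant steps :", gd_with_schedule(constant, x0))
print("nonconstant    :", gd_with_schedule(hedged, x0))
```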

Compute-efficient deep learning: Algorithmic trends and opportunities

BR Bartoldson, B Kailkhura, D Blalock - Journal of Machine Learning …, 2023 - jmlr.org
Although deep learning has made great progress in recent years, the exploding economic
and environmental costs of training neural networks are becoming unsustainable. To …

Provably faster gradient descent via long steps

B Grimmer - SIAM Journal on Optimization, 2024 - SIAM
This work establishes new convergence guarantees for gradient descent in smooth convex
optimization via a computer-assisted analysis technique. Our theory allows nonconstant …
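
For reference, the classical constant-stepsize baseline that results of this kind improve on: for convex f with L-Lipschitz gradient and stepsize 1/L, the textbook guarantee after N iterations is

```latex
% Classical baseline: gradient descent x_{k+1} = x_k - (1/L) \nabla f(x_k)
% on a convex, L-smooth function f with minimizer x^\star.
\[
  f(x_N) - f(x^\star) \;\le\; \frac{L \,\lVert x_0 - x^\star \rVert^2}{2N}.
\]
```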

Branch-and-bound performance estimation programming: A unified methodology for constructing optimal optimization methods

S Das Gupta, BPG Van Parys, EK Ryu - Mathematical Programming, 2024 - Springer
We present the Branch-and-Bound Performance Estimation Programming (BnB-PEP), a
unified methodology for constructing optimal first-order methods for convex and nonconvex …
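
For context, the generic performance estimation problem (PEP) that this line of work builds on can be written schematically as below; the notation follows the standard PEP literature rather than the paper itself, and N, R, and the stepsize coefficients h_{k,i} are placeholders.

```latex
% Worst-case performance of a fixed-step first-order method, posed as an
% optimization problem over an L-smooth function class F_L.
\[
  \max_{f \in \mathcal{F}_L,\; x_0,\dots,x_N}\; f(x_N) - f(x^\star)
  \quad\text{s.t.}\quad
  x_{k+1} = x_k - \sum_{i=0}^{k} h_{k,i}\, \nabla f(x_i),
  \qquad \lVert x_0 - x^\star \rVert \le R .
\]
```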

FedP3: Federated personalized and privacy-friendly network pruning under model heterogeneity

K Yi, N Gazagnadou, P Richtárik, L Lyu - arXiv preprint arXiv:2404.09816, 2024 - arxiv.org
The interest in federated learning has surged in recent research due to its unique ability to
train a global model using privacy-secured information held locally on each client. This …

On fundamental proof structures in first-order optimization

B Goujaud, A Dieuleveut… - 2023 62nd IEEE …, 2023 - ieeexplore.ieee.org
First-order optimization methods have attracted a lot of attention due to their practical
success in many applications, including in machine learning. Obtaining convergence …

Towards a better theoretical understanding of independent subnetwork training

E Shulgin, P Richtárik - arXiv preprint arXiv:2306.16484, 2023 - arxiv.org
Modern advancements in large-scale machine learning would be impossible without the
paradigm of data-parallel distributed computing. Since distributed computing with large …

Variable step sizes for iterative Jacobian-based inverse kinematics of robotic manipulators

J Colan, A Davila, Y Hasegawa - IEEE Access, 2024 - ieeexplore.ieee.org
This study evaluates the impact of step size selection on Jacobian-based inverse kinematics
(IK) for robotic manipulators. Although traditional constant step size approaches offer …
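
A minimal sketch of the general mechanism being evaluated, Jacobian-based IK with a non-constant step size; the 2-link planar arm, the pseudoinverse direction, and the backtracking rule are assumptions for illustration, not the manipulators or step-size rules studied in the paper.

```python
# Sketch (assumed 2-link planar arm): Jacobian-based IK where the step size is
# adapted by backtracking on the task-space error instead of a fixed constant.
import numpy as np

L1, L2 = 1.0, 0.8                      # link lengths (illustrative)

def fk(q):
    # End-effector position of a 2-link planar arm.
    return np.array([
        L1 * np.cos(q[0]) + L2 * np.cos(q[0] + q[1]),
        L1 * np.sin(q[0]) + L2 * np.sin(q[0] + q[1]),
    ])

def jacobian(q):
    s1, c1 = np.sin(q[0]), np.cos(q[0])
    s12, c12 = np.sin(q[0] + q[1]), np.cos(q[0] + q[1])
    return np.array([
        [-L1 * s1 - L2 * s12, -L2 * s12],
        [ L1 * c1 + L2 * c12,  L2 * c12],
    ])

def ik_variable_step(q, target, iters=100, tol=1e-6):
    for _ in range(iters):
        err = target - fk(q)
        if np.linalg.norm(err) < tol:
            break
        dq = np.linalg.pinv(jacobian(q)) @ err    # Jacobian pseudoinverse direction
        alpha = 1.0
        # Backtracking: shrink the step until the task-space error decreases.
        while alpha > 1e-4 and np.linalg.norm(target - fk(q + alpha * dq)) >= np.linalg.norm(err):
            alpha *= 0.5
        q = q + alpha * dq
    return q

q0 = np.array([0.3, 0.3])
target = np.array([1.2, 0.6])
q_sol = ik_variable_step(q0, target)
print("solution q:", q_sol, "reached:", fk(q_sol))
```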

Block acceleration without momentum: On optimal stepsizes of block gradient descent for least-squares

L Peng, W Yin - arXiv preprint arXiv:2405.16020, 2024 - arxiv.org
Block coordinate descent is a powerful algorithmic template suitable for big data
optimization. This template admits a lot of variants including block gradient descent (BGD) …
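
A short sketch of the BGD template on least-squares with the common per-block stepsize 1/L_i (the block smoothness constant); the problem data and block split are illustrative assumptions, and this is the standard baseline rather than the optimal stepsizes derived in the paper.

```python
# Sketch of block gradient descent (BGD) on 0.5*||Ax - b||^2 with per-block
# stepsize 1/L_i, where L_i is the largest eigenvalue of A_i^T A_i.
import numpy as np

rng = np.random.default_rng(2)
m, n = 200, 60
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

blocks = np.array_split(np.arange(n), 4)              # 4 coordinate blocks
L_block = [np.linalg.eigvalsh(A[:, idx].T @ A[:, idx]).max() for idx in blocks]

x = np.zeros(n)
for epoch in range(50):
    for idx, L_i in zip(blocks, L_block):
        r = A @ x - b                                  # current residual
        g_i = A[:, idx].T @ r                          # gradient w.r.t. block i
        x[idx] -= (1.0 / L_i) * g_i                    # block update with step 1/L_i
    if epoch % 10 == 0:
        print(f"epoch {epoch}: 0.5*||Ax-b||^2 = {0.5*np.linalg.norm(A@x-b)**2:.4f}")
```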