Heavy-tailed class imbalance and why Adam outperforms gradient descent on language models

F Kunstner, A Milligan, R Yadav… - The Thirty-eighth Annual Conference on Neural Information Processing Systems (NeurIPS), 2024 - proceedings.neurips.cc
Adam has been shown to outperform gradient descent on large language models by a
larger margin than on other tasks, but it is unclear why. We show that a key factor in this …
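A minimal, self-contained sketch (an illustration, not the authors' code or experimental setup): softmax regression on synthetic data whose class frequencies follow a Zipf law, trained with full-batch gradient descent and with Adam, reporting the final training loss of each. All sizes and hyperparameters below are arbitrary choices for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)
n_classes, dim, n = 100, 32, 5000

# Heavy-tailed (Zipf-like) class frequencies: p(k) proportional to 1/k.
p = 1.0 / np.arange(1, n_classes + 1)
p /= p.sum()
y = rng.choice(n_classes, size=n, p=p)
X = rng.normal(size=(n, dim))

def loss_and_grad(W):
    """Mean cross-entropy of a linear softmax model and its gradient in W."""
    logits = X @ W
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    loss = -np.log(probs[np.arange(n), y] + 1e-12).mean()
    probs[np.arange(n), y] -= 1.0                    # P - onehot(y)
    return loss, X.T @ probs / n

def train(opt, steps=500, lr=0.1):
    W = np.zeros((dim, n_classes))
    m, v = np.zeros_like(W), np.zeros_like(W)
    loss = np.inf
    for t in range(1, steps + 1):
        loss, g = loss_and_grad(W)
        if opt == "gd":                              # plain full-batch gradient descent
            W -= lr * g
        else:                                        # textbook Adam
            m = 0.9 * m + 0.1 * g
            v = 0.999 * v + 0.001 * g**2
            m_hat, v_hat = m / (1 - 0.9**t), v / (1 - 0.999**t)
            W -= lr * m_hat / (np.sqrt(v_hat) + 1e-8)
    return loss

print("final training loss, GD  :", train("gd"))
print("final training loss, Adam:", train("adam"))
```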

Adam-mini: Use fewer learning rates to gain more

Y Zhang, C Chen, Z Li, T Ding, C Wu… - arXiv preprint arXiv …, 2024 - arxiv.org
We propose Adam-mini, an optimizer that achieves on-par or better performance than
AdamW with a 45% to 50% smaller memory footprint. Adam-mini reduces memory by cutting …
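A hedged sketch of the general idea (not the released Adam-mini implementation; the block partitioning here, one block per parameter tensor, is only an assumption for illustration): keep Adam's first moment as usual but replace the per-parameter second moment with a single scalar per block, so the v-state shrinks from O(#parameters) to O(#blocks).

```python
import numpy as np

def adam_mini_like_step(params, grads, m, v, t, lr=1e-3,
                        b1=0.9, b2=0.999, eps=1e-8):
    """params, grads, m: lists of arrays (one per block); v: list of scalars."""
    for i, (p, g) in enumerate(zip(params, grads)):
        m[i] = b1 * m[i] + (1 - b1) * g
        # One second-moment scalar per block: the block-wise mean of g^2.
        v[i] = b2 * v[i] + (1 - b2) * float(np.mean(g * g))
        m_hat = m[i] / (1 - b1 ** t)
        v_hat = v[i] / (1 - b2 ** t)
        p -= lr * m_hat / (np.sqrt(v_hat) + eps)

# toy usage with two "blocks"
params = [np.ones((4, 4)), np.ones(4)]
m = [np.zeros_like(p) for p in params]
v = [0.0 for _ in params]
grads = [0.1 * np.ones_like(p) for p in params]
adam_mini_like_step(params, grads, m, v, t=1)
```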

Provable adaptivity of Adam under non-uniform smoothness

B Wang, Y Zhang, H Zhang, Q Meng, R Sun… - Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), 2024 - dl.acm.org
Adam is widely adopted in practical applications due to its fast convergence. However, its
theoretical analysis is still far from satisfactory. Existing convergence analyses for Adam rely …
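For context, a standard gradient-difference form of the non-uniform ((L_0, L_1)-) smoothness condition used in this line of analysis, introduced by Zhang et al. (2020); the paper's exact assumption may be a coordinate-wise or generalized variant.

```latex
% (L_0, L_1)-smoothness: the local smoothness may grow with the gradient norm,
% and the condition reduces to standard L-smoothness when L_1 = 0.
\[
  \|\nabla f(x) - \nabla f(y)\|
  \;\le\; \bigl(L_0 + L_1 \|\nabla f(x)\|\bigr)\,\|x - y\|
  \qquad \text{whenever } \|x - y\| \le \tfrac{1}{L_1}.
\]
```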

Encryption-friendly LLM architecture

D Rho, T Kim, M Park, JW Kim, H Chae… - arXiv preprint arXiv …, 2024 - arxiv.org
Large language models (LLMs) offer personalized responses based on user interactions,
but this use case raises serious privacy concerns. Homomorphic encryption (HE) is a …

APOLLO: SGD-like memory, AdamW-level performance

H Zhu, Z Zhang, W Cong, X Liu, S Park… - arXiv preprint arXiv …, 2024 - arxiv.org
Large language models (LLMs) are notoriously memory-intensive during training,
particularly with the popular AdamW optimizer. This memory burden necessitates using …
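A back-of-the-envelope illustration of the memory burden in question (a rough sketch assuming fp32 optimizer states; it counts only the optimizer state, not parameters, gradients, or activations):

```python
def optimizer_state_gib(n_params, floats_per_param):
    """Optimizer-state memory in GiB, assuming 4 bytes per fp32 state value."""
    return n_params * floats_per_param * 4 / 2**30

n = 7_000_000_000  # e.g. a 7B-parameter model
print("AdamW (m and v): %.1f GiB" % optimizer_state_gib(n, 2))
print("SGD w/ momentum: %.1f GiB" % optimizer_state_gib(n, 1))
print("plain SGD:       %.1f GiB" % optimizer_state_gib(n, 0))
```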

Adam with model exponential moving average is effective for nonconvex optimization

K Ahn, A Cutkosky - arXiv preprint arXiv:2405.18199, 2024 - arxiv.org
In this work, we offer a theoretical analysis of two modern optimization techniques for
training large and complex models: (i) adaptive optimization algorithms, such as Adam, and …
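A minimal sketch of the second technique, a model exponential moving average maintained alongside the optimizer (an illustration, not the paper's algorithm or analysis; `adam_step` in the usage comment is a hypothetical stand-in for any Adam implementation):

```python
import numpy as np

def ema_update(ema_params, params, decay=0.999):
    """In-place EMA of the model parameters: ema <- decay*ema + (1-decay)*param."""
    for e, p in zip(ema_params, params):
        e *= decay
        e += (1.0 - decay) * p

# usage inside a training loop; the EMA weights, not the raw iterates,
# are what one would evaluate or deploy:
#     adam_step(params, grads, state)     # hypothetical Adam update
#     ema_update(ema_params, params)
```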

No More Adam: Learning Rate Scaling at Initialization is All You Need

M Xu, L Xiang, X Cai, H Wen - arXiv preprint arXiv:2412.11768, 2024 - arxiv.org
In this work, we question the necessity of adaptive gradient methods for training deep neural
networks. SGD-SaI is a simple yet effective enhancement to stochastic gradient descent with …
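A heavily hedged sketch of the general idea only, not the paper's exact recipe: compute a per-block scale from gradient statistics once at initialization, freeze it, and then run plain SGD with momentum using those scaled step sizes. The statistic below is an illustrative stand-in, not necessarily the paper's gradient signal-to-noise ratio.

```python
import numpy as np

def init_scales(grad_blocks, eps=1e-8):
    """One scalar per block, computed once from gradients at initialization.
    Uses mean |g| over its std within the block as an illustrative statistic."""
    return [np.abs(g).mean() / (g.std() + eps) for g in grad_blocks]

def sgd_momentum_step(params, grads, bufs, scales, lr=1e-2, mu=0.9):
    """Plain SGD with momentum; the frozen per-block scales replace any
    adaptive per-parameter state."""
    for p, g, buf, s in zip(params, grads, bufs, scales):
        buf *= mu
        buf += g
        p -= lr * s * buf

# toy usage with two blocks
rng = np.random.default_rng(0)
params = [rng.normal(size=(8, 8)), rng.normal(size=8)]
bufs = [np.zeros_like(p) for p in params]
grads0 = [rng.normal(size=p.shape) for p in params]   # gradients at initialization
scales = init_scales(grads0)                          # computed once, then frozen
sgd_momentum_step(params, grads0, bufs, scales)
```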

Understanding Adam Requires Better Rotation Dependent Assumptions

L Maes, TH Zhang, A Jolicoeur-Martineau… - arXiv preprint arXiv …, 2024 - arxiv.org
Despite its widespread adoption, Adam's advantage over Stochastic Gradient Descent
(SGD) lacks a comprehensive theoretical explanation. This paper investigates Adam's …
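A small numerical check of the rotation-dependence point (a sketch, not the paper's experiment): gradient descent commutes with an orthogonal change of coordinates, while a single bias-corrected Adam step generally does not, because Adam normalizes each coordinate separately.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
A = rng.normal(size=(d, d))
A = A @ A.T + np.eye(d)                       # SPD matrix: f(x) = x^T A x / 2
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))  # random orthogonal change of coordinates

x0 = rng.normal(size=d)
g = A @ x0                                    # gradient of f at x0
g_rot = (Q @ A @ Q.T) @ (Q @ x0)              # gradient of the rotated objective at Q x0

lr, eps = 0.1, 1e-8
gd = x0 - lr * g
gd_rot = Q @ x0 - lr * g_rot
adam = x0 - lr * g / (np.abs(g) + eps)               # first bias-corrected Adam step
adam_rot = Q @ x0 - lr * g_rot / (np.abs(g_rot) + eps)

print("GD   equivariance error:", np.linalg.norm(Q @ gd - gd_rot))      # ~ 0
print("Adam equivariance error:", np.linalg.norm(Q @ adam - adam_rot))  # > 0
```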

Sketched Adaptive Federated Deep Learning: A Sharp Convergence Analysis

Z Chen, Q Li, A Banerjee - arXiv preprint arXiv:2411.06770, 2024 - arxiv.org
Combining gradient compression methods (e.g., CountSketch, quantization) and adaptive
optimizers (e.g., Adam, AMSGrad) is a desirable goal in federated learning (FL), with potential …
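For concreteness, a minimal single-row CountSketch of a gradient vector (an illustration of the kind of compression involved, not the paper's method or analysis; practical CountSketch usually takes a median over several independent rows to reduce variance).

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 10_000, 500                        # gradient dimension, sketch size
bucket = rng.integers(0, k, size=d)       # hash h: coordinate -> bucket
sign = rng.choice([-1.0, 1.0], size=d)    # sign hash s: coordinate -> {-1, +1}

def sketch(g):
    S = np.zeros(k)
    np.add.at(S, bucket, sign * g)        # S[h(i)] += s(i) * g[i]
    return S

def unsketch(S):
    return sign * S[bucket]               # unbiased estimate of each g[i]

g = rng.normal(size=d)
g_hat = unsketch(sketch(g))
print("relative L2 error:", np.linalg.norm(g_hat - g) / np.linalg.norm(g))
```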

AdaGrad under Anisotropic Smoothness

Y Liu, R Pan, T Zhang - arXiv preprint arXiv:2406.15244, 2024 - arxiv.org
Adaptive gradient methods have been widely adopted in training large-scale deep neural
networks, especially large foundation models. Despite the huge success in practice, their …
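For reference, the diagonal AdaGrad update that such analyses concern, written coordinate-wise with stochastic gradient g_t; anisotropic smoothness, loosely, allows a different smoothness constant per coordinate or block, which is what per-coordinate step sizes can exploit.

```latex
\[
  v_{t,i} = v_{t-1,i} + g_{t,i}^{2},
  \qquad
  x_{t+1,i} = x_{t,i} - \frac{\eta}{\sqrt{v_{t,i}} + \epsilon}\, g_{t,i}.
\]
```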