Heavy-tailed class imbalance and why Adam outperforms gradient descent on language models

F Kunstner, A Milligan, R Yadav… - The Thirty-eighth Annual Conference on Neural Information Processing Systems (NeurIPS), 2024 - proceedings.neurips.cc
Adam has been shown to outperform gradient descent on large language models by a
larger margin than on other tasks, but it is unclear why. We show that a key factor in this …
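A minimal, self-contained sketch (an illustration, not the authors' code or experimental setup): softmax regression on synthetic data whose class frequencies follow a Zipf law, trained with full-batch gradient descent and with Adam, reporting the final training loss of each. All sizes and hyperparameters below are arbitrary choices for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)
n_classes, dim, n = 100, 32, 5000

# Heavy-tailed (Zipf-like) class frequencies: p(k) proportional to 1/k.
p = 1.0 / np.arange(1, n_classes + 1)
p /= p.sum()
y = rng.choice(n_classes, size=n, p=p)
X = rng.normal(size=(n, dim))

def loss_and_grad(W):
    """Mean cross-entropy of a linear softmax model and its gradient in W."""
    logits = X @ W
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    loss = -np.log(probs[np.arange(n), y] + 1e-12).mean()
    probs[np.arange(n), y] -= 1.0                    # P - onehot(y)
    return loss, X.T @ probs / n

def train(opt, steps=500, lr=0.1):
    W = np.zeros((dim, n_classes))
    m, v = np.zeros_like(W), np.zeros_like(W)
    loss = np.inf
    for t in range(1, steps + 1):
        loss, g = loss_and_grad(W)
        if opt == "gd":                              # plain full-batch gradient descent
            W -= lr * g
        else:                                        # textbook Adam
            m = 0.9 * m + 0.1 * g
            v = 0.999 * v + 0.001 * g**2
            m_hat, v_hat = m / (1 - 0.9**t), v / (1 - 0.999**t)
            W -= lr * m_hat / (np.sqrt(v_hat) + 1e-8)
    return loss

print("final training loss, GD  :", train("gd"))
print("final training loss, Adam:", train("adam"))
```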

Adam-mini: Use fewer learning rates to gain more

Y Zhang, C Chen, Z Li, T Ding, C Wu… - arXiv preprint arXiv …, 2024 - arxiv.org
We propose Adam-mini, an optimizer that achieves on-par or better performance than
AdamW with a 45% to 50% smaller memory footprint. Adam-mini reduces memory by cutting …
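A hedged sketch of the general idea (not the released Adam-mini implementation; the block partitioning here, one block per parameter tensor, is only an assumption for illustration): keep Adam's first moment as usual but replace the per-parameter second moment with a single scalar per block, so the v-state shrinks from O(#parameters) to O(#blocks).

```python
import numpy as np

def adam_mini_like_step(params, grads, m, v, t, lr=1e-3,
                        b1=0.9, b2=0.999, eps=1e-8):
    """params, grads, m: lists of arrays (one per block); v: list of scalars."""
    for i, (p, g) in enumerate(zip(params, grads)):
        m[i] = b1 * m[i] + (1 - b1) * g
        # One second-moment scalar per block: the block-wise mean of g^2.
        v[i] = b2 * v[i] + (1 - b2) * float(np.mean(g * g))
        m_hat = m[i] / (1 - b1 ** t)
        v_hat = v[i] / (1 - b2 ** t)
        p -= lr * m_hat / (np.sqrt(v_hat) + eps)

# toy usage with two "blocks"
params = [np.ones((4, 4)), np.ones(4)]
m = [np.zeros_like(p) for p in params]
v = [0.0 for _ in params]
grads = [0.1 * np.ones_like(p) for p in params]
adam_mini_like_step(params, grads, m, v, t=1)
```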

Provable adaptivity of Adam under non-uniform smoothness

B Wang, Y Zhang, H Zhang, Q Meng, R Sun… - Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), 2024 - dl.acm.org
Adam is widely adopted in practical applications due to its fast convergence. However, its
theoretical analysis is still far from satisfactory. Existing convergence analyses for Adam rely …
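For context, a standard gradient-difference form of the non-uniform ((L_0, L_1)-) smoothness condition used in this line of analysis, introduced by Zhang et al. (2020); the paper's exact assumption may be a coordinate-wise or generalized variant.

```latex
% (L_0, L_1)-smoothness: the local smoothness may grow with the gradient norm,
% and the condition reduces to standard L-smoothness when L_1 = 0.
\[
  \|\nabla f(x) - \nabla f(y)\|
  \;\le\; \bigl(L_0 + L_1 \|\nabla f(x)\|\bigr)\,\|x - y\|
  \qquad \text{whenever } \|x - y\| \le \tfrac{1}{L_1}.
\]
```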

Encryption-friendly LLM architecture

D Rho, T Kim, M Park, JW Kim, H Chae… - arXiv preprint arXiv …, 2024 - arxiv.org
Large language models (LLMs) offer personalized responses based on user interactions,
but this use case raises serious privacy concerns. Homomorphic encryption (HE) is a …

APOLLO: SGD-like memory, AdamW-level performance

H Zhu, Z Zhang, W Cong, X Liu, S Park… - arXiv preprint arXiv …, 2024 - arxiv.org
Large language models (LLMs) are notoriously memory-intensive during training,
particularly with the popular AdamW optimizer. This memory burden necessitates using …
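A back-of-the-envelope illustration of the memory burden in question (a rough sketch assuming fp32 optimizer states; it counts only the optimizer state, not parameters, gradients, or activations):

```python
def optimizer_state_gib(n_params, floats_per_param):
    """Optimizer-state memory in GiB, assuming 4 bytes per fp32 state value."""
    return n_params * floats_per_param * 4 / 2**30

n = 7_000_000_000  # e.g. a 7B-parameter model
print("AdamW (m and v): %.1f GiB" % optimizer_state_gib(n, 2))
print("SGD w/ momentum: %.1f GiB" % optimizer_state_gib(n, 1))
print("plain SGD:       %.1f GiB" % optimizer_state_gib(n, 0))
```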

Adam with model exponential moving average is effective for nonconvex optimization

K Ahn, A Cutkosky - arXiv preprint arXiv:2405.18199, 2024 - arxiv.org
In this work, we offer a theoretical analysis of two modern optimization techniques for
training large and complex models: (i) adaptive optimization algorithms, such as Adam, and …
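A minimal sketch of the second technique, a model exponential moving average maintained alongside the optimizer (an illustration, not the paper's algorithm or analysis; `adam_step` in the usage comment is a hypothetical stand-in for any Adam implementation):

```python
import numpy as np

def ema_update(ema_params, params, decay=0.999):
    """In-place EMA of the model parameters: ema <- decay*ema + (1-decay)*param."""
    for e, p in zip(ema_params, params):
        e *= decay
        e += (1.0 - decay) * p

# usage inside a training loop; the EMA weights, not the raw iterates,
# are what one would evaluate or deploy:
#     adam_step(params, grads, state)     # hypothetical Adam update
#     ema_update(ema_params, params)
```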

No More Adam: Learning Rate Scaling at Initialization is All You Need

M Xu, L Xiang, X Cai, H Wen - arXiv preprint arXiv:2412.11768, 2024 - arxiv.org
In this work, we question the necessity of adaptive gradient methods for training deep neural
networks. SGD-SaI is a simple yet effective enhancement to stochastic gradient descent with …
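A heavily hedged sketch of the general idea only, not the paper's exact recipe: compute a per-block scale from gradient statistics once at initialization, freeze it, and then run plain SGD with momentum using those scaled step sizes. The statistic below is an illustrative stand-in, not necessarily the paper's gradient signal-to-noise ratio.

```python
import numpy as np

def init_scales(grad_blocks, eps=1e-8):
    """One scalar per block, computed once from gradients at initialization.
    Uses mean |g| over its std within the block as an illustrative statistic."""
    return [np.abs(g).mean() / (g.std() + eps) for g in grad_blocks]

def sgd_momentum_step(params, grads, bufs, scales, lr=1e-2, mu=0.9):
    """Plain SGD with momentum; the frozen per-block scales replace any
    adaptive per-parameter state."""
    for p, g, buf, s in zip(params, grads, bufs, scales):
        buf *= mu
        buf += g
        p -= lr * s * buf

# toy usage with two blocks
rng = np.random.default_rng(0)
params = [rng.normal(size=(8, 8)), rng.normal(size=8)]
bufs = [np.zeros_like(p) for p in params]
grads0 = [rng.normal(size=p.shape) for p in params]   # gradients at initialization
scales = init_scales(grads0)                          # computed once, then frozen
sgd_momentum_step(params, grads0, bufs, scales)
```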

Understanding Adam Requires Better Rotation Dependent Assumptions

L Maes, TH Zhang, A Jolicoeur-Martineau… - arXiv preprint arXiv …, 2024 - arxiv.org
Despite its widespread adoption, Adam's advantage over Stochastic Gradient Descent
(SGD) lacks a comprehensive theoretical explanation. This paper investigates Adam's …
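A small numerical check of the rotation-dependence point (a sketch, not the paper's experiment): gradient descent commutes with an orthogonal change of coordinates, while a single bias-corrected Adam step generally does not, because Adam normalizes each coordinate separately.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
A = rng.normal(size=(d, d))
A = A @ A.T + np.eye(d)                       # SPD matrix: f(x) = x^T A x / 2
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))  # random orthogonal change of coordinates

x0 = rng.normal(size=d)
g = A @ x0                                    # gradient of f at x0
g_rot = (Q @ A @ Q.T) @ (Q @ x0)              # gradient of the rotated objective at Q x0

lr, eps = 0.1, 1e-8
gd = x0 - lr * g
gd_rot = Q @ x0 - lr * g_rot
adam = x0 - lr * g / (np.abs(g) + eps)               # first bias-corrected Adam step
adam_rot = Q @ x0 - lr * g_rot / (np.abs(g_rot) + eps)

print("GD   equivariance error:", np.linalg.norm(Q @ gd - gd_rot))      # ~ 0
print("Adam equivariance error:", np.linalg.norm(Q @ adam - adam_rot))  # > 0
```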

Sketched Adaptive Federated Deep Learning: A Sharp Convergence Analysis

Z Chen, Q Li, A Banerjee - arXiv preprint arXiv:2411.06770, 2024 - arxiv.org
Combining gradient compression methods (e.g., CountSketch, quantization) and adaptive
optimizers (e.g., Adam, AMSGrad) is a desirable goal in federated learning (FL), with potential …
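For concreteness, a minimal single-row CountSketch of a gradient vector (an illustration of the kind of compression involved, not the paper's method or analysis; practical CountSketch usually takes a median over several independent rows to reduce variance).

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 10_000, 500                        # gradient dimension, sketch size
bucket = rng.integers(0, k, size=d)       # hash h: coordinate -> bucket
sign = rng.choice([-1.0, 1.0], size=d)    # sign hash s: coordinate -> {-1, +1}

def sketch(g):
    S = np.zeros(k)
    np.add.at(S, bucket, sign * g)        # S[h(i)] += s(i) * g[i]
    return S

def unsketch(S):
    return sign * S[bucket]               # unbiased estimate of each g[i]

g = rng.normal(size=d)
g_hat = unsketch(sketch(g))
print("relative L2 error:", np.linalg.norm(g_hat - g) / np.linalg.norm(g))
```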

AdaGrad under Anisotropic Smoothness

Y Liu, R Pan, T Zhang - arXiv preprint arXiv:2406.15244, 2024 - arxiv.org
Adaptive gradient methods have been widely adopted in training large-scale deep neural
networks, especially large foundation models. Despite the huge success in practice, their …
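For reference, the diagonal AdaGrad update that such analyses concern, written coordinate-wise with stochastic gradient g_t; anisotropic smoothness, loosely, allows a different smoothness constant per coordinate or block, which is what per-coordinate step sizes can exploit.

```latex
\[
  v_{t,i} = v_{t-1,i} + g_{t,i}^{2},
  \qquad
  x_{t+1,i} = x_{t,i} - \frac{\eta}{\sqrt{v_{t,i}} + \epsilon}\, g_{t,i}.
\]
```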