Heavy-tailed class imbalance and why Adam outperforms gradient descent on language models
Adam has been shown to outperform gradient descent on large language models by a larger margin than on other tasks, but it is unclear why. We show that a key factor in this …
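The snippet points to heavy-tailed class imbalance, as in Zipf-distributed token frequencies, as the key factor. Below is a toy probe of that setting, my own construction rather than the paper's experiment: labels drawn from a Zipf-like distribution, a linear softmax classifier, and the same data stream optimized with plain gradient descent and with Adam. The learning rates and problem sizes are illustrative assumptions.

```python
# Toy probe (not the paper's setup): heavy-tailed (Zipf-like) class labels,
# linear softmax model, trained with plain GD vs. Adam on the same stream.
import numpy as np

rng = np.random.default_rng(0)
n_classes, dim, steps, batch = 200, 32, 2000, 64

# Zipf-like class probabilities: a few frequent classes, a long tail.
p = 1.0 / np.arange(1, n_classes + 1)
p /= p.sum()
class_means = rng.normal(size=(n_classes, dim))

def sample_batch():
    y = rng.choice(n_classes, size=batch, p=p)
    x = class_means[y] + rng.normal(size=(batch, dim))
    return x, y

def loss_and_grad(W, x, y):
    # Softmax cross-entropy loss and its gradient w.r.t. W.
    logits = x @ W.T
    logits -= logits.max(axis=1, keepdims=True)
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    loss = -np.mean(np.log(probs[np.arange(len(y)), y] + 1e-12))
    probs[np.arange(len(y)), y] -= 1.0
    return loss, probs.T @ x / len(y)

def train(use_adam, lr):
    W = np.zeros((n_classes, dim))
    m, v = np.zeros_like(W), np.zeros_like(W)
    for t in range(1, steps + 1):
        x, y = sample_batch()
        loss, g = loss_and_grad(W, x, y)
        if use_adam:
            m = 0.9 * m + 0.1 * g
            v = 0.999 * v + 0.001 * g ** 2
            W -= lr * (m / (1 - 0.9 ** t)) / (np.sqrt(v / (1 - 0.999 ** t)) + 1e-8)
        else:
            W -= lr * g
    return loss

print("GD   final batch loss:", train(False, lr=0.1))
print("Adam final batch loss:", train(True, lr=1e-2))
```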
Adam-mini: Use fewer learning rates to gain more
We propose Adam-mini, an optimizer that achieves on-par or better performance than AdamW with 45% to 50% less memory footprint. Adam-mini reduces memory by cutting …
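The memory saving comes from keeping fewer learning rates, i.e. fewer second-moment entries. A minimal sketch of that general idea is below: one shared second-moment scalar per parameter block instead of one per parameter. The block partitioning and update rule here are simplified assumptions, not the paper's exact algorithm.

```python
# Simplified sketch of the "fewer learning rates" idea: one second-moment
# scalar per parameter block instead of one per parameter.
import numpy as np

class BlockwiseAdam:
    def __init__(self, param_blocks, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
        self.lr, self.b1, self.b2, self.eps = lr, beta1, beta2, eps
        self.m = [np.zeros_like(p) for p in param_blocks]  # per-parameter first moment
        self.v = [0.0 for _ in param_blocks]               # ONE second-moment scalar per block
        self.t = 0

    def step(self, param_blocks, grad_blocks):
        self.t += 1
        for i, (p, g) in enumerate(zip(param_blocks, grad_blocks)):
            self.m[i] = self.b1 * self.m[i] + (1 - self.b1) * g
            # Shared second moment: mean of squared gradients over the block.
            self.v[i] = self.b2 * self.v[i] + (1 - self.b2) * float(np.mean(g ** 2))
            m_hat = self.m[i] / (1 - self.b1 ** self.t)
            v_hat = self.v[i] / (1 - self.b2 ** self.t)
            p -= self.lr * m_hat / (np.sqrt(v_hat) + self.eps)

# Toy usage: minimize 0.5*||W x - y||^2 with W treated as a single block.
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 4))
x, y = rng.normal(size=4), rng.normal(size=4)
opt = BlockwiseAdam([W], lr=1e-2)
for _ in range(500):
    r = W @ x - y
    grad_W = np.outer(r, x)              # gradient of 0.5*||W x - y||^2 w.r.t. W
    opt.step([W], [grad_W])
print("final loss:", float(0.5 * (W @ x - y) @ (W @ x - y)))
```

The memory trade-off is visible in the state: `self.v` stores one float per block rather than a full array the size of the parameters.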
Provable adaptivity of Adam under non-uniform smoothness
Adam is widely adopted in practical applications due to its fast convergence. However, its theoretical analysis is still far from satisfactory. Existing convergence analyses for Adam rely …
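For context, "non-uniform smoothness" in this line of analysis usually refers to a generalized smoothness condition in which the local smoothness constant may grow with the gradient norm; whether the paper uses exactly this (L0, L1) form is an assumption on my part.

```latex
% (L0, L1)-smoothness, a common formalization of non-uniform smoothness:
\[
\left\lVert \nabla^{2} f(x) \right\rVert \;\le\; L_{0} + L_{1}\,\lVert \nabla f(x) \rVert ,
\]
% in contrast to standard $L$-smoothness,
% $\lVert \nabla^{2} f(x) \rVert \le L$, with a single global constant.
```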
Encryption-friendly LLM architecture
Large language models (LLMs) offer personalized responses based on user interactions, but this use case raises serious privacy concerns. Homomorphic encryption (HE) is a …
Apollo: SGD-like memory, AdamW-level performance
Large language models (LLMs) are notoriously memory-intensive during training, particularly with the popular AdamW optimizer. This memory burden necessitates using …
Adam with model exponential moving average is effective for nonconvex optimization
In this work, we offer a theoretical analysis of two modern optimization techniques for training large and complex models: (i) adaptive optimization algorithms, such as Adam, and …
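The second ingredient the entry refers to is a model exponential moving average (EMA): a slowly updated copy of the weights maintained alongside the optimizer and used for evaluation. A minimal sketch of pairing a standard Adam update with a weight EMA follows; the decay value and toy objective are illustrative assumptions, not the paper's setup.

```python
# Minimal sketch: standard Adam update plus an exponential moving average
# of the model weights, the EMA copy being the one used for evaluation.
import numpy as np

def adam_step(p, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    m[:] = b1 * m + (1 - b1) * g
    v[:] = b2 * v + (1 - b2) * g ** 2
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    p -= lr * m_hat / (np.sqrt(v_hat) + eps)

rng = np.random.default_rng(0)
theta = rng.normal(size=10)            # "online" parameters updated by Adam
theta_ema = theta.copy()               # EMA copy used for evaluation
m, v = np.zeros_like(theta), np.zeros_like(theta)
ema_decay = 0.99                       # assumed decay, not from the paper

target = np.ones(10)
for t in range(1, 1001):
    # Noisy gradient of 0.5*||theta - target||^2.
    g = theta - target + 0.5 * rng.normal(size=10)
    adam_step(theta, g, m, v, t)
    theta_ema = ema_decay * theta_ema + (1 - ema_decay) * theta

print("online error:", np.linalg.norm(theta - target))
print("EMA    error:", np.linalg.norm(theta_ema - target))
```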
No More Adam: Learning Rate Scaling at Initialization is All You Need
In this work, we question the necessity of adaptive gradient methods for training deep neural networks. SGD-SaI is a simple yet effective enhancement to stochastic gradient descent with …
Understanding Adam Requires Better Rotation Dependent Assumptions
Despite its widespread adoption, Adam's advantage over Stochastic Gradient Descent (SGD) lacks a comprehensive theoretical explanation. This paper investigates Adam's …
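The rotation dependence in the title can be seen numerically. The small check below is my own demonstration, not the paper's: gradient descent's loss curve on a quadratic is unchanged when the coordinates are rotated, while Adam's generally is not, because its per-coordinate scaling is tied to the chosen axes.

```python
# Numerical check of rotation dependence: run GD and Adam on a quadratic
# and on the same quadratic expressed in rotated coordinates.
import numpy as np

def gd(grad_fn, x0, lr, steps):
    x = x0.copy()
    for _ in range(steps):
        x -= lr * grad_fn(x)
    return x

def adam(grad_fn, x0, lr, steps, b1=0.9, b2=0.999, eps=1e-8):
    x = x0.copy()
    m, v = np.zeros_like(x), np.zeros_like(x)
    for t in range(1, steps + 1):
        g = grad_fn(x)
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * g ** 2
        x -= lr * (m / (1 - b1 ** t)) / (np.sqrt(v / (1 - b2 ** t)) + eps)
    return x

A = np.diag([1.0, 100.0])                      # ill-conditioned quadratic 0.5*x^T A x
theta = np.pi / 4
Q = np.array([[np.cos(theta), -np.sin(theta)],  # 45-degree rotation
              [np.sin(theta),  np.cos(theta)]])
B = Q.T @ A @ Q                                 # same quadratic in rotated coordinates

x0 = np.array([1.0, 1.0])
loss_A = lambda x: 0.5 * x @ A @ x
loss_B = lambda y: 0.5 * y @ B @ y

for name, opt in [("GD", lambda f, x: gd(f, x, 1e-2, 200)),
                  ("Adam", lambda f, x: adam(f, x, 1e-1, 200))]:
    xA = opt(lambda x: A @ x, x0)
    xB = opt(lambda y: B @ y, Q.T @ x0)         # same start point, rotated coordinates
    print(f"{name}: loss(original) = {loss_A(xA):.6f}, loss(rotated) = {loss_B(xB):.6f}")
```

The two GD losses match to numerical precision; the two Adam losses typically do not, which is the sense in which explaining Adam requires rotation-dependent assumptions.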
Sketched Adaptive Federated Deep Learning: A Sharp Convergence Analysis
Z. Chen, Q. Li, A. Banerjee - arXiv preprint arXiv:2411.06770, 2024 - arxiv.org
Combining gradient compression methods (e.g., CountSketch, quantization) and adaptive optimizers (e.g., Adam, AMSGrad) is a desirable goal in federated learning (FL), with potential …
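A minimal CountSketch of a gradient vector, the compression primitive the entry refers to, is sketched below. The hash construction and how the sketch would be combined with Adam/AMSGrad in federated aggregation are simplified assumptions, not the paper's scheme.

```python
# CountSketch compression of a gradient vector and its unbiased recovery.
import numpy as np

rng = np.random.default_rng(0)
d, k = 10_000, 512                      # gradient dimension, sketch size

# Random hash bucket and random sign for every coordinate.
bucket = rng.integers(0, k, size=d)
sign = rng.choice([-1.0, 1.0], size=d)

def sketch(g):
    """Compress g (length d) into a length-k CountSketch."""
    s = np.zeros(k)
    np.add.at(s, bucket, sign * g)
    return s

def unsketch(s):
    """Unbiased estimate of the original vector from its sketch."""
    return sign * s[bucket]

g = np.zeros(d)
g[rng.choice(d, size=50, replace=False)] = rng.normal(size=50)   # sparse-ish gradient

g_hat = unsketch(sketch(g))
err = np.linalg.norm(g - g_hat) / np.linalg.norm(g)
print(f"compression ratio ~{d / k:.0f}x, relative recovery error = {err:.3f}")
```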
AdaGrad under Anisotropic Smoothness
Adaptive gradient methods have been widely adopted in training large-scale deep neural networks, especially large foundation models. Despite the huge success in practice, their …
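For reference, the update the entry analyzes is standard diagonal AdaGrad, where per-coordinate step sizes shrink with the accumulated squared gradients. The sketch below is the textbook rule applied to a toy anisotropic quadratic; it is not specific to the paper's analysis.

```python
# Standard (diagonal) AdaGrad: per-coordinate steps scaled by the inverse
# square root of accumulated squared gradients.
import numpy as np

def adagrad(grad_fn, x0, lr=0.5, eps=1e-8, steps=500):
    x = x0.copy()
    G = np.zeros_like(x)                 # running sum of squared gradients
    for _ in range(steps):
        g = grad_fn(x)
        G += g ** 2
        x -= lr * g / (np.sqrt(G) + eps)
    return x

# Toy anisotropic quadratic: very different curvature per coordinate.
A = np.array([100.0, 1.0, 0.01])
x = adagrad(lambda x: A * x, np.ones(3))
print("final iterate:", x)
```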