Adam-mini: Use fewer learning rates to gain more

Y Zhang, C Chen, Z Li, T Ding, C Wu… - arXiv preprint arXiv …, 2024 - arxiv.org
We propose Adam-mini, an optimizer that achieves performance on par with or better than
AdamW with a 50% smaller memory footprint. Adam-mini reduces memory by cutting down the …
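
The snippet only hints at the mechanism; below is a minimal sketch of the stated idea (keep Adam's per-coordinate first moment but share a single second-moment value per parameter block, which is where the memory saving comes from). The block partition (one block per tensor), hyperparameters, and function name are illustrative assumptions, not the paper's exact recipe.

```python
import numpy as np

def adam_mini_like_step(params, grads, m, v_block, t, lr=1e-3,
                        beta1=0.9, beta2=0.999, eps=1e-8):
    """One illustrative Adam-mini-style step (sketch, not the paper's algorithm).

    params, grads, m : dict name -> np.ndarray (per-coordinate first moment kept)
    v_block          : dict name -> float (one second-moment scalar per block,
                       replacing Adam's per-coordinate v; this is the memory saving)
    """
    for name, g in grads.items():
        m[name] = beta1 * m[name] + (1 - beta1) * g
        # One mean-of-squares statistic for the whole block. The partition used
        # here (one block per parameter tensor) is an assumption; Adam-mini's
        # actual partition is finer (e.g. per attention head).
        v_block[name] = beta2 * v_block[name] + (1 - beta2) * float(np.mean(g * g))
        m_hat = m[name] / (1 - beta1 ** t)
        v_hat = v_block[name] / (1 - beta2 ** t)
        params[name] -= lr * m_hat / (np.sqrt(v_hat) + eps)
    return params, m, v_block
```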

Soap: Improving and stabilizing shampoo using adam

N Vyas, D Morwani, R Zhao, I Shapira… - arXiv preprint arXiv …, 2024 - arxiv.org
There is growing evidence of the effectiveness of Shampoo, a higher-order preconditioning
method, over Adam in deep learning optimization tasks. However, Shampoo's drawbacks …
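
As a rough sketch of the combination the title names (running Adam-style moment updates in the eigenbasis of Shampoo's Kronecker-factor statistics, with the basis refreshed periodically), shown for a single matrix parameter. The refresh cadence, scaling, and the omission of moment rotation at refresh time are simplifying assumptions.

```python
import numpy as np

def soap_like_step(W, G, state, lr=3e-4, beta1=0.9, beta2=0.999,
                   eps=1e-8, refresh_every=10):
    """Illustrative step for one matrix parameter W with gradient G (sketch).

    Maintains Shampoo-style factor statistics L = sum G G^T and R = sum G^T G,
    rotates the gradient into their eigenbases, runs Adam-style moments there,
    and rotates the update back.
    """
    state.setdefault("t", 0)
    state.setdefault("L", np.zeros((W.shape[0], W.shape[0])))
    state.setdefault("R", np.zeros((W.shape[1], W.shape[1])))
    state.setdefault("m", np.zeros_like(W))
    state.setdefault("v", np.zeros_like(W))
    state["t"] += 1

    state["L"] += G @ G.T
    state["R"] += G.T @ G
    if state["t"] % refresh_every == 1:
        # Periodically recompute the eigenbases; rotating the Adam moments into
        # the new basis (which the full method would do) is omitted here.
        state["QL"] = np.linalg.eigh(state["L"])[1]
        state["QR"] = np.linalg.eigh(state["R"])[1]
    QL, QR = state["QL"], state["QR"]

    G_rot = QL.T @ G @ QR                      # gradient in the rotated basis
    state["m"] = beta1 * state["m"] + (1 - beta1) * G_rot
    state["v"] = beta2 * state["v"] + (1 - beta2) * G_rot ** 2
    update_rot = state["m"] / (np.sqrt(state["v"]) + eps)
    W -= lr * (QL @ update_rot @ QR.T)         # rotate the update back
    return W, state
```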

Cautious optimizers: Improving training with one line of code

K Liang, L Chen, B Liu, Q Liu - arXiv preprint arXiv:2411.16085, 2024 - arxiv.org
AdamW has been the default optimizer for transformer pretraining. For many years, our
community has searched for faster and more stable optimizers, with only constrained positive …
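
The "one line" is commonly described as masking out update coordinates whose sign disagrees with the current gradient; the rescaling by the mask's mean below is one plausible reading and may differ in detail from the paper.

```python
import numpy as np

def cautious_mask(update, grad, eps=1e-8):
    """Apply a cautious mask to a proposed optimizer update (e.g. AdamW's).

    Coordinates where the proposed update and the current gradient point in
    opposite directions are zeroed out; the survivors are rescaled by the
    mask's mean so the average step size is roughly preserved (the rescaling
    is an assumption).
    """
    mask = (update * grad > 0).astype(update.dtype)
    return update * mask / (mask.mean() + eps)

# Usage sketch: p -= lr * cautious_mask(adamw_update, grad)
```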

Rethinking conventional wisdom in machine learning: From generalization to scaling

L Xiao - arXiv preprint arXiv:2409.15156, 2024 - arxiv.org
The remarkable success of large language pretraining and the discovery of scaling laws
signify a paradigm shift in machine learning. Notably, the primary objective has evolved from …

JaColBERTv2.5: Optimising Multi-Vector Retrievers to Create State-of-the-Art Japanese Retrievers with Constrained Resources

B Clavié - arXiv preprint arXiv:2407.20750, 2024 - arxiv.org
Neural Information Retrieval has advanced rapidly in high-resource languages, but progress
in lower-resource ones such as Japanese has been hindered by data scarcity, among other …

4-bit Shampoo for Memory-Efficient Network Training

S Wang, P Zhou, J Li, H Huang - Advances in Neural …, 2025 - proceedings.neurips.cc
Second-order optimizers, maintaining a matrix termed a preconditioner, are superior to first-
order optimizers in both theory and practice. The states forming the preconditioner and its …
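
The snippet says the states forming the preconditioner are compressed; a generic per-block symmetric 4-bit quantize/dequantize pair is sketched below purely to illustrate how such states could be stored at 4 bits. It is not the paper's specific scheme (which targets the preconditioner's eigenvector matrix) and the block size is an assumption.

```python
import numpy as np

def quantize_4bit(x, block=64):
    """Per-block symmetric 4-bit quantization of a matrix (illustration only).

    Each block of 64 values gets one float scale; codes live in [-8, 7] and are
    stored here in int8 for simplicity (real implementations pack two per byte).
    """
    flat = x.reshape(-1)
    pad = (-len(flat)) % block
    flat = np.concatenate([flat, np.zeros(pad, dtype=x.dtype)])
    blocks = flat.reshape(-1, block)
    scales = np.abs(blocks).max(axis=1, keepdims=True) / 7.0 + 1e-12
    q = np.clip(np.round(blocks / scales), -8, 7).astype(np.int8)
    return q, scales, x.shape, pad

def dequantize_4bit(q, scales, shape, pad):
    """Reconstruct an approximate float matrix from the 4-bit codes."""
    flat = (q.astype(np.float32) * scales).reshape(-1)
    if pad:
        flat = flat[:-pad]
    return flat.reshape(shape)
```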

How Does Critical Batch Size Scale in Pre-training?

H Zhang, D Morwani, N Vyas, J Wu, D Zou… - arXiv preprint arXiv …, 2024 - arxiv.org
Training large-scale models under given resources requires careful design of parallelism
strategies. In particular, the efficiency notion of critical batch size (CBS), concerning the …

An adaptive stochastic gradient method with non-negative gauss-newton stepsizes

A Orvieto, L Xiao - arXiv preprint arXiv:2407.04358, 2024 - arxiv.org
We consider the problem of minimizing the average of a large number of smooth but
possibly non-convex functions. In the context of most machine learning applications, each …
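
A sketch of an SGD step with a Gauss-Newton-style non-negative stepsize. The formula gamma = sigma / (1 + sigma * ||g||^2 / (2 f(x))) used here is an assumption based on the general idea (it requires a non-negative loss) and may not match the paper's exact stepsize.

```python
import numpy as np

def ngn_like_step(x, loss_fn, grad_fn, sigma=0.5):
    """One SGD step with an adaptive, always non-negative stepsize (sketch)."""
    f = loss_fn(x)
    g = grad_fn(x)
    # Stepsize shrinks automatically when the gradient is large relative to the
    # loss value; sigma caps the stepsize from above.
    gamma = sigma / (1.0 + sigma * float(g @ g) / (2.0 * f + 1e-12))
    return x - gamma * g

# Example on the non-negative quadratic f(x) = 0.5 * ||x||^2:
x = np.array([2.0, -1.0])
for _ in range(20):
    x = ngn_like_step(x, lambda z: 0.5 * z @ z, lambda z: z)
```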

General framework for online-to-nonconvex conversion: Schedule-free SGD is also effective for nonconvex optimization

K Ahn, G Magakyan, A Cutkosky - arXiv preprint arXiv:2411.07061, 2024 - arxiv.org
This work investigates the effectiveness of schedule-free methods, developed by A. Defazio
et al. (NeurIPS 2024), in nonconvex optimization settings, inspired by their remarkable …
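
A simplified sketch of schedule-free SGD in the style of Defazio et al.: gradients are taken at an interpolation of a fast iterate and its running average, so no learning-rate schedule is needed. The warmup and weighting details of the full method are omitted here.

```python
import numpy as np

def schedule_free_sgd(grad_fn, x0, lr=0.1, beta=0.9, steps=100):
    """Schedule-free SGD sketch (simplified from Defazio et al., NeurIPS 2024)."""
    z = x0.copy()          # "fast" SGD iterate
    x = x0.copy()          # running average, returned for evaluation
    for t in range(1, steps + 1):
        y = (1 - beta) * z + beta * x      # point where the gradient is taken
        z = z - lr * grad_fn(y)
        c = 1.0 / t
        x = (1 - c) * x + c * z            # uniform average of the z iterates
    return x

# Example: minimize f(x) = 0.5 * ||x||^2
x_star = schedule_free_sgd(lambda v: v, np.array([3.0, -2.0]))
```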

AI-driven skin cancer diagnosis: Grad-CAM and expert annotations for enhanced interpretability

I Matas, C Serrano, F Silva, A Serrano… - arXiv preprint arXiv …, 2024 - arxiv.org
An AI tool has been developed to provide interpretable support for the diagnosis of basal cell carcinoma (BCC) via
teledermatology, thus speeding up referrals and optimizing resource utilization. The …