Recent advances in stochastic gradient descent in deep learning

Y Tian, Y Zhang, H Zhang - Mathematics, 2023 - mdpi.com
In the age of artificial intelligence, finding the best approach to handling huge amounts of data is a
tremendously motivating and hard problem. Among machine learning models, stochastic …
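
Since plain SGD is the common thread of the entries below, a minimal sketch may help fix notation: the update w ← w − η ∇f_i(w) applied to a toy least-squares problem. The function name `sgd` and the synthetic data are illustrative assumptions, not taken from the survey.

```python
import numpy as np

def sgd(X, y, lr=0.05, epochs=50, seed=0):
    """Plain SGD on least squares: one sampled example per update, w <- w - lr * grad_i(w)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        for i in rng.permutation(n):                  # one pass over shuffled data
            grad_i = 2.0 * (X[i] @ w - y[i]) * X[i]   # gradient of the i-th squared error
            w -= lr * grad_i
    return w

# Toy usage: recover a planted weight vector from noisy linear measurements.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
w_true = rng.normal(size=5)
y = X @ w_true + 0.01 * rng.normal(size=200)
print(np.linalg.norm(sgd(X, y) - w_true))             # small, up to the noise floor
```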

Variance-reduced methods for machine learning

RM Gower, M Schmidt, F Bach… - Proceedings of the …, 2020 - ieeexplore.ieee.org
Stochastic optimization lies at the heart of machine learning, and its cornerstone is
stochastic gradient descent (SGD), a method introduced over 60 years ago. The last eight …
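
As a concrete instance of the variance-reduced estimators this review surveys, here is an SVRG-style sketch (one family of methods covered, not code from the paper); the step size, epoch count, and toy least-squares problem are illustrative assumptions.

```python
import numpy as np

def svrg(grad_i, full_grad, w0, n, lr=0.005, epochs=30, seed=0):
    """SVRG-style sketch: inner steps use the variance-reduced estimate
    v = grad_i(w) - grad_i(w_snap) + full_grad(w_snap), which stays unbiased
    but has vanishing variance as w approaches the snapshot w_snap."""
    rng = np.random.default_rng(seed)
    w = w0.copy()
    for _ in range(epochs):
        w_snap = w.copy()
        mu = full_grad(w_snap)                 # full gradient at the snapshot
        for _ in range(n):
            i = rng.integers(n)
            v = grad_i(w, i) - grad_i(w_snap, i) + mu
            w -= lr * v
    return w

# Toy least-squares usage.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5)); w_true = rng.normal(size=5); y = X @ w_true
gi = lambda w, i: 2.0 * (X[i] @ w - y[i]) * X[i]
fg = lambda w: 2.0 * X.T @ (X @ w - y) / len(y)
print(np.linalg.norm(svrg(gi, fg, np.zeros(5), len(y)) - w_true))   # should be near zero
```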

SCAFFOLD: Stochastic controlled averaging for federated learning

SP Karimireddy, S Kale, M Mohri… - International …, 2020 - proceedings.mlr.press
Federated learning is a key scenario in modern large-scale machine learning where the
data remains distributed over a large number of clients and the task is to learn a centralized …
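
The core of SCAFFOLD is a control-variate correction of each client's local steps. Below is a condensed sketch of one round with full participation; the step sizes, the quadratic clients, and the function names are illustrative assumptions, and the control-variate refresh follows the paper's "option II" as I recall it.

```python
import numpy as np

def scaffold_round(x, c, clients, c_i, lr_local=0.05, lr_global=1.0, local_steps=10):
    """One SCAFFOLD round with full participation: every local step uses the
    drift-corrected gradient g_i(y) - c_i + c, then client and server control
    variates are refreshed from the distance travelled locally."""
    deltas_y, deltas_c = [], []
    for i, grad_fn in enumerate(clients):
        y = x.copy()
        for _ in range(local_steps):
            y -= lr_local * (grad_fn(y) - c_i[i] + c)              # corrected local step
        c_new = c_i[i] - c + (x - y) / (local_steps * lr_local)    # control-variate refresh
        deltas_y.append(y - x)
        deltas_c.append(c_new - c_i[i])
        c_i[i] = c_new
    x = x + lr_global * np.mean(deltas_y, axis=0)                  # server model update
    c = c + np.mean(deltas_c, axis=0)                              # server control variate
    return x, c

# Toy usage: two clients with heterogeneous quadratic objectives 0.5*w'A_i w - b_i'w.
A = [np.diag([1.0, 3.0]), np.diag([3.0, 1.0])]
b = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
clients = [lambda w, A=A[i], b=b[i]: A @ w - b for i in range(2)]  # per-client gradients
x, c = np.zeros(2), np.zeros(2)
c_i = [np.zeros(2), np.zeros(2)]
for _ in range(50):
    x, c = scaffold_round(x, c, clients, c_i)
print(x)   # close to [0.25, 0.25], the minimizer of the averaged objective
```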

AdaBelief optimizer: Adapting stepsizes by the belief in observed gradients

J Zhuang, T Tang, Y Ding… - Advances in neural …, 2020 - proceedings.neurips.cc
Most popular optimizers for deep learning can be broadly categorized as adaptive methods
(e.g., Adam) and accelerated schemes (e.g., stochastic gradient descent (SGD) with …
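
The "belief" idea is a small change to Adam: track the squared deviation of the gradient from its EMA instead of the raw squared gradient. A NumPy sketch of that update follows; the hyperparameters and noisy-quadratic usage are illustrative assumptions, and the extra ε inside the s recursion is included as in my reading of the paper's algorithm.

```python
import numpy as np

def adabelief_step(w, g, state, lr=1e-3, betas=(0.9, 0.999), eps=1e-8):
    """One AdaBelief update: identical to Adam except that the second moment s
    tracks the squared deviation (g - m)**2 of the gradient from its own EMA."""
    m, s, t = state["m"], state["s"], state["t"] + 1
    m = betas[0] * m + (1 - betas[0]) * g
    s = betas[1] * s + (1 - betas[1]) * (g - m) ** 2 + eps   # "belief" in the observed gradient
    m_hat = m / (1 - betas[0] ** t)                          # bias corrections, as in Adam
    s_hat = s / (1 - betas[1] ** t)
    state.update(m=m, s=s, t=t)
    return w - lr * m_hat / (np.sqrt(s_hat) + eps)

# Toy usage: noisy gradients of (w - 3)^2; w should settle near the minimizer 3.0.
rng = np.random.default_rng(0)
w, state = np.zeros(1), {"m": np.zeros(1), "s": np.zeros(1), "t": 0}
for _ in range(2000):
    g = 2.0 * (w - 3.0) + rng.normal()
    w = adabelief_step(w, g, state, lr=0.01)
print(w)
```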

Learning to reweight examples for robust deep learning

M Ren, W Zeng, B Yang… - … conference on machine …, 2018 - proceedings.mlr.press
Deep neural networks have been shown to be very powerful modeling tools for many
supervised learning tasks involving complex input patterns. However, they can also easily …

Federated optimization: Distributed machine learning for on-device intelligence

J Konečný, HB McMahan, D Ramage… - arXiv preprint arXiv …, 2016 - arxiv.org
We introduce a new and increasingly relevant setting for distributed optimization in machine
learning, where the data defining the optimization are unevenly distributed over an …
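
To make the setting concrete, here is a minimal local-training-plus-weighted-averaging round in the spirit of federated averaging; this is a generic sketch of the scenario with unevenly sized client shards, not the specific algorithms the paper develops, and all names and the toy task are assumptions.

```python
import numpy as np

def federated_round(x, client_data, lr=0.05, local_steps=10):
    """One round of local training plus weighted model averaging: each client
    runs gradient steps on its own shard, and the server averages by shard size."""
    models, sizes = [], []
    for X, y in client_data:
        w = x.copy()
        for _ in range(local_steps):
            g = 2.0 * X.T @ (X @ w - y) / len(y)      # local least-squares gradient
            w -= lr * g
        models.append(w)
        sizes.append(len(y))
    return np.average(models, axis=0, weights=sizes)  # size-weighted average

# Toy usage: three clients holding unevenly sized shards of the same linear task.
rng = np.random.default_rng(0)
w_true = rng.normal(size=4)
client_data = []
for n in (10, 50, 200):                               # uneven data distribution
    X = rng.normal(size=(n, 4))
    client_data.append((X, X @ w_true))
x = np.zeros(4)
for _ in range(100):
    x = federated_round(x, client_data)
print(np.linalg.norm(x - w_true))                     # should be close to zero
```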

Large batch optimization for deep learning: Training BERT in 76 minutes

Y You, J Li, S Reddi, J Hseu, S Kumar… - arXiv preprint arXiv …, 2019 - arxiv.org
Training large deep neural networks on massive datasets is computationally very
challenging. There has been a recent surge of interest in using large-batch stochastic …
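
The key ingredient in this line of work is a layer-wise trust ratio that rescales an Adam-like direction by ||w|| / ||direction||, so large-batch steps stay proportionate for every layer. The sketch below is an approximation from memory, not the paper's reference implementation: the clipping function φ is taken as the identity and the zero-norm fallback is an assumption.

```python
import numpy as np

def lamb_step(params, grads, state, lr=1e-2, betas=(0.9, 0.999), eps=1e-6, wd=0.01):
    """LAMB-style update sketch: an Adam-like direction per layer, rescaled by the
    layer-wise trust ratio ||w|| / ||direction|| so each layer takes a step of
    comparable relative size even with very large batches."""
    state["t"] += 1
    t = state["t"]
    new_params = []
    for k, (w, g) in enumerate(zip(params, grads)):
        m = betas[0] * state["m"][k] + (1 - betas[0]) * g
        v = betas[1] * state["v"][k] + (1 - betas[1]) * g * g
        state["m"][k], state["v"][k] = m, v
        m_hat = m / (1 - betas[0] ** t)
        v_hat = v / (1 - betas[1] ** t)
        u = m_hat / (np.sqrt(v_hat) + eps) + wd * w          # Adam direction + weight decay
        trust = np.linalg.norm(w) / (np.linalg.norm(u) + 1e-12)
        trust = trust if trust > 0 else 1.0                  # assumed fallback for zero-norm layers
        new_params.append(w - lr * trust * u)
    return new_params

# Toy usage: two parameter groups ("layers") on a simple quadratic bowl.
params = [np.full(3, 5.0), np.full(2, -2.0)]
state = {"m": [np.zeros_like(p) for p in params],
         "v": [np.zeros_like(p) for p in params], "t": 0}
for _ in range(500):
    grads = [2.0 * p for p in params]                        # gradient of sum(p**2) per layer
    params = lamb_step(params, grads, state, lr=0.05, wd=0.0)
print([float(np.linalg.norm(p)) for p in params])            # both layers shrink toward zero
```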

A survey of optimization methods from a machine learning perspective

S Sun, Z Cao, H Zhu, J Zhao - IEEE Transactions on Cybernetics, 2019 - ieeexplore.ieee.org
Machine learning is developing rapidly, has produced many theoretical breakthroughs, and is
widely applied in various fields. Optimization, as an important part of machine learning, has …

Linear convergence of gradient and proximal-gradient methods under the Polyak-Łojasiewicz condition

H Karimi, J Nutini, M Schmidt - … Conference, ECML PKDD 2016, Riva del …, 2016 - Springer
In 1963, Polyak proposed a simple condition that is sufficient to show a global linear
convergence rate for gradient descent. This condition is a special case of the Łojasiewicz …
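
For reference, the condition and the resulting rate can be stated in two lines; this restates the standard form of the result rather than quoting the paper.

```latex
% Polyak-Lojasiewicz (PL) condition and the resulting linear rate for gradient
% descent with step size 1/L on an L-smooth function f with optimal value f*.
\[
  \tfrac{1}{2}\,\lVert \nabla f(x) \rVert^{2} \;\ge\; \mu\,\bigl(f(x) - f^{\ast}\bigr)
  \quad \text{for all } x
\]
\[
  x_{k+1} = x_k - \tfrac{1}{L}\,\nabla f(x_k)
  \quad \Longrightarrow \quad
  f(x_k) - f^{\ast} \;\le\; \Bigl(1 - \tfrac{\mu}{L}\Bigr)^{k}\,\bigl(f(x_0) - f^{\ast}\bigr).
\]
```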

SPIDER: Near-optimal non-convex optimization via stochastic path-integrated differential estimator

C Fang, CJ Li, Z Lin, T Zhang - Advances in neural …, 2018 - proceedings.neurips.cc
In this paper, we propose a new technique named Stochastic Path-Integrated
Differential EstimatoR (SPIDER), which can be used to track many deterministic quantities of …
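
The estimator itself is a recursive, path-integrated correction that is refreshed periodically with a full (or large-batch) gradient. A NumPy sketch follows; the paper pairs the estimator with a normalized step size, whereas this sketch uses a plain step, and the batch size, refresh period q, and toy problem are illustrative assumptions.

```python
import numpy as np

def spider(grad_batch, full_grad, w0, n, lr=0.01, iters=2000, q=50, batch=10, seed=0):
    """SPIDER-style estimator sketch: refresh v with a full gradient every q steps,
    otherwise propagate v <- grad_B(w) - grad_B(w_prev) + v along the iterate path."""
    rng = np.random.default_rng(seed)
    w = w0.copy()
    w_prev, v = None, None
    for k in range(iters):
        if k % q == 0:
            v = full_grad(w)                                       # periodic refresh
        else:
            idx = rng.integers(n, size=batch)
            v = grad_batch(w, idx) - grad_batch(w_prev, idx) + v   # path-integrated correction
        w_prev = w.copy()
        w -= lr * v
    return w

# Toy least-squares usage.
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 5)); w_true = rng.normal(size=5); y = X @ w_true
gb = lambda w, idx: 2.0 * X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)
fg = lambda w: 2.0 * X.T @ (X @ w - y) / len(y)
print(np.linalg.norm(spider(gb, fg, np.zeros(5), len(y)) - w_true))   # should be small
```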