Recent advances in stochastic gradient descent in deep learning
In the age of artificial intelligence, finding the best approach to handling huge amounts of data is a tremendously motivating and hard problem. Among machine learning models, stochastic …
Variance-reduced methods for machine learning
Stochastic optimization lies at the heart of machine learning, and its cornerstone is
stochastic gradient descent (SGD), a method introduced over 60 years ago. The last eight …
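For orientation, a minimal sketch of the contrast behind the title, stated in the standard finite-sum setting f(x) = (1/n) \sum_i f_i(x); the notation below is assumed for illustration rather than taken from this abstract. Plain SGD steps along a single-sample gradient, while an SVRG-style variance-reduced method corrects that gradient with a periodically refreshed snapshot point \tilde{x}:

\[
x_{k+1} = x_k - \eta\,\nabla f_{i_k}(x_k)
\qquad\text{vs.}\qquad
x_{k+1} = x_k - \eta\,\bigl(\nabla f_{i_k}(x_k) - \nabla f_{i_k}(\tilde{x}) + \nabla f(\tilde{x})\bigr).
\]

Both gradient estimators are unbiased, but the variance of the corrected one shrinks as x_k and \tilde{x} approach a minimizer, which is what enables the faster (often linear) convergence such methods are known for.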
SCAFFOLD: Stochastic controlled averaging for federated learning
Federated learning is a key scenario in modern large-scale machine learning where the
data remains distributed over a large number of clients and the task is to learn a centralized …
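To make the setting concrete, here is a minimal Python sketch of one round of generic federated averaging (FedAvg-style), which this line of work builds on; the function names, model, and data are illustrative assumptions, not the paper's code. SCAFFOLD additionally maintains server and client control variates to correct for the drift caused by heterogeneous local updates; that correction is omitted here.

    import numpy as np

    def client_update(weights, data, targets, lr=0.1, local_steps=5):
        # A few local SGD steps on one client's data (linear model, squared loss).
        w = weights.copy()
        for _ in range(local_steps):
            grad = data.T @ (data @ w - targets) / len(targets)
            w -= lr * grad
        return w

    def federated_round(global_weights, clients):
        # One communication round: broadcast, local training, size-weighted averaging.
        updates, sizes = [], []
        for data, targets in clients:
            updates.append(client_update(global_weights, data, targets))
            sizes.append(len(targets))
        sizes = np.array(sizes, dtype=float)
        return np.average(updates, axis=0, weights=sizes / sizes.sum())

    # Illustrative usage: three clients with unevenly sized synthetic datasets.
    rng = np.random.default_rng(0)
    true_w = np.array([1.0, -2.0])
    clients = []
    for n in (50, 20, 5):
        X = rng.normal(size=(n, 2))
        clients.append((X, X @ true_w + 0.01 * rng.normal(size=n)))

    w = np.zeros(2)
    for _ in range(30):
        w = federated_round(w, clients)
    print(w)  # should move toward true_w

The size-weighted average mirrors the fact that clients hold unequal amounts of data, which is exactly the heterogeneity these federated methods are designed around.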
AdaBelief optimizer: Adapting stepsizes by the belief in observed gradients
Most popular optimizers for deep learning can be broadly categorized as adaptive methods
(e.g., Adam) and accelerated schemes (e.g., stochastic gradient descent (SGD) with …
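To illustrate the two families the abstract names, here is a minimal NumPy sketch of one parameter update for an accelerated scheme (SGD with momentum) and an adaptive method (Adam); the hyperparameter values and names are illustrative assumptions, not taken from the paper.

    import numpy as np

    def sgd_momentum_step(w, g, state, lr=0.01, beta=0.9):
        # Accelerated scheme: heavy-ball momentum accumulates past gradients.
        state["m"] = beta * state.get("m", 0.0) + g
        return w - lr * state["m"]

    def adam_step(w, g, state, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
        # Adaptive method: rescale the step by a running second moment of g.
        state["m"] = beta1 * state.get("m", 0.0) + (1 - beta1) * g
        state["v"] = beta2 * state.get("v", 0.0) + (1 - beta2) * g**2
        m_hat = state["m"] / (1 - beta1**t)  # bias correction, t starts at 1
        v_hat = state["v"] / (1 - beta2**t)
        return w - lr * m_hat / (np.sqrt(v_hat) + eps)

    # AdaBelief (per the paper's title) instead tracks a running estimate of
    # (g - m)**2, i.e. the deviation of the observed gradient from its running
    # mean ("belief"); only the two baseline families are sketched here.

Here w and g are NumPy arrays, state is a per-parameter dict, and t is the 1-based step counter used for Adam's bias correction.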
Learning to reweight examples for robust deep learning
Deep neural networks have been shown to be very powerful modeling tools for many
supervised learning tasks involving complex input patterns. However, they can also easily …
Federated optimization: Distributed machine learning for on-device intelligence
We introduce a new and increasingly relevant setting for distributed optimization in machine
learning, where the data defining the optimization are unevenly distributed over an …
Large batch optimization for deep learning: Training BERT in 76 minutes
Training large deep neural networks on massive datasets is computationally very
challenging. There has been a recent surge of interest in using large batch stochastic …
A survey of optimization methods from a machine learning perspective
Machine learning is developing rapidly; it has produced many theoretical breakthroughs and is
widely applied in various fields. Optimization, as an important part of machine learning, has …
Linear convergence of gradient and proximal-gradient methods under the Polyak-Łojasiewicz condition
In 1963, Polyak proposed a simple condition that is sufficient to show a global linear
convergence rate for gradient descent. This condition is a special case of the Łojasiewicz …
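For reference, the condition in question in its standard form (with f^* the optimal value and \mu > 0 the PL constant; this statement comes from standard sources rather than the truncated snippet):

\[
\tfrac{1}{2}\,\lVert \nabla f(x) \rVert^{2} \;\ge\; \mu\,\bigl(f(x) - f^{*}\bigr) \quad \text{for all } x.
\]

For an L-smooth function satisfying it, gradient descent with step size 1/L obeys the global linear rate

\[
f(x_k) - f^{*} \;\le\; \Bigl(1 - \tfrac{\mu}{L}\Bigr)^{k}\,\bigl(f(x_0) - f^{*}\bigr),
\]

without requiring convexity.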
SPIDER: Near-optimal non-convex optimization via stochastic path-integrated differential estimator
In this paper, we propose a new technique named Stochastic Path-Integrated Differential EstimatoR (SPIDER), which can be used to track many deterministic quantities of …
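As a hedged illustration of the estimator the title names (the recursion below is its commonly cited form; the minibatch S_k, refresh period q, and symbols are assumptions rather than quotes from this snippet): SPIDER maintains a running estimate v_k of the full gradient \nabla f(x_k) via

\[
v_k \;=\; v_{k-1} \;+\; \nabla f_{S_k}(x_k) \;-\; \nabla f_{S_k}(x_{k-1}),
\]

with v_k reset to a full (or large-batch) gradient every q iterations and the iterate moved along v_k with a normalized step, which is what underlies the near-optimal gradient-query complexity for non-convex problems claimed in the title.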