Recent advances in stochastic gradient descent in deep learning

Y Tian, Y Zhang, H Zhang - Mathematics, 2023 - mdpi.com
In the age of artificial intelligence, finding the best approach to handling huge amounts of data is a
tremendously motivating and hard problem. Among machine learning models, stochastic …
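
For orientation, the update at the heart of the methods this survey covers is the plain stochastic gradient step w <- w - lr * g, where g is a gradient estimated on a mini-batch. A minimal sketch on a synthetic least-squares problem (the data, learning rate, and batch size below are illustrative assumptions, not taken from the survey):

    # Minimal mini-batch SGD on a synthetic least-squares problem.
    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 10))                              # synthetic data (assumption)
    y = X @ rng.normal(size=10) + 0.1 * rng.normal(size=1000)

    w = np.zeros(10)
    lr, batch = 0.1, 32
    for step in range(500):
        idx = rng.choice(len(X), size=batch, replace=False)      # draw a mini-batch
        g = X[idx].T @ (X[idx] @ w - y[idx]) / batch             # stochastic gradient of the squared loss
        w -= lr * g                                              # SGD step: w <- w - lr * g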

Federated optimization: Distributed machine learning for on-device intelligence

J Konečný, HB McMahan, D Ramage… - arXiv preprint arXiv …, 2016 - arxiv.org
We introduce a new and increasingly relevant setting for distributed optimization in machine
learning, where the data defining the optimization are unevenly distributed over an …
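
To make the setting concrete, one communication round of a generic federated scheme can be sketched as follows: clients with unevenly sized local datasets take a few local SGD steps, and the server averages the resulting models weighted by data size. This FedAvg-style sketch only illustrates the problem setting, not the specific algorithms proposed in the paper; all data and hyperparameters are made up.

    # One generic federated-averaging round over clients with uneven data sizes.
    import numpy as np

    def local_sgd(w, X, y, rng, lr=0.05, steps=10, batch=8):
        w = w.copy()
        for _ in range(steps):
            idx = rng.choice(len(X), size=min(batch, len(X)), replace=False)
            w -= lr * X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)   # local stochastic gradient step
        return w

    rng = np.random.default_rng(1)
    d = 5
    clients = []
    for n in (20, 200, 50):                                         # unevenly distributed data
        X = rng.normal(size=(n, d))
        clients.append((X, X @ np.ones(d) + 0.1 * rng.normal(size=n)))

    w_global = np.zeros(d)
    for round_ in range(20):                                        # communication rounds
        local_models = [local_sgd(w_global, X, y, rng) for X, y in clients]
        sizes = [len(X) for X, _ in clients]
        w_global = np.average(local_models, axis=0, weights=sizes)  # size-weighted model average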

Gradient sparsification for communication-efficient distributed optimization

J Wangni, J Wang, J Liu… - Advances in Neural …, 2018 - proceedings.neurips.cc
Modern large-scale machine learning applications require stochastic optimization
algorithms to be implemented on distributed computational architectures. A key bottleneck is …
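
One widely used way to sparsify a stochastic gradient without biasing it is to keep coordinate i with some probability p_i and rescale the kept value by 1/p_i, so the compressed vector equals the true gradient in expectation. The magnitude-proportional probabilities below are chosen purely for illustration and need not match the probabilities optimized in the paper.

    # Unbiased random sparsification: drop coordinates at random, rescale survivors.
    import numpy as np

    def sparsify(g, budget, rng):
        p = np.minimum(1.0, budget * np.abs(g) / np.abs(g).sum())  # keep-probabilities (illustrative choice)
        keep = rng.random(g.shape) < p
        out = np.zeros_like(g)
        out[keep] = g[keep] / p[keep]                              # rescaling keeps E[out] = g
        return out

    rng = np.random.default_rng(0)
    g = rng.normal(size=1000)
    g_sparse = sparsify(g, budget=100, rng=rng)                    # roughly 100 nonzeros survive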

Optimization methods for large-scale machine learning

L Bottou, FE Curtis, J Nocedal - SIAM Review, 2018 - SIAM
This paper provides a review and commentary on the past, present, and future of numerical
optimization algorithms in the context of machine learning applications. Through case …

Atomo: Communication-efficient learning via atomic sparsification

H Wang, S Sievert, S Liu, Z Charles… - Advances in neural …, 2018 - proceedings.neurips.cc
Distributed model training suffers from communication overheads due to frequent gradient
updates transmitted between compute nodes. To mitigate these overheads, several studies …
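
The "atomic" view can be sketched by writing a matrix gradient in an atom basis such as its singular value decomposition, G = sum_i s_i u_i v_i^T, and randomly keeping atoms with rescaling so the estimate stays unbiased. The sampling probabilities below are a simple heuristic stand-in, not the optimized allocation derived in the paper.

    # Rough illustration: unbiased sparsification of a matrix gradient in its SVD atom basis.
    import numpy as np

    def svd_sparsify(G, budget, rng):
        U, s, Vt = np.linalg.svd(G, full_matrices=False)
        p = np.minimum(1.0, budget * s / s.sum())                  # keep-probabilities proportional to singular values (assumption)
        keep = rng.random(s.shape) < p
        s_hat = np.where(keep, s / np.where(p > 0, p, 1.0), 0.0)   # rescale kept atoms, drop the rest
        return (U * s_hat) @ Vt                                    # low-rank, unbiased estimate of G

    rng = np.random.default_rng(0)
    G = rng.normal(size=(64, 32))
    G_hat = svd_sparsify(G, budget=8, rng=rng)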

Linear convergence of gradient and proximal-gradient methods under the Polyak-Łojasiewicz condition

H Karimi, J Nutini, M Schmidt - Joint European conference on machine …, 2016 - Springer
In 1963, Polyak proposed a simple condition that is sufficient to show a global linear
convergence rate for gradient descent. This condition is a special case of the Łojasiewicz …
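
For reference, the Polyak-Łojasiewicz (PL) inequality and the rate it yields can be stated as follows (standard form of the result for an L-smooth function f with minimum value f^*; not quoted from the paper):

    \frac{1}{2}\,\|\nabla f(x)\|^2 \;\ge\; \mu\,\bigl(f(x) - f^*\bigr) \quad \text{for all } x,

and gradient descent with step size 1/L then satisfies

    f(x_k) - f^* \;\le\; \Bigl(1 - \frac{\mu}{L}\Bigr)^{k}\,\bigl(f(x_0) - f^*\bigr),

a global linear rate that does not require convexity.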

LAG: Lazily aggregated gradient for communication-efficient distributed learning

T Chen, G Giannakis, T Sun… - Advances in neural …, 2018 - proceedings.neurips.cc
This paper presents a new class of gradient methods for distributed machine learning that
adaptively skip the gradient calculations to learn with reduced communication and …
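
The skipping idea can be sketched as follows: each worker re-transmits its gradient only when it has changed sufficiently since the last copy the server received, and the server aggregates possibly stale gradients in the meantime. The fixed threshold below is a simplified stand-in for the adaptive rule derived in the paper; data and step sizes are made up.

    # Simplified lazily aggregated gradients: skip communication when a worker's
    # gradient has barely changed since the last transmitted copy.
    import numpy as np

    rng = np.random.default_rng(0)
    d, lr, thresh = 5, 0.05, 1e-2
    datasets = []
    for _ in range(4):                                             # four workers
        X = rng.normal(size=(50, d))
        datasets.append((X, X @ np.ones(d) + 0.1 * rng.normal(size=50)))

    w = np.zeros(d)
    last_sent = [np.zeros(d) for _ in datasets]
    for step in range(200):
        for m, (X, y) in enumerate(datasets):
            g = X.T @ (X @ w - y) / len(X)                         # fresh local gradient
            if np.linalg.norm(g - last_sent[m]) > thresh:
                last_sent[m] = g                                   # communicate only on sufficient change
        w -= lr * np.mean(last_sent, axis=0)                       # aggregate (possibly stale) gradients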

Coordinate descent algorithms

SJ Wright - Mathematical programming, 2015 - Springer
Coordinate descent algorithms solve optimization problems by successively performing
approximate minimization along coordinate directions or coordinate hyperplanes. They have …
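
A minimal concrete instance is cyclic coordinate descent on a least-squares objective, where each inner step exactly minimizes along one coordinate while the others are held fixed (synthetic data below, purely for illustration):

    # Cyclic coordinate descent on 0.5 * ||X w - y||^2 with exact coordinate minimization.
    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 10))
    y = X @ rng.normal(size=10) + 0.05 * rng.normal(size=200)

    w = np.zeros(10)
    r = y - X @ w                                # residual, kept up to date incrementally
    col_sq = (X ** 2).sum(axis=0)
    for sweep in range(50):
        for j in range(10):
            step = X[:, j] @ r / col_sq[j]       # exact minimizer along coordinate j
            w[j] += step
            r -= step * X[:, j]                  # cheap residual update after the coordinate move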

Asynchronous parallel stochastic gradient for nonconvex optimization

X Lian, Y Huang, Y Li, J Liu - Advances in neural …, 2015 - proceedings.neurips.cc
Asynchronous parallel implementations of stochastic gradient (SG) have been broadly used
in training deep neural networks and have achieved many successes in practice recently …
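
The pattern being analyzed can be illustrated, very loosely, with threads that read a possibly stale snapshot of shared parameters, compute a stochastic gradient, and apply their update without waiting for one another. Python threads here only show the access pattern; the paper concerns genuinely parallel asynchronous implementations, and the problem, data, and step sizes are made up.

    # Toy asynchronous SGD: workers update shared parameters without synchronization.
    import threading
    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(2000, 10))
    y = X @ np.ones(10) + 0.1 * rng.normal(size=2000)
    w = np.zeros(10)                                               # shared parameter vector

    def worker(seed, steps=300, lr=0.05, batch=16):
        local_rng = np.random.default_rng(seed)
        for _ in range(steps):
            idx = local_rng.choice(len(X), size=batch, replace=False)
            snapshot = w.copy()                                    # read a possibly stale copy
            g = X[idx].T @ (X[idx] @ snapshot - y[idx]) / batch    # stochastic gradient at the stale point
            w[:] -= lr * g                                         # in-place update, no lock

    threads = [threading.Thread(target=worker, args=(s,)) for s in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()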

A unified algorithmic framework for block-structured optimization involving big data: With applications in machine learning and signal processing

M Hong, M Razaviyayn, ZQ Luo… - IEEE Signal Processing …, 2015 - ieeexplore.ieee.org
This article presents a powerful algorithmic framework for big data optimization, called the
block successive upper-bound minimization (BSUM). The BSUM includes as special cases …
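
The general form of the scheme is worth recalling (standard statement of BSUM, not quoted from the article): at iteration r a block k is selected and updated by minimizing a locally tight upper bound of the objective in that block, while the other blocks stay fixed,

    x_k^{r+1} \in \arg\min_{x_k \in X_k} \; u_k\bigl(x_k;\, x^{r}\bigr), \qquad x_j^{r+1} = x_j^{r} \ \text{for } j \ne k,

where the surrogate satisfies u_k(x_k; x^r) \ge f(x_k, x_{-k}^r) for all x_k and u_k(x_k^r; x^r) = f(x^r). Block coordinate descent and many proximal and majorization-minimization methods fit this template under different choices of u_k.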