Don't use large mini-batches, use local SGD
T Lin, SU Stich, KK Patel, M Jaggi - arXiv preprint arXiv…, 2018 - arxiv.org
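Since the entry above names the algorithm but its snippet is missing, here is a minimal single-process sketch of the local SGD idea: K workers each take H local SGD steps on private data shards, then average their parameters. The synthetic linear-regression task, shard sizes, and hyperparameters are illustrative assumptions, not the paper's experimental setup.

```python
# Minimal sketch of local SGD: K workers run H local steps on their own
# shard, then synchronize by parameter averaging. Illustrative only.
import numpy as np

rng = np.random.default_rng(0)

K, H, eta, rounds = 4, 8, 0.05, 50          # workers, local steps, lr, rounds
w_true = rng.normal(size=5)                  # ground-truth weights (assumed task)

# Each worker holds a private shard of a synthetic regression problem.
shards = []
for _ in range(K):
    X = rng.normal(size=(256, 5))
    y = X @ w_true + 0.1 * rng.normal(size=256)
    shards.append((X, y))

w = np.zeros(5)                              # globally synchronized parameters
for _ in range(rounds):
    local_models = []
    for X, y in shards:
        w_k = w.copy()
        for _ in range(H):                   # H local SGD steps, batch size 32
            idx = rng.choice(len(X), size=32, replace=False)
            grad = 2 * X[idx].T @ (X[idx] @ w_k - y[idx]) / len(idx)
            w_k -= eta * grad
        local_models.append(w_k)
    w = np.mean(local_models, axis=0)        # synchronize: average parameters

print("distance to w_true:", np.linalg.norm(w - w_true))
```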
The anisotropic noise in stochastic gradient descent: Its behavior of escaping from sharp minima and regularization effects
Z Zhu, J Wu, B Yu, L Wu, J Ma - arXiv preprint arXiv…, 2018 - arxiv.org
Stochastic training is not necessary for generalization
J Geiping, M Goldblum, PE Pope, M Moeller… - arXiv preprint arXiv…, 2021 - arxiv.org
It is widely believed that the implicit regularization of SGD is fundamental to the impressive
generalization behavior we observe in neural networks. In this work, we demonstrate that …
A diffusion theory for deep learning dynamics: Stochastic gradient descent exponentially favors flat minima
Z Xie, I Sato, M Sugiyama - arXiv preprint arXiv…, 2020 - arxiv.org
Stochastic Gradient Descent (SGD) and its variants are mainstream methods for training
deep networks in practice. SGD is known to find a flat minimum that often generalizes well …
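As a hedged illustration of the kind of statement this title makes, consider a one-dimensional Kramers-rate heuristic; this is not the paper's theorem, which treats anisotropic, Hessian-dependent SGD noise in full:

```latex
% Model SGD near a minimum a as a diffusion with effective temperature
% T proportional to (learning rate)/(batch size):
\[
  \mathrm{d}\theta_t = -L'(\theta_t)\,\mathrm{d}t + \sqrt{2T}\,\mathrm{d}W_t,
  \qquad T \propto \frac{\eta}{B}.
\]
% Kramers' formula for the mean time to escape over a saddle c with
% barrier height \Delta L = L(c) - L(a):
\[
  \tau_{\mathrm{escape}} \approx
  \frac{2\pi}{\sqrt{L''(a)\,\lvert L''(c)\rvert}}
  \exp\!\left(\frac{\Delta L}{T}\right).
\]
% If the gradient-noise scale near a grows with the curvature L''(a), the
% exponent shrinks for sharp minima, so they are escaped exponentially
% faster; this is the sense in which SGD favors flat minima.
```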
Stochastic modified equations and dynamics of stochastic gradient algorithms I: Mathematical foundations
Q Li, C Tai, W E - Journal of Machine Learning Research, 2019 - jmlr.org
We develop the mathematical foundations of the stochastic modified equations (SME)
framework for analyzing the dynamics of stochastic gradient algorithms, where the latter is …
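Concretely, the SME framework approximates the SGD iteration x_{k+1} = x_k - eta * g(x_k), where g is an unbiased stochastic gradient of f with covariance Sigma(x), by an Ito SDE; a sketch of the framework's basic first-order object:

```latex
% First-order stochastic modified equation: SGD iterates x_k with step
% size \eta are weakly approximated, to order \eta, by X_t at t = k\eta.
\[
  \mathrm{d}X_t = -\nabla f(X_t)\,\mathrm{d}t
  + \sqrt{\eta}\,\Sigma(X_t)^{1/2}\,\mathrm{d}W_t,
\]
% where \Sigma(x) is the covariance of the stochastic gradient at x and
% W_t is a standard Wiener process.
```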
Neural networks as interacting particle systems: Asymptotic convexity of the loss landscape and universal scaling of the approximation error
GM Rotskoff, E Vanden-Eijnden - arXiv preprint arXiv…, 2018 - arxiv.org
Neural networks, a central tool in machine learning, have demonstrated remarkable, high
fidelity performance on image recognition and classification tasks. These successes evince …
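For context, the interacting-particle view writes a two-layer network with n units in mean-field (1/n) scaling; the notation below is mine, a sketch of the setup named in the title:

```latex
% Mean-field scaling of a two-layer network with n units ("particles"):
\[
  f_n(x) = \frac{1}{n}\sum_{i=1}^{n} c_i\,\sigma(x;\theta_i).
\]
% As n \to \infty, the empirical measure over the parameters
% (c_i, \theta_i) evolves by a gradient-flow PDE; the loss becomes convex
% as a functional of that measure, and the approximation error decays at
% a universal O(1/n) rate, matching the title's claims.
```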