Don't use large mini-batches, use local SGD

T Lin, SU Stich, KK Patel, M Jaggi - arXiv …

… from sharp minima and regularization effects

Z Zhu, J Wu, B Yu, L Wu, J Ma - arXiv …

…, M Goldblum, PE Pope, M Moeller… - arXiv preprint arXiv …, 2021 - arxiv.org
It is widely believed that the implicit regularization of SGD is fundamental to the impressive
generalization behavior we observe in neural networks. In this work, we demonstrate that …

A diffusion theory for deep learning dynamics: Stochastic gradient descent exponentially favors flat minima

Z Xie, I Sato, M Sugiyama - arXiv preprint arXiv:2002.03495, 2020 - arxiv.org
Stochastic Gradient Descent (SGD) and its variants are mainstream methods for training
deep networks in practice. SGD is known to find a flat minimum that often generalizes well …

Stochastic modified equations and dynamics of stochastic gradient algorithms I: Mathematical foundations

Q Li, C Tai, E Weinan - Journal of Machine Learning Research, 2019 - jmlr.org
We develop the mathematical foundations of the stochastic modified equations (SME)
framework for analyzing the dynamics of stochastic gradient algorithms, where the latter is …

Neural networks as interacting particle systems: Asymptotic convexity of the loss landscape and universal scaling of the approximation error

GM Rotskoff, E Vanden-Eijnden - stat, 2018 - researchgate.net
Neural networks, a central tool in machine learning, have demonstrated remarkable, high
fidelity performance on image recognition and classification tasks. These successes evince …