A high-bias, low-variance introduction to machine learning for physicists

P Mehta, M Bukov, CH Wang, AGR Day, C Richardson… - Physics Reports, 2019 - Elsevier
Machine Learning (ML) is one of the most exciting and dynamic areas of modern
research and application. The purpose of this review is to provide an introduction to the core …

Nonconvex optimization meets low-rank matrix factorization: An overview

Y Chi, YM Lu, Y Chen - IEEE Transactions on Signal …, 2019 - ieeexplore.ieee.org
Substantial progress has been made recently on developing provably accurate and efficient
algorithms for low-rank matrix factorization via nonconvex optimization. While conventional …
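
A minimal sketch of the nonconvex approach this line of work studies (the sizes, step size, and loss here are illustrative choices, not the survey's algorithms): parameterize the rank-$r$ matrix as $UV^\top$ with tall, thin factors and run plain gradient descent on the squared Frobenius loss, instead of solving a convex relaxation over the full matrix.

import numpy as np

rng = np.random.default_rng(0)
d, r = 30, 2
M = rng.standard_normal((d, r)) @ rng.standard_normal((r, d))   # ground-truth rank-r matrix
M /= np.linalg.norm(M, 2)                                       # normalize its spectral norm to 1

# Nonconvex (factored) parameterization: approximate M by U @ V.T
U = 0.1 * rng.standard_normal((d, r))
V = 0.1 * rng.standard_normal((d, r))
eta = 0.2
for _ in range(500):
    R = U @ V.T - M                      # residual
    gU, gV = R @ V, R.T @ U              # gradients of 0.5 * ||U V^T - M||_F^2
    U, V = U - eta * gU, V - eta * gV

print(np.linalg.norm(U @ V.T - M) / np.linalg.norm(M))          # relative error: near zero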

Gradient descent finds global minima of deep neural networks

S Du, J Lee, H Li, L Wang… - … conference on machine …, 2019 - proceedings.mlr.press
Gradient descent finds a global minimum in training deep neural networks despite the
objective function being non-convex. The current paper proves gradient descent achieves …
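
Loosely, the style of guarantee proved in this line of work: write $\mathbf{u}(k)$ for the network's predictions on the $n$ training inputs after $k$ full-batch gradient steps and $\mathbf{H}^{\infty}$ for a Gram matrix determined by the data and the architecture. If the network is sufficiently over-parameterized and the step size $\eta$ is small enough, then

$$\|\mathbf{y}-\mathbf{u}(k+1)\|_2^2 \;\le\; \left(1-\frac{\eta\,\lambda_{\min}(\mathbf{H}^{\infty})}{2}\right)\|\mathbf{y}-\mathbf{u}(k)\|_2^2,$$

so the training loss decays geometrically to zero despite the non-convexity; the over-parameterization is what keeps the time-varying Gram matrix close to $\mathbf{H}^{\infty}$ and its smallest eigenvalue bounded away from zero.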

Gradient descent provably optimizes over-parameterized neural networks

SS Du, X Zhai, B Poczos, A Singh - arXiv preprint arXiv:1810.02054, 2018 - arxiv.org
One of the mysteries in the success of neural networks is that randomly initialized first-order
methods like gradient descent can achieve zero training loss even though the objective …
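
A toy sketch of the setting (a two-layer ReLU network with fixed $\pm 1$ output weights, a common simplification in this literature; the width, step size, and data are illustrative choices): with enough hidden units, full-batch gradient descent from random initialization drives the training loss to essentially zero even for arbitrary labels.

import numpy as np

rng = np.random.default_rng(0)
n, d, m = 10, 50, 1000                           # n points, input dimension d, hidden width m >> n
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)    # unit-norm inputs
y = rng.standard_normal(n)                       # arbitrary (even random) labels

W = rng.standard_normal((m, d))                  # trainable first layer, random initialization
a = rng.choice([-1.0, 1.0], size=m)              # fixed +/-1 output layer

eta = 1.0
for t in range(201):
    Z = X @ W.T                                  # pre-activations, shape (n, m)
    u = (np.maximum(Z, 0.0) @ a) / np.sqrt(m)    # network outputs
    r = u - y
    if t % 50 == 0:
        print(t, 0.5 * np.sum(r ** 2))           # training loss heading to ~0
    gW = ((r[:, None] * (Z > 0) * a).T @ X) / np.sqrt(m)   # gradient of 0.5 * ||u - y||^2 w.r.t. W
    W -= eta * gW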

Dying ReLU and initialization: Theory and numerical examples

L Lu, Y Shin, Y Su, GE Karniadakis - arXiv preprint arXiv:1903.06733, 2019 - arxiv.org
The dying ReLU refers to the problem when ReLU neurons become inactive and only output
0 for any input. There are many empirical and heuristic explanations of why ReLU neurons …
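
A self-contained numerical illustration of the definition (a toy example, not taken from the paper): a ReLU unit whose pre-activation is negative on every training input outputs 0 everywhere, receives zero gradient, and therefore cannot recover under gradient-based training.

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 10))

# One ReLU unit h(x) = max(w.x + b, 0). A sufficiently negative bias (e.g. from a bad
# initialization or a large gradient step) makes the pre-activation negative on every input.
w = rng.standard_normal(10)
b = -20.0

pre = X @ w + b
out = np.maximum(pre, 0.0)
active = pre > 0                            # inputs on which the unit would pass a gradient

print("active inputs:", int(active.sum()))  # 0 -> the unit is "dead"
print("max output:", out.max())             # 0.0
# The output and its gradient are identically 0 on the data, so backpropagation
# sends no signal to w or b, and the unit stays dead for the rest of training.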

First-order methods almost always avoid strict saddle points

JD Lee, I Panageas, G Piliouras, M Simchowitz… - Mathematical …, 2019 - Springer
We establish that first-order methods avoid strict saddle points for almost all initializations.
Our results apply to a wide variety of first-order methods, including (manifold) gradient …
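
A toy illustration of the statement (the function below is an illustrative example, not from the paper): f(x, y) = x^2/2 - y^2/2 + y^4/4 has a strict saddle at the origin and minima at (0, +/-1); gradient descent started exactly on the saddle's stable manifold (y = 0) converges to the saddle, but that set has measure zero, so a generic random initialization escapes it and reaches a minimizer.

import numpy as np

# f(x, y) = 0.5*x**2 - 0.5*y**2 + 0.25*y**4
# (0, 0) is a strict saddle (the Hessian has a -1 eigenvalue); the minimizers are (0, +/-1).
def grad(p):
    x, y = p
    return np.array([x, -y + y**3])

def gd(p0, eta=0.1, steps=500):
    p = np.array(p0, dtype=float)
    for _ in range(steps):
        p = p - eta * grad(p)
    return p

print(gd([0.5, 0.0]))                        # on the stable manifold -> stuck near the saddle (0, 0)
rng = np.random.default_rng(0)
print(gd(0.5 * rng.standard_normal(2)))      # generic random start -> ends near a minimizer (0, +/-1)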

Understanding the acceleration phenomenon via high-resolution differential equations

B Shi, SS Du, MI Jordan, WJ Su - Mathematical Programming, 2022 - Springer
Gradient-based optimization algorithms can be studied from the perspective of limiting
ordinary differential equations (ODEs). Motivated by the fact that existing ODEs do not …
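
For context (this is the standard "low-resolution" limit from the earlier literature that the paper refines): with step size $s \to 0$, Nesterov's accelerated gradient method for convex $f$ tracks the ODE

$$\ddot{X}(t) + \frac{3}{t}\,\dot{X}(t) + \nabla f(X(t)) = 0,$$

while plain gradient descent tracks the gradient flow $\dot{X}(t) = -\nabla f(X(t))$. The high-resolution ODEs studied here retain additional $O(\sqrt{s})$ gradient-correction terms that the $s \to 0$ limit discards, which is what lets them distinguish, for example, Nesterov's method from Polyak's heavy-ball method.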

Fixing by mixing: A recipe for optimal Byzantine ML under heterogeneity

Y Allouah, S Farhadkhani… - International …, 2023 - proceedings.mlr.press
Byzantine machine learning (ML) aims to ensure the resilience of distributed learning
algorithms to misbehaving (or Byzantine) machines. Although this problem received …
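
The basic primitive in this area is a robust aggregation rule that replaces plain averaging of the workers' gradients. A minimal sketch with one generic such rule, the coordinate-wise trimmed mean (used purely as an illustration; it is not the mixing-based recipe proposed in the paper):

import numpy as np

def coordinate_trimmed_mean(grads, f):
    """Per coordinate, drop the f largest and f smallest values and average the rest."""
    g = np.sort(np.stack(grads), axis=0)          # shape (num_workers, dim)
    return g[f:len(grads) - f].mean(axis=0)

rng = np.random.default_rng(0)
true_grad = np.ones(4)
honest = [true_grad + 0.1 * rng.standard_normal(4) for _ in range(8)]
byzantine = [np.full(4, 100.0) for _ in range(2)]          # two workers send arbitrary vectors

print(np.mean(honest + byzantine, axis=0))                 # plain averaging is ruined
print(coordinate_trimmed_mean(honest + byzantine, f=2))    # robust rule stays close to the truth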

Neural collapse with normalized features: A geometric analysis over the Riemannian manifold

C Yaras, P Wang, Z Zhu… - Advances in neural …, 2022 - proceedings.neurips.cc
When training overparameterized deep networks for classification tasks, it has been widely
observed that the learned features exhibit a so-called "neural collapse" phenomenon. More …
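
A quick numerical illustration of the geometry the snippet refers to (a self-contained check of the standard simplex-ETF description of collapse, not the paper's Riemannian analysis): at collapse, the K class-mean feature directions behave like unit vectors whose pairwise cosines all equal -1/(K-1).

import numpy as np

# Neural collapse, informally: late in training, the last-layer features of each class
# concentrate at their class mean, and the K class means spread out as a simplex
# equiangular tight frame (ETF): unit vectors with pairwise inner product -1/(K-1).
K = 4
M = np.eye(K) - np.ones((K, K)) / K               # center the standard basis ...
M /= np.linalg.norm(M, axis=1, keepdims=True)     # ... and renormalize: a simplex ETF in R^K

cos = M @ M.T
print(np.round(cos, 3))                           # off-diagonal entries are all -1/(K-1)
print(-1.0 / (K - 1))                             # = -0.333... for K = 4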

On the power of over-parametrization in neural networks with quadratic activation

S Du, J Lee - International Conference on Machine Learning, 2018 - proceedings.mlr.press
We provide new theoretical insights on why over-parametrization is effective in learning
neural networks. For a $k$-hidden-node shallow network with quadratic activation and $n$ …
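
A toy sketch of the setting (the sizes, step size, and teacher are illustrative choices): a shallow network f(x) = sum_j (w_j^T x)^2 with quadratic activation and k hidden units, deliberately over-parameterized, fitted by plain gradient descent from a small random initialization to data generated by a single quadratic unit.

import numpy as np

rng = np.random.default_rng(0)
n, d, k = 200, 10, 20                        # n samples, input dim d, k hidden units (over-parameterized)
X = rng.standard_normal((n, d))
w_star = rng.standard_normal(d)
y = (X @ w_star) ** 2                        # teacher: a single quadratic-activation unit

W = 0.01 * rng.standard_normal((k, d))       # student weights, small random initialization
eta = 0.003
for _ in range(2000):
    Z = X @ W.T                              # (n, k) pre-activations
    r = (Z ** 2).sum(axis=1) - y             # residuals of f(x) = sum_j (w_j . x)^2
    gW = 2.0 * (r[:, None] * Z).T @ X / n    # gradient of the mean squared error
    W -= eta * gW

pred = ((X @ W.T) ** 2).sum(axis=1)
print(np.mean((pred - y) ** 2) / np.mean(y ** 2))   # relative training error: small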