Optimization for deep learning: An overview
RY Sun - Journal of the Operations Research Society of China, 2020 - Springer
Optimization is a critical component in deep learning. We think optimization for neural
networks is an interesting topic for theoretical research due to various reasons. First, its …
DiffFit: Unlocking transferability of large diffusion models via simple parameter-efficient fine-tuning
Diffusion models have proven to be highly effective in generating high-quality images.
However, adapting large pre-trained diffusion models to new domains remains an open …
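Since the snippet only names the problem, a minimal PyTorch sketch of the parameter-efficient idea the title refers to may help: freeze the pre-trained weights and train only bias terms plus small learnable scale factors. The wrapper class, the bias-only rule, and the placement of the scale factors below are illustrative assumptions, not DiffFit's exact recipe.

```python
import torch
import torch.nn as nn

class ScaledLinear(nn.Module):
    """Frozen pre-trained linear layer with a learnable scale factor."""
    def __init__(self, pretrained: nn.Linear):
        super().__init__()
        self.inner = pretrained
        self.gamma = nn.Parameter(torch.ones(1))  # trained during adaptation

    def forward(self, x):
        return self.gamma * self.inner(x)

def prepare_for_finetuning(model: nn.Module) -> nn.Module:
    # Freeze every pre-trained parameter ...
    for p in model.parameters():
        p.requires_grad = False
    # ... then re-enable gradients for bias terms only.
    for name, p in model.named_parameters():
        if name.endswith("bias"):
            p.requires_grad = True
    # Wrap linear layers with learnable scales (placement is hypothetical).
    for name, child in model.named_children():
        if isinstance(child, nn.Linear):
            setattr(model, name, ScaledLinear(child))
        else:
            prepare_for_finetuning(child)
    return model
```

An optimizer would then see only the small trainable subset, e.g. `torch.optim.AdamW(p for p in model.parameters() if p.requires_grad)`, which is what makes the adaptation cheap relative to full fine-tuning.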
Optimization for deep learning: theory and algorithms
R Sun - arXiv preprint arXiv:1912.08957, 2019 - arxiv.org
When and why can a neural network be successfully trained? This article provides an
overview of optimization algorithms and theory for training neural networks. First, we discuss …
An improved analysis of training over-parameterized deep neural networks
A recent line of research has shown that gradient-based algorithms with random
initialization can converge to the global minima of the training loss for over-parameterized …
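As a quick illustration of the phenomenon described (not of the paper's analysis), the toy PyTorch run below trains a randomly initialized, heavily over-parameterized two-layer ReLU network on a small random dataset. Width, learning rate, and step count are arbitrary choices; with sufficient width the training loss typically decays toward zero, consistent with this line of convergence results.

```python
import torch

torch.manual_seed(0)
n, d, width = 20, 5, 4096                 # tiny dataset, very wide hidden layer
X, y = torch.randn(n, d), torch.randn(n)

# Random initialization; both layers are trained.
W = (torch.randn(width, d) / d ** 0.5).requires_grad_()
a = torch.randn(width).requires_grad_()

opt = torch.optim.SGD([W, a], lr=0.5)
for _ in range(2000):
    pred = torch.relu(X @ W.t()) @ a / width ** 0.5   # NTK-style output scaling
    loss = ((pred - y) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

print(f"final training loss: {loss.item():.2e}")       # approaches zero
```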
Fast convergence of natural gradient descent for over-parameterized neural networks
Natural gradient descent has proven very effective at mitigating the catastrophic effects of
pathological curvature in the objective function, but little is known theoretically about its …
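For readers who have not seen it, natural gradient descent preconditions the gradient with the inverse Fisher information matrix, theta <- theta - lr * F^{-1} * grad(L), which is what lets it cut through pathological curvature. Below is a small PyTorch sketch for a toy regression model, approximating F by the damped Gauss-Newton matrix J^T J / n (exact for a unit-variance Gaussian likelihood); all sizes and constants are illustrative.

```python
import torch

torch.manual_seed(0)
n, d = 100, 3
X = torch.randn(n, d)
theta_star = torch.tensor([1.0, -2.0, 0.5])
y = torch.tanh(X @ theta_star)             # realizable regression target

theta = torch.zeros(d, requires_grad=True)

def model(theta):
    return torch.tanh(X @ theta)           # one scalar output per example

lr, damping = 1.0, 1e-3
for _ in range(50):
    loss = 0.5 * ((model(theta) - y) ** 2).mean()
    grad = torch.autograd.grad(loss, theta)[0]
    # Per-example Jacobian of the model outputs w.r.t. theta, shape (n, d).
    J = torch.autograd.functional.jacobian(model, theta)
    F = J.t() @ J / n + damping * torch.eye(d)   # damped Fisher approximation
    with torch.no_grad():
        theta -= lr * torch.linalg.solve(F, grad)

print(f"loss after natural gradient descent: {loss.item():.2e}")
```

Forming and inverting F explicitly is only feasible for tiny models like this one; practical large-scale variants rely on structured approximations of F.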
Learning one-hidden-layer relu networks via gradient descent
We study the problem of learning one-hidden-layer neural networks with Rectified Linear
Unit (ReLU) activation function, where the inputs are sampled from standard Gaussian …
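The setting in this abstract is easy to reproduce as a toy teacher-student experiment: Gaussian inputs, labels from a planted one-hidden-layer ReLU network, and a same-shape student trained by full-batch gradient descent. The random initialization and hyperparameters below are illustrative; the paper's guarantees may depend on specific initialization and sample-size conditions.

```python
import torch

torch.manual_seed(0)
d, k, n = 10, 4, 5000
X = torch.randn(n, d)                          # inputs x ~ N(0, I_d)

W_star = torch.randn(k, d) / d ** 0.5          # planted teacher weights
y = torch.relu(X @ W_star.t()).sum(dim=1)      # noiseless labels

W = (torch.randn(k, d) / d ** 0.5).requires_grad_()   # student, random init
opt = torch.optim.SGD([W], lr=0.02)
for _ in range(3000):
    loss = ((torch.relu(X @ W.t()).sum(dim=1) - y) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

print(f"empirical risk after GD: {loss.item():.2e}")
```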
Generalization error bounds of gradient descent for learning over-parameterized deep relu networks
Empirical studies show that gradient-based methods can learn deep neural networks
(DNNs) with very good generalization performance in the over-parameterization regime …
From symmetry to geometry: Tractable nonconvex problems
As science and engineering have become increasingly data-driven, the role of optimization
has expanded to touch almost every stage of the data analysis pipeline, from signal and …
Provably learning a multi-head attention layer
The multi-head attention layer is one of the key components of the transformer architecture
that sets it apart from traditional feed-forward models. Given a sequence length $k$ …
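Since the snippet cuts off mid-sentence, it may help to recall the object being learned. The following PyTorch function is a standard multi-head attention forward pass for a single sequence of length $k$; it sketches the architecture only, not the paper's learning algorithm or its parameterization.

```python
import torch

def multi_head_attention(Xq, Xk, Xv, Wq, Wk, Wv, Wo, num_heads):
    """Xq, Xk, Xv: (seq_len, d_model); Wq, Wk, Wv, Wo: (d_model, d_model)."""
    seq_len, d_model = Xq.shape
    d_head = d_model // num_heads

    def split(x):  # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return x.view(seq_len, num_heads, d_head).transpose(0, 1)

    Q, K, V = split(Xq @ Wq), split(Xk @ Wk), split(Xv @ Wv)
    scores = Q @ K.transpose(-2, -1) / d_head ** 0.5   # scaled dot-product
    heads = torch.softmax(scores, dim=-1) @ V          # (heads, seq, d_head)
    concat = heads.transpose(0, 1).reshape(seq_len, d_model)
    return concat @ Wo                                 # output projection

k, d_model = 6, 8
X = torch.randn(k, d_model)
params = [torch.randn(d_model, d_model) / d_model ** 0.5 for _ in range(4)]
out = multi_head_attention(X, X, X, *params, num_heads=2)
print(out.shape)   # torch.Size([6, 8])
```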
Learning deep relu networks is fixed-parameter tractable
We consider the problem of learning an unknown ReLU network with respect to Gaussian
inputs and obtain the first nontrivial results for networks of depth more than two. We give an …