Generalization bounds: Perspectives from information theory and PAC-Bayes
A fundamental question in theoretical machine learning is generalization. Over the past
decades, the PAC-Bayesian approach has been established as a flexible framework to …
SGD with large step sizes learns sparse features
We showcase important features of the dynamics of the Stochastic Gradient Descent (SGD)
in the training of neural networks. We present empirical observations that commonly used …
PAC-Bayes compression bounds so tight that they can explain generalization
While there has been progress in developing non-vacuous generalization bounds for deep
neural networks, these bounds tend to be uninformative about why deep learning works. In …
When do flat minima optimizers work?
Recently, flat-minima optimizers, which seek to find parameters in low-loss neighborhoods,
have been shown to improve a neural network's generalization performance over stochastic …
(S)GD over Diagonal Linear Networks: Implicit Bias, Large Stepsizes and Edge of Stability
In this paper, we investigate the impact of stochasticity and large stepsizes on the implicit
regularisation of gradient descent (GD) and stochastic gradient descent (SGD) over $2 …
Can neural nets learn the same model twice? Investigating reproducibility and double descent from the decision boundary perspective
We discuss methods for visualizing neural network decision boundaries and decision
regions. We use these visualizations to investigate issues related to reproducibility and …
Subspace adversarial training
Single-step adversarial training (AT) has received wide attention as it proved to be both
efficient and robust. However, a serious problem of catastrophic overfitting exists, i.e., the …
Stochastic collapse: How gradient noise attracts SGD dynamics towards simpler subnetworks
In this work, we reveal a strong implicit bias of stochastic gradient descent (SGD) that drives
overly expressive networks to much simpler subnetworks, thereby dramatically reducing the …
Why neural networks find simple solutions: The many regularizers of geometric complexity
In many contexts, simpler models are preferable to more complex models and the control of
this model complexity is the goal for many methods in machine learning such as …
Noise is not the main factor behind the gap between SGD and Adam on transformers, but sign descent might be
The success of the Adam optimizer on a wide array of architectures has made it the default
in settings where stochastic gradient descent (SGD) performs poorly. However, our …