SOAP: Improving and Stabilizing Shampoo using Adam

N Vyas, D Morwani, R Zhao, I Shapira… - arXiv preprint arXiv …, 2024 - arxiv.org
There is growing evidence of the effectiveness of Shampoo, a higher-order preconditioning
method, over Adam in deep learning optimization tasks. However, Shampoo's drawbacks …
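
For readers not familiar with the method, Shampoo-style preconditioning keeps one Kronecker-factored statistic per side of each weight matrix and applies inverse fourth roots to the gradient. Below is a minimal NumPy sketch of a single such step; it illustrates plain Shampoo rather than the SOAP algorithm itself, and the learning rate and damping values are assumptions.

```python
import numpy as np

def shampoo_step(W, G, L, R, lr=1e-3, eps=1e-12):
    """One Shampoo-style step for a 2D weight W with gradient G.

    L (m, m) and R (n, n) accumulate the left/right Kronecker factors.
    Illustrative sketch only, not the SOAP authors' implementation.
    """
    L += G @ G.T                          # left statistic
    R += G.T @ G                          # right statistic
    dl, Ul = np.linalg.eigh(L)            # L and R are symmetric PSD
    dr, Ur = np.linalg.eigh(R)
    L_inv4 = Ul @ np.diag((np.maximum(dl, 0.0) + eps) ** -0.25) @ Ul.T
    R_inv4 = Ur @ np.diag((np.maximum(dr, 0.0) + eps) ** -0.25) @ Ur.T
    W -= lr * (L_inv4 @ G @ R_inv4)       # preconditioned update
    return W, L, R
```

Per the title, SOAP's contribution is to combine this kind of preconditioning with Adam to make it more stable; those details are in the paper, not in the sketch above.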

Approximated Orthogonal Projection Unit: Stabilizing Regression Network Training Using Natural Gradient

S Wang, C Yang, S Lou - arXiv preprint arXiv:2409.15393, 2024 - arxiv.org
Neural networks (NN) are extensively studied in cutting-edge soft sensor models due to their
feature extraction and function approximation capabilities. Current research into network …

Bayesian Online Natural Gradient (BONG)

M Jones, P Chang, K Murphy - arXiv preprint arXiv:2405.19681, 2024 - arxiv.org
We propose a novel approach to sequential Bayesian inference based on variational Bayes.
The key insight is that, in the online setting, we do not need to add the KL term to regularize …
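
One way to make that insight concrete (the notation below is mine, not taken from the paper): in online VB the previous posterior plays the role of the prior, and the usual step-t objective carries a KL term back to it. The sketch contrasts that objective with a single natural-gradient step on the expected log-likelihood alone, initialized at the previous posterior, so the regularization enters through the starting point rather than through an explicit KL penalty.

```latex
% Standard online VB objective at step t (previous posterior as prior):
\mathcal{L}_t(\psi) \;=\; \mathbb{E}_{q_\psi}\!\left[-\log p(y_t \mid \theta)\right]
  \;+\; \mathrm{KL}\!\left(q_\psi \,\Vert\, q_{\psi_{t-1}}\right)
% Sketch of dropping the KL term: one natural-gradient step on the expected
% log-likelihood, started from the previous posterior (F is the Fisher matrix
% of the variational family):
\psi_t \;=\; \psi_{t-1} \;+\; \eta\, F(\psi_{t-1})^{-1}\,
  \nabla_\psi\, \mathbb{E}_{q_\psi}\!\left[\log p(y_t \mid \theta)\right]\Big|_{\psi=\psi_{t-1}}
```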

Stein Variational Newton Neural Network Ensembles

K Flöge, MA Moeed, V Fortuin - arXiv preprint arXiv:2411.01887, 2024 - arxiv.org
Deep neural network ensembles are powerful tools for uncertainty quantification, which
have recently been re-interpreted from a Bayesian perspective. However, current methods …

On The Concurrence of Layer-wise Preconditioning Methods and Provable Feature Learning

TT Zhang, B Moniri, A Nagwekar, F Rahman… - arXiv preprint arXiv …, 2025 - arxiv.org
Layer-wise preconditioning methods are a family of memory-efficient optimization algorithms
that introduce preconditioners per axis of each layer's weight tensors. These methods have …
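
Concretely, "a preconditioner per axis" means keeping one small square matrix per mode of a layer's weight tensor and contracting each along its own axis, which is where the memory savings come from. The sketch below shows only that generic structure, assuming nothing about how the per-axis matrices are built; it is not the cited paper's construction (no statistics updates or inverse roots are shown).

```python
import numpy as np

def per_axis_precondition(G, precs):
    """Apply one (d_i, d_i) preconditioner along each axis i of tensor G."""
    out = G
    for axis, P in enumerate(precs):
        # contract P with mode `axis` of `out`, then restore the axis order
        out = np.moveaxis(np.tensordot(P, out, axes=([1], [axis])), 0, axis)
    return out

# Memory comparison motivating the per-axis structure: a full preconditioner
# over a flattened (d1*...*dk)-dimensional layer needs (prod d_i)**2 entries,
# while the per-axis factors need only sum(d_i**2).
```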

AdaFisher: Adaptive Second Order Optimization via Fisher Information

DM Gomes, Y Zhang, E Belilovsky, G Wolf… - arXiv preprint arXiv …, 2024 - arxiv.org
First-order optimization methods are currently the mainstream in training deep neural
networks (DNNs). Optimizers like Adam incorporate limited curvature information by …
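
The "limited curvature information" in Adam is its diagonal second-moment estimate, which acts as a rough diagonal, empirical curvature proxy. The sketch below shows the standard Adam step for contrast; it is not AdaFisher, which per its title builds the preconditioner from Fisher information instead.

```python
import numpy as np

def adam_step(w, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """Standard Adam update; v is the diagonal curvature proxy."""
    m = b1 * m + (1 - b1) * g          # first moment (momentum)
    v = b2 * v + (1 - b2) * g * g      # second moment ~ diagonal empirical Fisher
    m_hat = m / (1 - b1 ** t)          # bias correction
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```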

Position: Curvature Matrices Should Be Democratized via Linear Operators

F Dangel, R Eschenhagen, W Ormaniec… - arXiv preprint arXiv …, 2025 - arxiv.org
Structured large matrices are prevalent in machine learning. A particularly important class is
curvature matrices like the Hessian, which are central to understanding the loss landscape …
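
The "linear operator" framing means exposing a curvature matrix only through matrix-vector products instead of materializing it. A minimal sketch, assuming SciPy's LinearOperator and a finite-difference Hessian-vector product (a real setup would use autodiff); the toy quadratic loss and the helper names are illustrative only.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

def make_hessian_operator(grad_fn, w, eps=1e-5):
    """Matrix-free Hessian at w, exposed only via Hessian-vector products."""
    n = w.size
    def hvp(v):
        # central finite difference of gradients approximates H @ v
        return (grad_fn(w + eps * v) - grad_fn(w - eps * v)) / (2 * eps)
    return LinearOperator((n, n), matvec=hvp)

# Toy quadratic loss 0.5 * w^T A w - b^T w, whose gradient is A w - b.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
grad_fn = lambda w: A @ w - b

w0 = np.zeros(2)
H = make_hessian_operator(grad_fn, w0)
newton_step, _ = cg(H, grad_fn(w0))   # solve H x = grad without forming H
```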

Fast Fractional Natural Gradient Descent using Learnable Spectral Factorizations

W Lin, F Dangel, R Eschenhagen, J Bae, RE Turner… - openreview.net
Many popular optimization methods can be united through fractional natural gradient
descent (FNGD), which pre-conditions the gradient with a fractional power of the inverse …
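
Written out (notation mine, not necessarily the paper's), the family the snippet describes preconditions the gradient with a fractional power of the inverse Fisher, so that different exponents recover different familiar optimizers.

```latex
% Fractional natural gradient descent step, as the snippet describes it:
\theta_{t+1} \;=\; \theta_t \;-\; \eta\, F_t^{-\alpha}\, \nabla_\theta \mathcal{L}(\theta_t)
% \alpha = 1 is natural gradient descent; \alpha = 1/2 with a diagonal
% empirical Fisher corresponds to Adam-style square-root preconditioning.
```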