Understanding and minimising outlier features in neural network training

B He, L Noci, D Paliotta, I Schlag… - arXiv preprint arXiv …, 2024 - arxiv.org
Outlier Features (OFs) are neurons whose activation magnitudes significantly exceed the
average over a neural network's (NN) width. They are well known to emerge during standard …
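
As a rough illustration of this definition (not the paper's exact metric, whose thresholds and aggregation may differ), one can flag neurons whose RMS activation exceeds the mean RMS over the layer width by some factor; the function name and ratio below are illustrative assumptions.

# Minimal sketch: flag "outlier features" as neurons whose RMS activation
# greatly exceeds the mean RMS across the layer width. Illustrative only.
import numpy as np

def outlier_features(acts: np.ndarray, ratio: float = 10.0) -> np.ndarray:
    """acts: (batch, width) activations of one layer.
    Returns indices of neurons whose RMS activation is `ratio` times the mean."""
    rms = np.sqrt((acts ** 2).mean(axis=0))          # per-neuron RMS over the batch
    return np.flatnonzero(rms > ratio * rms.mean())  # neurons far above the average

acts = np.random.randn(256, 1024)
acts[:, 7] *= 50.0                                   # inject one artificial outlier neuron
print(outlier_features(acts))                        # -> [7]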

Cautious optimizers: Improving training with one line of code

K Liang, L Chen, B Liu, Q Liu - arXiv preprint arXiv:2411.16085, 2024 - arxiv.org
AdamW has been the default optimizer for transformer pretraining. For many years, our
community has searched for faster and more stable optimizers, with only constrained positive …
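
To my reading, the "one line of code" refers to masking out update components whose sign disagrees with the current gradient before applying them. The sketch below layers that mask on a hand-rolled Adam step; it is an illustrative reading, not the authors' implementation, and the rescaling detail is an assumption.

# Hedged sketch of the "cautious" masking idea on top of a plain Adam step:
# zero out update components whose sign disagrees with the current gradient.
# Not the authors' code; names and the normalization are illustrative.
import numpy as np

def cautious_adam_step(theta, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g**2
    m_hat = m / (1 - b1**t)
    v_hat = v / (1 - b2**t)
    u = m_hat / (np.sqrt(v_hat) + eps)      # standard Adam update direction
    mask = (u * g > 0).astype(u.dtype)      # the "one line": keep only aligned components
    mask /= max(mask.mean(), eps)           # rescale so the average step size is preserved
    return theta - lr * u * mask, m, v

Because only the disagreeing coordinates are zeroed, the change can be dropped into an existing optimizer loop without touching the rest of the update.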

2 OLMo 2 Furious

T OLMo, P Walsh, L Soldaini, D Groeneveld… - arXiv preprint arXiv …, 2024 - arxiv.org
We present OLMo 2, the next generation of our fully open language models. OLMo 2
includes dense autoregressive models with improved architecture and training recipe …

Grams: Gradient descent with adaptive momentum scaling

Y Cao, X Li, Z Song - arXiv preprint arXiv:2412.17107, 2024 - arxiv.org
We introduce Gradient Descent with Adaptive Momentum Scaling (Grams), a novel
optimization algorithm that decouples the direction and …
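
As I read the abstract, Grams takes the per-coordinate step magnitude from the adaptive moments but the step direction from the sign of the current gradient. The sketch below encodes that reading on top of a plain Adam-style step; the exact update rule in the paper may differ.

# Hedged reading of the abstract: take the step *magnitude* from Adam's moments
# but the per-coordinate *direction* from the current gradient's sign.
# Illustrative only; not the authors' implementation.
import numpy as np

def grams_like_step(theta, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g**2
    adam_update = (m / (1 - b1**t)) / (np.sqrt(v / (1 - b2**t)) + eps)
    step = np.sign(g) * np.abs(adam_update)   # direction from the gradient, magnitude from the moments
    return theta - lr * step, m, v

Under this reading, the adaptive moments only set how far each coordinate moves, never which way.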

Gradient Alignment in Physics-informed Neural Networks: A Second-Order Optimization Perspective

S Wang, AK Bhartari, B Li, P Perdikaris - arXiv preprint arXiv:2502.00604, 2025 - arxiv.org
Multi-task learning through composite loss functions is fundamental to modern deep
learning, yet optimizing competing objectives remains challenging. We present new …

Avoiding spurious sharpness minimization broadens applicability of SAM

SP Singh, H Mobahi, A Agarwala… - arXiv preprint arXiv …, 2025 - arxiv.org
Curvature regularization techniques like Sharpness Aware Minimization (SAM) have shown
great promise in improving generalization on vision tasks. However, we find that SAM …
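
For context, the baseline SAM step this paper builds on perturbs the weights along the normalized gradient and then descends using the gradient taken at the perturbed point. The sketch below shows that standard two-step update on a toy quadratic; the radius rho and the loss are illustrative, and this is not the paper's proposed variant.

# Sketch of the standard SAM step (not the paper's modified variant):
# 1) ascend to w + rho * g/||g||, 2) descend using the gradient at that point.
import numpy as np

def sam_step(w, grad_fn, lr=0.1, rho=0.05):
    g = grad_fn(w)
    w_adv = w + rho * g / (np.linalg.norm(g) + 1e-12)  # perturb toward higher loss
    return w - lr * grad_fn(w_adv)                     # descend with the perturbed gradient

# toy quadratic example: loss = 0.5 * ||w||^2, so grad = w
w = np.array([1.0, -2.0])
for _ in range(10):
    w = sam_step(w, lambda x: x)
print(w)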

Moonshine: Speech Recognition for Live Transcription and Voice Commands

N Jeffries, E King, M Kudlur, G Nicholson… - arXiv preprint arXiv …, 2024 - arxiv.org
This paper introduces Moonshine, a family of speech recognition models optimized for live
transcription and voice command processing. Moonshine is based on an encoder-decoder …

Physics of Skill Learning

Z Liu, Y Liu, EJ Michaud, J Gore, M Tegmark - arXiv preprint arXiv …, 2025 - arxiv.org
We aim to understand the physics of skill learning, i.e., how skills are learned in neural
networks during training. We start by observing the Domino effect, i.e., skills are learned sequentially …

On The Concurrence of Layer-wise Preconditioning Methods and Provable Feature Learning

TT Zhang, B Moniri, A Nagwekar, F Rahman… - arXiv preprint arXiv …, 2025 - arxiv.org
Layer-wise preconditioning methods are a family of memory-efficient optimization algorithms
that introduce preconditioners per axis of each layer's weight tensors. These methods have …
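
One well-known instance of a per-axis preconditioner for a weight matrix is a Shampoo-style update, with separate statistics for the row and column axes. The sketch below illustrates that family under this assumption; it is not necessarily one of the specific methods the paper studies.

# Hedged sketch of a Shampoo-style layer-wise preconditioner for a weight matrix:
# one preconditioner per axis (rows and columns), built from gradient statistics.
import numpy as np

def matrix_power(sym, p, eps=1e-6):
    """Fractional power of a symmetric PSD matrix via eigendecomposition."""
    vals, vecs = np.linalg.eigh(sym)
    return (vecs * np.maximum(vals, eps) ** p) @ vecs.T

def shampoo_like_step(W, G, L, R, lr=1e-2):
    L = L + G @ G.T                     # row-axis statistics
    R = R + G.T @ G                     # column-axis statistics
    update = matrix_power(L, -0.25) @ G @ matrix_power(R, -0.25)
    return W - lr * update, L, R

W = np.random.randn(64, 32)
L, R = np.eye(64), np.eye(32)
G = np.random.randn(*W.shape)           # gradient for one step
W, L, R = shampoo_like_step(W, G, L, R)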

Improving Adaptive Moment Optimization via Preconditioner Diagonalization

S Nguyen, B Liu, L Chen, Q Liu - arXiv preprint arXiv:2502.07488, 2025 - arxiv.org
Modern adaptive optimization methods, such as Adam and its variants, have emerged as the
most widely used tools in deep learning over recent years. These algorithms offer automatic …
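
Reading the title, the idea appears to be rotating gradients into a basis in which the accumulated second-moment statistics are (nearly) diagonal, taking a diagonal Adam-style step there, and rotating back. The sketch below is a hedged, one-sided illustration of that idea for a weight matrix; the transform, its update schedule, and all names are assumptions rather than the paper's algorithm.

# Hedged sketch of "diagonalize the preconditioner": rotate gradients into an
# eigenbasis of accumulated gradient statistics, run diagonal Adam there, rotate back.
import numpy as np

def diagonalized_adam_step(W, G, stats, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    stats = stats + G @ G.T                      # accumulate row-space gradient statistics
    _, Q = np.linalg.eigh(stats)                 # basis that diagonalizes the statistics
    Gr = Q.T @ G                                 # gradient in the rotated basis
    m = b1 * m + (1 - b1) * Gr
    v = b2 * v + (1 - b2) * Gr**2
    step = (m / (1 - b1**t)) / (np.sqrt(v / (1 - b2**t)) + eps)
    return W - lr * (Q @ step), stats, m, v      # rotate the step back to parameter space

W = np.random.randn(64, 32)
stats, m, v = np.eye(64), np.zeros_like(W), np.zeros_like(W)
W, stats, m, v = diagonalized_adam_step(W, np.random.randn(64, 32), stats, m, v, t=1)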