Understanding and minimising outlier features in neural network training

B He, L Noci, D Paliotta, I Schlag… - arXiv preprint arXiv …, 2024 - arxiv.org
Outlier Features (OFs) are neurons whose activation magnitudes significantly exceed the
average over a neural network's (NN) width. They are well known to emerge during standard …
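
As a rough illustration of this definition (not the paper's exact metric, whose thresholds and aggregation may differ), one can flag neurons whose RMS activation exceeds the mean RMS over the layer width by some factor; the function name and ratio below are illustrative assumptions.

# Minimal sketch: flag "outlier features" as neurons whose RMS activation
# greatly exceeds the mean RMS across the layer width. Illustrative only.
import numpy as np

def outlier_features(acts: np.ndarray, ratio: float = 10.0) -> np.ndarray:
    """acts: (batch, width) activations of one layer.
    Returns indices of neurons whose RMS activation is `ratio` times the mean."""
    rms = np.sqrt((acts ** 2).mean(axis=0))          # per-neuron RMS over the batch
    return np.flatnonzero(rms > ratio * rms.mean())  # neurons far above the average

acts = np.random.randn(256, 1024)
acts[:, 7] *= 50.0                                   # inject one artificial outlier neuron
print(outlier_features(acts))                        # -> [7]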

Cautious optimizers: Improving training with one line of code

K Liang, L Chen, B Liu, Q Liu - arXiv preprint arXiv:2411.16085, 2024 - arxiv.org
AdamW has been the default optimizer for transformer pretraining. For many years, our
community has searched for faster and more stable optimizers, with only constrained positive …
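
To my reading, the "one line of code" refers to masking out update components whose sign disagrees with the current gradient before applying them. The sketch below layers that mask on a hand-rolled Adam step; it is an illustrative reading, not the authors' implementation, and the rescaling detail is an assumption.

# Hedged sketch of the "cautious" masking idea on top of a plain Adam step:
# zero out update components whose sign disagrees with the current gradient.
# Not the authors' code; names and the normalization are illustrative.
import numpy as np

def cautious_adam_step(theta, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g**2
    m_hat = m / (1 - b1**t)
    v_hat = v / (1 - b2**t)
    u = m_hat / (np.sqrt(v_hat) + eps)      # standard Adam update direction
    mask = (u * g > 0).astype(u.dtype)      # the "one line": keep only aligned components
    mask /= max(mask.mean(), eps)           # rescale so the average step size is preserved
    return theta - lr * u * mask, m, v

Because only the disagreeing coordinates are zeroed, the change can be dropped into an existing optimizer loop without touching the rest of the update.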

2 OLMo 2 Furious

T OLMo, P Walsh, L Soldaini, D Groeneveld… - arXiv preprint arXiv …, 2024 - arxiv.org
We present OLMo 2, the next generation of our fully open language models. OLMo 2
includes dense autoregressive models with improved architecture and training recipe …

Grams: Gradient descent with adaptive momentum scaling

Y Cao, X Li, Z Song - arXiv preprint arXiv:2412.17107, 2024 - arxiv.org
We introduce Gradient Descent with Adaptive Momentum Scaling (Grams), a novel
optimization algorithm that decouples the direction and …
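
As I read the abstract, Grams takes the per-coordinate step magnitude from the adaptive moments but the step direction from the sign of the current gradient. The sketch below encodes that reading on top of a plain Adam-style step; the exact update rule in the paper may differ.

# Hedged reading of the abstract: take the step *magnitude* from Adam's moments
# but the per-coordinate *direction* from the current gradient's sign.
# Illustrative only; not the authors' implementation.
import numpy as np

def grams_like_step(theta, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g**2
    adam_update = (m / (1 - b1**t)) / (np.sqrt(v / (1 - b2**t)) + eps)
    step = np.sign(g) * np.abs(adam_update)   # direction from the gradient, magnitude from the moments
    return theta - lr * step, m, v

Under this reading, the adaptive moments only set how far each coordinate moves, never which way.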

Gradient Alignment in Physics-informed Neural Networks: A Second-Order Optimization Perspective

S Wang, AK Bhartari, B Li, P Perdikaris - arXiv preprint arXiv:2502.00604, 2025 - arxiv.org
Multi-task learning through composite loss functions is fundamental to modern deep
learning, yet optimizing competing objectives remains challenging. We present new …

Avoiding spurious sharpness minimization broadens applicability of SAM

SP Singh, H Mobahi, A Agarwala… - arXiv preprint arXiv …, 2025 - arxiv.org
Curvature regularization techniques like Sharpness Aware Minimization (SAM) have shown
great promise in improving generalization on vision tasks. However, we find that SAM …
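
For context, the baseline SAM step this paper builds on perturbs the weights along the normalized gradient and then descends using the gradient taken at the perturbed point. The sketch below shows that standard two-step update on a toy quadratic; the radius rho and the loss are illustrative, and this is not the paper's proposed variant.

# Sketch of the standard SAM step (not the paper's modified variant):
# 1) ascend to w + rho * g/||g||, 2) descend using the gradient at that point.
import numpy as np

def sam_step(w, grad_fn, lr=0.1, rho=0.05):
    g = grad_fn(w)
    w_adv = w + rho * g / (np.linalg.norm(g) + 1e-12)  # perturb toward higher loss
    return w - lr * grad_fn(w_adv)                     # descend with the perturbed gradient

# toy quadratic example: loss = 0.5 * ||w||^2, so grad = w
w = np.array([1.0, -2.0])
for _ in range(10):
    w = sam_step(w, lambda x: x)
print(w)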

Moonshine: Speech Recognition for Live Transcription and Voice Commands

N Jeffries, E King, M Kudlur, G Nicholson… - arXiv preprint arXiv …, 2024 - arxiv.org
This paper introduces Moonshine, a family of speech recognition models optimized for live
transcription and voice command processing. Moonshine is based on an encoder-decoder …

Physics of Skill Learning

Z Liu, Y Liu, EJ Michaud, J Gore, M Tegmark - arXiv preprint arXiv …, 2025 - arxiv.org
We aim to understand the physics of skill learning, i.e., how skills are learned in neural
networks during training. We start by observing the Domino effect, i.e., skills are learned sequentially …

On The Concurrence of Layer-wise Preconditioning Methods and Provable Feature Learning

TT Zhang, B Moniri, A Nagwekar, F Rahman… - arXiv preprint arXiv …, 2025 - arxiv.org
Layer-wise preconditioning methods are a family of memory-efficient optimization algorithms
that introduce preconditioners per axis of each layer's weight tensors. These methods have …
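
One well-known instance of a per-axis preconditioner for a weight matrix is a Shampoo-style update, with separate statistics for the row and column axes. The sketch below illustrates that family under this assumption; it is not necessarily one of the specific methods the paper studies.

# Hedged sketch of a Shampoo-style layer-wise preconditioner for a weight matrix:
# one preconditioner per axis (rows and columns), built from gradient statistics.
import numpy as np

def matrix_power(sym, p, eps=1e-6):
    """Fractional power of a symmetric PSD matrix via eigendecomposition."""
    vals, vecs = np.linalg.eigh(sym)
    return (vecs * np.maximum(vals, eps) ** p) @ vecs.T

def shampoo_like_step(W, G, L, R, lr=1e-2):
    L = L + G @ G.T                     # row-axis statistics
    R = R + G.T @ G                     # column-axis statistics
    update = matrix_power(L, -0.25) @ G @ matrix_power(R, -0.25)
    return W - lr * update, L, R

W = np.random.randn(64, 32)
L, R = np.eye(64), np.eye(32)
G = np.random.randn(*W.shape)           # gradient for one step
W, L, R = shampoo_like_step(W, G, L, R)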

Improving Adaptive Moment Optimization via Preconditioner Diagonalization

S Nguyen, B Liu, L Chen, Q Liu - arXiv preprint arXiv:2502.07488, 2025 - arxiv.org
Modern adaptive optimization methods, such as Adam and its variants, have emerged as the
most widely used tools in deep learning over recent years. These algorithms offer automatic …
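
Reading the title, the idea appears to be rotating gradients into a basis in which the accumulated second-moment statistics are (nearly) diagonal, taking a diagonal Adam-style step there, and rotating back. The sketch below is a hedged, one-sided illustration of that idea for a weight matrix; the transform, its update schedule, and all names are assumptions rather than the paper's algorithm.

# Hedged sketch of "diagonalize the preconditioner": rotate gradients into an
# eigenbasis of accumulated gradient statistics, run diagonal Adam there, rotate back.
import numpy as np

def diagonalized_adam_step(W, G, stats, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    stats = stats + G @ G.T                      # accumulate row-space gradient statistics
    _, Q = np.linalg.eigh(stats)                 # basis that diagonalizes the statistics
    Gr = Q.T @ G                                 # gradient in the rotated basis
    m = b1 * m + (1 - b1) * Gr
    v = b2 * v + (1 - b2) * Gr**2
    step = (m / (1 - b1**t)) / (np.sqrt(v / (1 - b2**t)) + eps)
    return W - lr * (Q @ step), stats, m, v      # rotate the step back to parameter space

W = np.random.randn(64, 32)
stats, m, v = np.eye(64), np.zeros_like(W), np.zeros_like(W)
W, stats, m, v = diagonalized_adam_step(W, np.random.randn(64, 32), stats, m, v, t=1)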