Pythia: A suite for analyzing large language models across training and scaling

S Biderman, H Schoelkopf… - International …, 2023 - proceedings.mlr.press
How do large language models (LLMs) develop and evolve over the course of training?
How do these patterns change as models scale? To answer these questions, we introduce …
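
A minimal sketch of the kind of across-training analysis the suite enables, assuming the Hugging Face transformers library and the publicly released EleutherAI/pythia-* checkpoints, which expose intermediate training steps as repository revisions; the model size, step names, and prompt below are illustrative choices, not prescribed by the paper.

# Sketch: load an early and a late training checkpoint of a Pythia model
# and compare their loss on the same prompt (loss should drop over training).
# Revision names such as "step3000" follow the checkpoint naming used on the
# public model cards; adjust them to whichever steps you want to inspect.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "EleutherAI/pythia-160m"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
inputs = tokenizer("The capital of France is", return_tensors="pt")

for revision in ("step3000", "step143000"):
    model = AutoModelForCausalLM.from_pretrained(MODEL, revision=revision)
    model.eval()
    with torch.no_grad():
        out = model(**inputs, labels=inputs["input_ids"])
    print(revision, float(out.loss))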

DeepSeek LLM: Scaling open-source language models with longtermism

X Bi, D Chen, G Chen, S Chen, D Dai, C Deng… - arXiv preprint arXiv …, 2024 - arxiv.org
The rapid development of open-source large language models (LLMs) has been truly
remarkable. However, the scaling law described in previous literature presents varying …

Training compute-optimal large language models

J Hoffmann, S Borgeaud, A Mensch… - arXiv preprint arXiv …, 2022 - arxiv.org
We investigate the optimal model size and number of tokens for training a transformer
language model under a given compute budget. We find that current large language models …
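
As a rough, hedged illustration of the paper's headline finding that model size and data should be scaled together, the sketch below uses two commonly quoted approximations rather than the paper's fitted coefficients: training compute C ≈ 6ND for a dense transformer with N parameters trained on D tokens, and a data budget of roughly 20 tokens per parameter.

# Rough compute-optimal allocation under two common approximations:
#   C ~= 6 * N * D   (training FLOPs of a dense transformer)
#   D ~= 20 * N      (tokens per parameter; a rule of thumb, not a fitted constant)
import math

def rough_optimal_allocation(flops_budget):
    n_params = math.sqrt(flops_budget / (6 * 20))  # solve C = 6 * N * (20 * N) for N
    n_tokens = 20 * n_params
    return n_params, n_tokens

if __name__ == "__main__":
    for c in (1e21, 1e23, 5.76e23):                # last budget is roughly Chinchilla-scale
        n, d = rough_optimal_allocation(c)
        print(f"C={c:.1e}  N~{n:.1e} params  D~{d:.1e} tokens")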

Studying large language model generalization with influence functions

R Grosse, J Bae, C Anil, N Elhage, A Tamkin… - arXiv preprint arXiv …, 2023 - arxiv.org
When trying to gain better visibility into a machine learning model in order to understand and
mitigate the associated risks, a potentially valuable source of evidence is: which training …
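
For background, the classical influence-function quantity the snippet alludes to can be written as below; this is the standard formulation (sign conventions vary), not a quotation from the paper, and in practice the Hessian H is replaced by a damped or factored approximation such as EK-FAC rather than inverted exactly.

% Influence of a training example z_m on the loss of a query z_q,
% evaluated at the trained parameters \theta^{*}; H is the Hessian of the
% training objective (approximated in practice).
\mathcal{I}(z_m, z_q) = \nabla_\theta \mathcal{L}(z_q, \theta^{*})^{\top} \, H^{-1} \, \nabla_\theta \mathcal{L}(z_m, \theta^{*})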

Why Transformers need Adam: A Hessian perspective

Y Zhang, C Chen, T Ding, Z Li… - Advances in Neural …, 2025 - proceedings.neurips.cc
SGD performs worse than Adam by a significant margin on Transformers, but the reason
remains unclear. In this work, we provide an explanation through the lens of the Hessian: (i) …
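
For reference, the two update rules being contrasted are the standard textbook forms below (included only as background; the paper's contribution is the Hessian-based analysis, not these formulas). Adam's per-coordinate rescaling by the second-moment estimate is what lets it cope with parameter blocks whose curvature differs widely.

% SGD step:
\theta_{t+1} = \theta_t - \eta \, g_t
% Adam step (with bias-corrected moment estimates \hat{m}_t, \hat{v}_t):
m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t, \quad
v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2, \quad
\theta_{t+1} = \theta_t - \eta \, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}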

An empirical analysis of compute-optimal large language model training

J Hoffmann, S Borgeaud, A Mensch… - Advances in neural …, 2022 - proceedings.neurips.cc
We investigate the optimal model size and number of tokens for training a transformer
language model under a given compute budget. We find that current large language models …
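
Complementing the allocation sketch above, the parametric form fitted in this analysis for the final loss as a function of parameters N and training tokens D is usually written as below; E, A, B, alpha, beta are fitted constants (values omitted here), with E the irreducible loss and the two power-law terms capturing finite-model and finite-data error.

\hat{L}(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}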

Scaling laws for neural language models

J Kaplan, S McCandlish, T Henighan, TB Brown… - arXiv preprint arXiv …, 2020 - arxiv.org
We study empirical scaling laws for language model performance on the cross-entropy loss.
The loss scales as a power-law with model size, dataset size, and the amount of compute …
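
The power laws the abstract refers to are typically summarized as below, where N_c, D_c, C_c are fitted scale constants; the exponents quoted are the approximate values reported in the paper and should be treated as indicative rather than exact.

L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N}, \quad
L(D) = \left(\frac{D_c}{D}\right)^{\alpha_D}, \quad
L(C_{\min}) = \left(\frac{C_c}{C_{\min}}\right)^{\alpha_C},
\qquad \alpha_N \approx 0.076, \; \alpha_D \approx 0.095, \; \alpha_C \approx 0.050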

Lookahead optimizer: k steps forward, 1 step back

M Zhang, J Lucas, J Ba… - Advances in neural …, 2019 - proceedings.neurips.cc
The vast majority of successful deep neural networks are trained using variants of stochastic
gradient descent (SGD) algorithms. Recent attempts to improve SGD can be broadly …
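
A minimal NumPy sketch of the update scheme named in the title: an inner "fast" optimizer (plain SGD here) takes k steps, then the outer "slow" weights move a fraction alpha toward the fast weights and the fast weights are reset. The toy quadratic objective and the hyperparameter values are illustrative, not taken from the paper.

# Lookahead on a toy quadratic loss L(w) = 0.5 * ||w||^2 (optimum at w = 0).
import numpy as np

def grad(w):
    return w                                # gradient of the toy quadratic loss

def lookahead_sgd(w0, lr=0.1, k=5, alpha=0.5, outer_steps=20):
    slow = np.array(w0, dtype=float)
    for _ in range(outer_steps):
        fast = slow.copy()
        for _ in range(k):                  # k fast steps with the inner optimizer
            fast -= lr * grad(fast)
        slow += alpha * (fast - slow)       # slow step: interpolate toward fast weights
    return slow

if __name__ == "__main__":
    print(lookahead_sgd(np.ones(3)))        # converges toward the optimum at 0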

Resolving discrepancies in compute-optimal scaling of language models

T Porian, M Wortsman, J Jitsev… - Advances in Neural …, 2025 - proceedings.neurips.cc
Kaplan et al. and Hoffmann et al. developed influential scaling laws for the optimal model
size as a function of the compute budget, but these laws yield substantially different …

Gradient norm aware minimization seeks first-order flatness and improves generalization

X Zhang, R Xu, H Yu, H Zou… - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
Recently, flat minima have been shown to be effective for improving generalization, and sharpness-
aware minimization (SAM) achieves state-of-the-art performance. Yet the current definition of …
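
As background for the snippet, the sharpness-aware minimization objective it references is the min-max problem below over an L2 perturbation ball of radius rho; the paper's own "first-order flatness" criterion instead constrains the gradient norm within that neighborhood, so the formula is context for SAM rather than the proposed GAM objective.

% SAM: minimize the worst-case loss within an L2 ball of radius \rho
% around the current parameters \theta.
\min_{\theta} \; \max_{\|\epsilon\|_2 \le \rho} \; L(\theta + \epsilon)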