Pythia: A suite for analyzing large language models across training and scaling

S Biderman, H Schoelkopf… - International …, 2023 - proceedings.mlr.press
How do large language models (LLMs) develop and evolve over the course of training?
How do these patterns change as models scale? To answer these questions, we introduce …
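
A minimal sketch of the kind of across-training analysis the suite enables, assuming the Hugging Face transformers library and the publicly released EleutherAI/pythia-* checkpoints, which expose intermediate training steps as repository revisions; the model size, step names, and prompt below are illustrative choices, not prescribed by the paper.

# Sketch: load an early and a late training checkpoint of a Pythia model
# and compare their loss on the same prompt (loss should drop over training).
# Revision names such as "step3000" follow the checkpoint naming used on the
# public model cards; adjust them to whichever steps you want to inspect.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "EleutherAI/pythia-160m"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
inputs = tokenizer("The capital of France is", return_tensors="pt")

for revision in ("step3000", "step143000"):
    model = AutoModelForCausalLM.from_pretrained(MODEL, revision=revision)
    model.eval()
    with torch.no_grad():
        out = model(**inputs, labels=inputs["input_ids"])
    print(revision, float(out.loss))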

DeepSeek LLM: Scaling open-source language models with longtermism

X Bi, D Chen, G Chen, S Chen, D Dai, C Deng… - arXiv preprint arXiv …, 2024 - arxiv.org
The rapid development of open-source large language models (LLMs) has been truly
remarkable. However, the scaling law described in previous literature presents varying …

Training compute-optimal large language models

J Hoffmann, S Borgeaud, A Mensch… - arXiv preprint arXiv …, 2022 - arxiv.org
We investigate the optimal model size and number of tokens for training a transformer
language model under a given compute budget. We find that current large language models …
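
As a rough, hedged illustration of the paper's headline finding that model size and data should be scaled together, the sketch below uses two commonly quoted approximations rather than the paper's fitted coefficients: training compute C ≈ 6ND for a dense transformer with N parameters trained on D tokens, and a data budget of roughly 20 tokens per parameter.

# Rough compute-optimal allocation under two common approximations:
#   C ~= 6 * N * D   (training FLOPs of a dense transformer)
#   D ~= 20 * N      (tokens per parameter; a rule of thumb, not a fitted constant)
import math

def rough_optimal_allocation(flops_budget):
    n_params = math.sqrt(flops_budget / (6 * 20))  # solve C = 6 * N * (20 * N) for N
    n_tokens = 20 * n_params
    return n_params, n_tokens

if __name__ == "__main__":
    for c in (1e21, 1e23, 5.76e23):                # last budget is roughly Chinchilla-scale
        n, d = rough_optimal_allocation(c)
        print(f"C={c:.1e}  N~{n:.1e} params  D~{d:.1e} tokens")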

Studying large language model generalization with influence functions

R Grosse, J Bae, C Anil, N Elhage, A Tamkin… - arXiv preprint arXiv …, 2023 - arxiv.org
When trying to gain better visibility into a machine learning model in order to understand and
mitigate the associated risks, a potentially valuable source of evidence is: which training …
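
For background, the classical influence-function quantity the snippet alludes to can be written as below; this is the standard formulation (sign conventions vary), not a quotation from the paper, and in practice the Hessian H is replaced by a damped or factored approximation such as EK-FAC rather than inverted exactly.

% Influence of a training example z_m on the loss of a query z_q,
% evaluated at the trained parameters \theta^{*}; H is the Hessian of the
% training objective (approximated in practice).
\mathcal{I}(z_m, z_q) = \nabla_\theta \mathcal{L}(z_q, \theta^{*})^{\top} \, H^{-1} \, \nabla_\theta \mathcal{L}(z_m, \theta^{*})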

Why Transformers need Adam: A Hessian perspective

Y Zhang, C Chen, T Ding, Z Li… - Advances in Neural …, 2025 - proceedings.neurips.cc
SGD performs worse than Adam by a significant margin on Transformers, but the reason
remains unclear. In this work, we provide an explanation through the lens of the Hessian: (i) …
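
For reference, the two update rules being contrasted are the standard textbook forms below (included only as background; the paper's contribution is the Hessian-based analysis, not these formulas). Adam's per-coordinate rescaling by the second-moment estimate is what lets it cope with parameter blocks whose curvature differs widely.

% SGD step:
\theta_{t+1} = \theta_t - \eta \, g_t
% Adam step (with bias-corrected moment estimates \hat{m}_t, \hat{v}_t):
m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t, \quad
v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2, \quad
\theta_{t+1} = \theta_t - \eta \, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}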

An empirical analysis of compute-optimal large language model training

J Hoffmann, S Borgeaud, A Mensch… - Advances in neural …, 2022 - proceedings.neurips.cc
We investigate the optimal model size and number of tokens for training a transformer
language model under a given compute budget. We find that current large language models …
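
Complementing the allocation sketch above, the parametric form fitted in this analysis for the final loss as a function of parameters N and training tokens D is usually written as below; E, A, B, alpha, beta are fitted constants (values omitted here), with E the irreducible loss and the two power-law terms capturing finite-model and finite-data error.

\hat{L}(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}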

Scaling laws for neural language models

J Kaplan, S McCandlish, T Henighan, TB Brown… - arXiv preprint arXiv …, 2020 - arxiv.org
We study empirical scaling laws for language model performance on the cross-entropy loss.
The loss scales as a power-law with model size, dataset size, and the amount of compute …
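
The power laws the abstract refers to are typically summarized as below, where N_c, D_c, C_c are fitted scale constants; the exponents quoted are the approximate values reported in the paper and should be treated as indicative rather than exact.

L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N}, \quad
L(D) = \left(\frac{D_c}{D}\right)^{\alpha_D}, \quad
L(C_{\min}) = \left(\frac{C_c}{C_{\min}}\right)^{\alpha_C},
\qquad \alpha_N \approx 0.076, \; \alpha_D \approx 0.095, \; \alpha_C \approx 0.050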

Lookahead optimizer: k steps forward, 1 step back

M Zhang, J Lucas, J Ba… - Advances in neural …, 2019 - proceedings.neurips.cc
The vast majority of successful deep neural networks are trained using variants of stochastic
gradient descent (SGD) algorithms. Recent attempts to improve SGD can be broadly …
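
A minimal NumPy sketch of the update scheme named in the title: an inner "fast" optimizer (plain SGD here) takes k steps, then the outer "slow" weights move a fraction alpha toward the fast weights and the fast weights are reset. The toy quadratic objective and the hyperparameter values are illustrative, not taken from the paper.

# Lookahead on a toy quadratic loss L(w) = 0.5 * ||w||^2 (optimum at w = 0).
import numpy as np

def grad(w):
    return w                                # gradient of the toy quadratic loss

def lookahead_sgd(w0, lr=0.1, k=5, alpha=0.5, outer_steps=20):
    slow = np.array(w0, dtype=float)
    for _ in range(outer_steps):
        fast = slow.copy()
        for _ in range(k):                  # k fast steps with the inner optimizer
            fast -= lr * grad(fast)
        slow += alpha * (fast - slow)       # slow step: interpolate toward fast weights
    return slow

if __name__ == "__main__":
    print(lookahead_sgd(np.ones(3)))        # converges toward the optimum at 0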

Resolving discrepancies in compute-optimal scaling of language models

T Porian, M Wortsman, J Jitsev… - Advances in Neural …, 2025 - proceedings.neurips.cc
Kaplan et al. and Hoffmann et al. developed influential scaling laws for the optimal model
size as a function of the compute budget, but these laws yield substantially different …

Gradient norm aware minimization seeks first-order flatness and improves generalization

X Zhang, R Xu, H Yu, H Zou… - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
Recently, flat minima have been shown to be effective for improving generalization, and sharpness-
aware minimization (SAM) achieves state-of-the-art performance. Yet the current definition of …
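
As background for the snippet, the sharpness-aware minimization objective it references is the min-max problem below over an L2 perturbation ball of radius rho; the paper's own "first-order flatness" criterion instead constrains the gradient norm within that neighborhood, so the formula is context for SAM rather than the proposed GAM objective.

% SAM: minimize the worst-case loss within an L2 ball of radius \rho
% around the current parameters \theta.
\min_{\theta} \; \max_{\|\epsilon\|_2 \le \rho} \; L(\theta + \epsilon)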