Pythia: A suite for analyzing large language models across training and scaling
How do large language models (LLMs) develop and evolve over the course of training?
How do these patterns change as models scale? To answer these questions, we introduce …
DeepSeek LLM: Scaling open-source language models with longtermism
The rapid development of open-source large language models (LLMs) has been truly
remarkable. However, the scaling law described in previous literature presents varying …
Training compute-optimal large language models
We investigate the optimal model size and number of tokens for training a transformer
language model under a given compute budget. We find that current large language models …
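A commonly cited takeaway from this work is that model size and training tokens should be scaled in roughly equal proportion, at about 20 tokens per parameter, with training compute approximated as C ≈ 6·N·D FLOPs. The sketch below turns that rule of thumb into a quick calculation; the function name and constants are illustrative assumptions, not the paper's code.

    # Rough compute-optimal sizing under the commonly cited rule of thumb
    # C ≈ 6*N*D FLOPs and D ≈ 20*N tokens (names and constants are assumptions).
    def compute_optimal_sizing(compute_flops, tokens_per_param=20.0):
        n_params = (compute_flops / (6.0 * tokens_per_param)) ** 0.5  # model size N
        n_tokens = tokens_per_param * n_params                        # training tokens D
        return n_params, n_tokens

    # Example: a 1e21 FLOP budget suggests roughly a 3B-parameter model on ~60B tokens.
    print(compute_optimal_sizing(1e21))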
Studying large language model generalization with influence functions
When trying to gain better visibility into a machine learning model in order to understand and
mitigate the associated risks, a potentially valuable source of evidence is: which training …
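For context, the classical influence-function estimate that this line of work scales up attributes a model's behavior on a query to individual training examples; a sketch of the standard formulation (not this paper's exact estimator) is

    I(z_m, z_q) \approx -\nabla_\theta L(z_q, \theta)^\top H^{-1} \nabla_\theta L(z_m, \theta),

where H is the Hessian of the training loss at the trained parameters θ; at LLM scale the inverse Hessian has to be approximated rather than computed exactly.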
Why transformers need adam: A hessian perspective
SGD performs worse than Adam by a significant margin on Transformers, but the reason
remains unclear. In this work, we provide an explanation through the lens of the Hessian: (i) …
An empirical analysis of compute-optimal large language model training
We investigate the optimal model size and number of tokens for training a transformer
language model under a given compute budget. We find that current large language models …
Scaling laws for neural language models
We study empirical scaling laws for language model performance on the cross-entropy loss.
The loss scales as a power-law with model size, dataset size, and the amount of compute …
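For reference, the power-law form reported in this line of work (the constants depend on the setup, so the exponents below are approximate) is

    L(N) \approx (N_c / N)^{\alpha_N},  L(D) \approx (D_c / D)^{\alpha_D},  L(C) \approx (C_c / C)^{\alpha_C},

where N, D, and C are model size, dataset size, and compute, and each fitted exponent α is small (on the order of 0.05-0.1 in the original fits).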
Lookahead optimizer: k steps forward, 1 step back
The vast majority of successful deep neural networks are trained using variants of stochastic
gradient descent (SGD) algorithms. Recent attempts to improve SGD can be broadly …
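The title describes the update pattern: an inner optimizer takes k fast steps, and then the slow weights take one interpolation step toward the fast weights. A minimal sketch of that loop follows; the function names and the inner_step hook are illustrative assumptions, not the authors' reference implementation.

    # Lookahead sketch: k fast steps with any inner optimizer, then one slow step.
    # Assumes numpy-array parameters; inner_step(params, batch) is a user-supplied
    # single update such as one SGD or Adam step (names here are assumptions).
    def lookahead(params, inner_step, batches, k=5, alpha=0.5):
        slow = [p.copy() for p in params]
        fast = [p.copy() for p in params]
        for t, batch in enumerate(batches, start=1):
            fast = inner_step(fast, batch)                            # k steps forward
            if t % k == 0:
                slow = [s + alpha * (f - s) for s, f in zip(slow, fast)]  # 1 step back
                fast = [s.copy() for s in slow]                       # restart fast weights from slow
        return slow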
Resolving discrepancies in compute-optimal scaling of language models
Kaplan et al. and Hoffmann et al. developed influential scaling laws for the optimal model
size as a function of the compute budget, but these laws yield substantially different …
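Concretely, the discrepancy concerns the fitted exponent of the compute-optimal model size, roughly N_opt ∝ C^a with a ≈ 0.73 in the Kaplan et al. fits versus a ≈ 0.5 in the Hoffmann et al. fits (approximate values as commonly cited); the paper examines why the two analyses disagree.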
Gradient norm aware minimization seeks first-order flatness and improves generalization
Recently, flat minima have been shown to be effective for improving generalization, and
sharpness-aware minimization (SAM) achieves state-of-the-art performance. Yet the current definition of …
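For reference, the SAM baseline mentioned in the snippet perturbs the weights toward the locally worst-case nearby point before taking the descent step. The sketch below shows that SAM step, not this paper's gradient-norm-aware variant; the grad hook, rho, and lr are illustrative assumptions.

    # One sharpness-aware minimization (SAM) step, sketched with a numpy parameter
    # vector w and a user-supplied grad(w, batch) function; rho and lr are assumptions.
    import numpy as np

    def sam_step(w, grad, batch, lr=0.1, rho=0.05):
        g = grad(w, batch)
        eps = rho * g / (np.linalg.norm(g) + 1e-12)   # ascend to the worst-case nearby point
        g_sharp = grad(w + eps, batch)                # gradient at the perturbed weights
        return w - lr * g_sharp                       # descend using the sharpness-aware gradient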