A survey on data selection for language models
A major factor in the recent success of large language models is the use of enormous and
ever-growing text datasets for unsupervised pre-training. However, naively training a model …
A survey of large language models
Ever since the Turing Test was proposed in the 1950s, humans have explored the mastering
of language intelligence by machines. Language is essentially a complex, intricate system of …
LESS: Selecting influential data for targeted instruction tuning
Instruction tuning has unlocked powerful capabilities in large language models (LLMs),
effectively using combined datasets to develop general-purpose chatbots. However, real …
Foundational challenges in assuring alignment and safety of large language models
This work identifies 18 foundational challenges in assuring the alignment and safety of large
language models (LLMs). These challenges are organized into three different categories …
The quantization model of neural scaling
We propose the Quantization Model of neural scaling laws, explaining both the observed
power law dropoff of loss with model and data size, and also the sudden …
Not all tokens are what you need for pretraining
Previous language model pre-training methods have uniformly applied a next-token
prediction loss to all training tokens. Challenging this norm, we posit that "Not all tokens in a …
Rho-1: Not all tokens are what you need
Previous language model pre-training methods have uniformly applied a next-token
prediction loss to all training tokens. Challenging this norm, we posit that "Not all tokens in a
corpus are equally important for language model training". Our …
DataComp-LM: In search of the next generation of training sets for language models
We introduce DataComp for Language Models (DCLM), a testbed for controlled dataset
experiments with the goal of improving language models. As part of DCLM, we provide a …
A tale of tails: Model collapse as a change of scaling laws
As AI model size grows, neural scaling laws have become a crucial tool to predict the
improvements of large models when increasing capacity and the size of original (human or …
DsDm: Model-aware dataset selection with datamodels
When selecting data for training large-scale models, standard practice is to filter for
examples that match human notions of data quality. Such filtering yields qualitatively clean …