Datasets for large language models: A comprehensive survey
This paper embarks on an exploration into the Large Language Model (LLM) datasets,
which play a crucial role in the remarkable advancements of LLMs. The datasets serve as …
A survey on data selection for language models
A major factor in the recent success of large language models is the use of enormous and
ever-growing text datasets for unsupervised pre-training. However, naively training a model …
A survey of large language models
Ever since the Turing Test was proposed in the 1950s, humans have explored the mastering
of language intelligence by machine. Language is essentially a complex, intricate system of …
xLSTM: Extended Long Short-Term Memory
In the 1990s, the constant error carousel and gating were introduced as the central ideas of
the Long Short-Term Memory (LSTM). Since then, LSTMs have stood the test of time and …
A pretrainer's guide to training data: Measuring the effects of data age, domain coverage, quality, & toxicity
Pretraining data design is critically under-documented and often guided by empirically
unsupported intuitions. We pretrain models on data curated (1) at different collection …
RedPajama: an open dataset for training large language models
Large language models are increasingly becoming a cornerstone technology in artificial
intelligence, the sciences, and society as a whole, yet the optimal strategies for dataset …
OpenMoE: an early effort on open mixture-of-experts language models
To help the open-source community have a better understanding of Mixture-of-Experts
(MoE) based large language models (LLMs), we train and release OpenMoE, a series of …
D-CPT Law: domain-specific continual pre-training scaling law for large language models
Continual Pre-Training (CPT) on Large Language Models (LLMs) has been widely
used to expand the model's fundamental understanding of specific downstream domains …
A survey of multimodal large language model from a data-centric perspective
Multimodal large language models (MLLMs) enhance the capabilities of standard large
language models by integrating and processing data from multiple modalities, including text …
OpenELM: an efficient language model family with open training and inference framework
The reproducibility and transparency of large language models are crucial for advancing
open research, ensuring the trustworthiness of results, and enabling investigations into data …