[PDF] A survey of large language models
Ever since the Turing Test was proposed in the 1950s, humans have explored how machines
might master language intelligence. Language is essentially a complex, intricate system of …
Scaling Laws for Data Filtering--Data Curation cannot be Compute Agnostic
Vision-language models (VLMs) are trained for thousands of GPU hours on carefully
selected subsets of massive web scrapes. For instance, the LAION public dataset retained …
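The paper's central claim is that the best filtering threshold depends on how much compute you will spend: a small, aggressively filtered pool must be repeated many epochs, and repeated data yields diminishing returns. The sketch below illustrates that trade-off with a toy utility model; the geometric decay constant, quality values, and pool sizes are illustrative assumptions, not the paper's fitted scaling-law parameters.

```python
# Toy model: repeating a small high-quality pool decays its utility, so the
# preferred curation strategy flips as the training budget grows.

def pool_utility(pool_tokens: float, quality: float, budget_tokens: float,
                 decay: float = 0.5) -> float:
    """Utility of training `budget_tokens` on a pool of `pool_tokens` with
    per-token `quality`, where each repeated epoch contributes `decay` times
    the utility of the previous one (assumed geometric decay)."""
    epochs = budget_tokens / pool_tokens
    full = int(epochs)
    utility = quality * pool_tokens * sum(decay**k for k in range(full))
    utility += quality * (budget_tokens - full * pool_tokens) * decay**full
    return utility

# Two candidate curation strategies over a hypothetical 1e12-token scrape:
aggressive = dict(pool_tokens=1e11, quality=1.0)   # keep top 10%, high quality
permissive = dict(pool_tokens=5e11, quality=0.7)   # keep top 50%, lower quality

for budget in (5e10, 5e11, 2e12):
    best = max((aggressive, permissive),
               key=lambda p: pool_utility(budget_tokens=budget, **p))
    print(f"budget={budget:.0e} tokens: prefer pool of "
          f"{best['pool_tokens']:.0e} tokens")
```

At a small budget the aggressive pool wins outright; at budgets several times the pool size, the permissive pool overtakes it because its data is repeated less, which is the sense in which curation "cannot be compute agnostic."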
DataComp-LM: In search of the next generation of training sets for language models
We introduce DataComp for Language Models (DCLM), a testbed for controlled dataset
experiments with the goal of improving language models. As part of DCLM, we provide a …
Scaling synthetic data creation with 1,000,000,000 personas
We propose a novel persona-driven data synthesis methodology that leverages various
perspectives within a large language model (LLM) to create diverse synthetic data. To fully …
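The core mechanic is simple: condition the same generation prompt on many distinct persona descriptions so the model produces diverse outputs. A minimal sketch, assuming a hypothetical `llm(prompt) -> str` completion function; the persona strings and prompt template are illustrative, not entries from the paper's actual persona collection.

```python
def llm(prompt: str) -> str:
    # Stand-in for a real model call; replace with your client of choice.
    return f"<completion conditioned on: {prompt[:48]}...>"

personas = [
    "a pediatric nurse who explains ideas with medical analogies",
    "a retired air-traffic controller obsessed with precise procedures",
    "a high-school student preparing for a math olympiad",
]

template = (
    "Adopt the following persona: {persona}\n"
    "Write one challenging math word problem this person might pose, "
    "followed by a worked solution."
)

# Each persona steers the same template toward a different region of the
# model's output distribution, which is the source of the diversity.
synthetic = [llm(template.format(persona=p)) for p in personas]
```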
CinePile: A long video question answering dataset and benchmark
Current datasets for long-form video understanding often fall short of providing genuine long-
form comprehension challenges, as many tasks derived from these datasets can be …
Zamba: A compact 7B SSM hybrid model
In this technical report, we present Zamba, a novel 7B SSM-transformer hybrid model which
achieves competitive performance against leading open-weight models at a comparable …
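The report describes a stack of Mamba-style SSM blocks with a single shared-weight attention block re-applied at intervals. The sketch below shows only that layer-scheduling pattern; `SSMBlock` and `SharedAttention` are empty placeholders, and the every-6-layers period and 24-layer depth are assumptions for illustration, not Zamba's actual configuration.

```python
class SSMBlock:
    def __call__(self, x):
        return x  # stand-in for a Mamba-style state-space block

class SharedAttention:
    def __call__(self, x):
        return x  # stand-in for one attention block with shared weights

def build_schedule(n_layers: int = 24, period: int = 6):
    shared = SharedAttention()          # one instance, reused (weight sharing)
    schedule = []
    for i in range(n_layers):
        schedule.append(SSMBlock())
        if (i + 1) % period == 0:
            schedule.append(shared)     # same object each time
    return schedule

layers = build_schedule()
print(sum(1 for layer in layers if isinstance(layer, SharedAttention)),
      "applications of the single shared attention block")
```

Reusing one attention module keeps the parameter count close to a pure SSM stack while still giving the model periodic access to global attention.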
Reverse training to nurse the reversal curse
Large language models (LLMs) have a surprising failure: when trained on "A has a feature
B", they do not generalize to "B is a feature of A", which is termed the Reversal Curse. Even …
B", they do not generalize to" B is a feature of A", which is termed the Reversal Curse. Even …
Instruction pre-training: Language models are supervised multitask learners
Unsupervised multitask pre-training has been the critical method behind the recent success
of language models (LMs). However, supervised multitask learning still holds significant …
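The approach implied by the title is to inject supervision into pre-training itself: augment each raw document with synthesized instruction-response pairs and pre-train on the concatenation. A minimal sketch of that data transformation; `synthesize_pairs` stands in for the instruction synthesizer (itself a fine-tuned LM in the paper), and the output format below is an assumption.

```python
def synthesize_pairs(document: str) -> list[tuple[str, str]]:
    # Stand-in: a real synthesizer generates pairs conditioned on the text.
    return [("Summarize the passage.", document[:60] + "...")]

def to_pretraining_example(document: str) -> str:
    """Concatenate a raw document with instruction-response pairs derived
    from it, yielding one instruction-augmented pre-training example."""
    parts = [document]
    for instruction, response in synthesize_pairs(document):
        parts.append(f"Instruction: {instruction}\nResponse: {response}")
    return "\n\n".join(parts)

print(to_pretraining_example(
    "Unsupervised multitask pre-training has been the critical method "
    "behind the recent success of language models."))
```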
A survey on data synthesis and augmentation for large language models
K Wang, J Zhu, M Ren, Z Liu, S Li, Z Zhang… - arXiv preprint arXiv…, 2024 - arxiv.org
The success of Large Language Models (LLMs) is inherently linked to the availability of vast,
diverse, and high-quality data for training and evaluation. However, the growth rate of high …
MATES: Model-aware data selection for efficient pretraining with data influence models
Z Yu, S Das, C **ong - Advances in Neural Information …, 2025 - proceedings.neurips.cc
Pretraining data selection has the potential to improve language model pretraining efficiency
by utilizing higher-quality data from massive web data corpora. Current data selection …
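The title's "data influence models" are small auxiliary models that predict how much each candidate example would help the main model at its current training stage; pretraining then consumes the top-scoring fraction of the pool. A minimal sketch of that selection loop; the random scorer below is a stand-in for the fitted influence model, and `keep_fraction` is an illustrative parameter, not the paper's setting.

```python
import random

def influence_score(example: str) -> float:
    # Stand-in for a small regressor trained to predict each example's
    # influence on the target model's loss at the current training stage.
    return random.random()

def select(pool: list[str], keep_fraction: float = 0.2) -> list[str]:
    """Rank the pool by predicted influence and keep the top fraction."""
    ranked = sorted(pool, key=influence_score, reverse=True)
    return ranked[: max(1, int(len(ranked) * keep_fraction))]

pool = [f"web document {i}" for i in range(100)]
batch = select(pool)
print(f"kept {len(batch)} of {len(pool)} documents")
```

Because influence is model-dependent, the scorer is periodically refreshed as the main model trains, which is what distinguishes this from static quality filters.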