DoReMi: Optimizing data mixtures speeds up language model pretraining
The mixture proportions of pretraining data domains (e.g., Wikipedia, books, web text) greatly
affect language model (LM) performance. In this paper, we propose Domain Reweighting …
Data selection for language models via importance resampling
Selecting a suitable pretraining dataset is crucial for both general-domain (e.g., GPT-3) and
domain-specific (e.g., Codex) language models (LMs). We formalize this problem as selecting …
LESS: Selecting influential data for targeted instruction tuning
Instruction tuning has unlocked powerful capabilities in large language models (LLMs),
effectively using combined datasets to develop general-purpose chatbots. However, real …
Data-efficient Fine-tuning for LLM-based Recommendation
Leveraging Large Language Models (LLMs) for recommendation has recently garnered
considerable attention, where fine-tuning plays a key role in LLMs' adaptation. However, the …
Dataset quantization
State-of-the-art deep neural networks are trained with large amounts (millions or even
billions) of data. The expensive computation and memory costs make it difficult to train them …
DeepCore: A comprehensive library for coreset selection in deep learning
Coreset selection, which aims to select a subset of the most informative training samples, is
a long-standing learning problem that can benefit many downstream tasks such as data …
Quality not quantity: On the interaction between dataset design and robustness of CLIP
Web-crawled datasets have enabled remarkable generalization capabilities in recent image-
text models such as CLIP (Contrastive Language-Image pre-training) or Flamingo, but little …
Data distillation: A survey
The popularity of deep learning has led to the curation of a vast number of massive and
multifarious datasets. Despite having close-to-human performance on individual tasks …
GCR: Gradient coreset based replay buffer selection for continual learning
Continual learning (CL) aims to develop techniques by which a single model adapts to an
increasing number of tasks encountered sequentially, thereby potentially leveraging …
Condensing graphs via one-step gradient matching
As training deep learning models on large datasets takes a lot of time and resources, it is
desirable to construct a small synthetic dataset with which we can train deep learning models …