The Pile: An 800GB dataset of diverse text for language modeling
Recent work has demonstrated that increased training dataset diversity improves general
cross-domain knowledge and downstream generalization capability for large-scale …
Understanding contrastive representation learning through alignment and uniformity on the hypersphere
Contrastive representation learning has been outstandingly successful in practice. In this
work, we identify two key properties related to the contrastive loss: (1) alignment (closeness) …
LaCo: Large language model pruning via layer collapse
Large language models (LLMs) based on the Transformer architecture are witnessing a notable trend of size
expansion, which brings considerable costs to both model training and inference. However …
Datasheet for the pile
This datasheet describes the Pile, a 825 GiB dataset of human-authored text compiled by
EleutherAI for use in large-scale language modeling. The Pile is comprised of 22 different …
Addressing "documentation debt" in machine learning research: A retrospective datasheet for BookCorpus
Recent literature has underscored the importance of dataset documentation work for
machine learning, and part of this work involves addressing "documentation debt" for …
Low frequency names exhibit bias and overfitting in contextualizing language models
We use a dataset of US first names with labels based on predominant gender and racial
group to examine the effect of training corpus frequency on tokenization, contextualization …
Addressing "documentation debt" in machine learning: A retrospective datasheet for BookCorpus
This paper contributes a formal case study in retrospective dataset documentation and
pinpoints several problems with the influential BookCorpus dataset. Recent work has …
LLMs and memorization: On quality and specificity of copyright compliance
Memorization in large language models (LLMs) is a growing concern. LLMs have been
shown to easily reproduce parts of their training data, including copyrighted work. This is an …
Never too late to learn: Regularizing gender bias in coreference resolution
Leveraging pre-trained language models (PLMs) as initializers for efficient transfer learning
has become a universal approach for text-related tasks. However, the models not only learn …
LogiGAN: Learning logical reasoning via adversarial pre-training
We present LogiGAN, an unsupervised adversarial pre-training framework for improving
logical reasoning abilities of language models. Upon automatic identification of logical …