The Pile: An 800GB dataset of diverse text for language modeling

L Gao, S Biderman, S Black, L Golding… - arXiv preprint arXiv …, 2020 - arxiv.org
Recent work has demonstrated that increased training dataset diversity improves general
cross-domain knowledge and downstream generalization capability for large-scale …

Understanding contrastive representation learning through alignment and uniformity on the hypersphere

T Wang, P Isola - International conference on machine …, 2020 - proceedings.mlr.press
Contrastive representation learning has been outstandingly successful in practice. In this
work, we identify two key properties related to the contrastive loss: (1) alignment (closeness) …

LaCo: Large language model pruning via layer collapse

Y Yang, Z Cao, H Zhao - arXiv preprint arXiv:2402.11187, 2024 - arxiv.org
Large language models (LLMs) based on the transformer are witnessing a notable trend of size
expansion, which brings considerable costs to both model training and inference. However …

Datasheet for the Pile

S Biderman, K Bicheno, L Gao - arXiv preprint arXiv:2201.07311, 2022 - arxiv.org
This datasheet describes the Pile, an 825 GiB dataset of human-authored text compiled by
EleutherAI for use in large-scale language modeling. The Pile comprises 22 different …

Addressing" documentation debt" in machine learning research: A retrospective datasheet for bookcorpus

J Bandy, N Vincent - arXiv preprint arXiv:2105.05241, 2021 - arxiv.org
Recent literature has underscored the importance of dataset documentation work for
machine learning, and part of this work involves addressing "documentation debt" for …

Low-frequency names exhibit bias and overfitting in contextualizing language models

R Wolfe, A Caliskan - arXiv preprint arXiv:2110.00672, 2021 - arxiv.org
We use a dataset of US first names with labels based on predominant gender and racial
group to examine the effect of training corpus frequency on tokenization, contextualization …

Addressing" documentation debt" in machine learning: A retrospective datasheet for bookcorpus

J Bandy, N Vincent - Thirty-fifth Conference on Neural Information …, 2021 - openreview.net
This paper contributes a formal case study in retrospective dataset documentation and
pinpoints several problems with the influential BookCorpus dataset. Recent work has …

LLMs and memorization: On quality and specificity of copyright compliance

FB Mueller, R Görge, AK Bernzen, JC Pirk… - Proceedings of the …, 2024 - ojs.aaai.org
Memorization in large language models (LLMs) is a growing concern. LLMs have been
shown to easily reproduce parts of their training data, including copyrighted work. This is an …

Never too late to learn: Regularizing gender bias in coreference resolution

SY Park, K Choi, H Yu, Y Ko - … Conference on Web Search and Data …, 2023 - dl.acm.org
Leveraging pre-trained language models (PLMs) as initializers for efficient transfer learning
has become a universal approach for text-related tasks. However, the models not only learn …

LogiGAN: Learning logical reasoning via adversarial pre-training

X Pi, W Zhong, Y Gao, N Duan… - Advances in Neural …, 2022 - proceedings.neurips.cc
We present LogiGAN, an unsupervised adversarial pre-training framework for improving
logical reasoning abilities of language models. Upon automatic identification of logical …