On efficient training of large-scale deep learning models: A literature review

L Shen, Y Sun, Z Yu, L Ding, X Tian, D Tao - arXiv preprint arXiv …, 2023 - arxiv.org
The field of deep learning has witnessed significant progress, particularly in computer vision
(CV), natural language processing (NLP), and speech. The use of large-scale models …

OBELICS: An open web-scale filtered dataset of interleaved image-text documents

H Laurençon, L Saulnier, L Tronchon… - Advances in …, 2023 - proceedings.neurips.cc
Large multimodal models trained on natural documents, which interleave images and text,
outperform models trained on image-text pairs on various multimodal benchmarks …

Scaling data-constrained language models

N Muennighoff, A Rush, B Barak… - Advances in …, 2023 - proceedings.neurips.cc
The current trend of scaling language models involves increasing both parameter count and
training dataset size. Extrapolating this trend suggests that training dataset size may soon be …

The BigScience ROOTS corpus: A 1.6TB composite multilingual dataset

H Laurençon, L Saulnier, T Wang… - Advances in …, 2022 - proceedings.neurips.cc
As language models grow ever larger, the need for large-scale high-quality text datasets has
never been more pressing, especially in multilingual settings. The BigScience workshop, a 1 …

Documenting large webtext corpora: A case study on the colossal clean crawled corpus

J Dodge, M Sap, A Marasović, W Agnew… - arXiv preprint arXiv …, 2021 - arxiv.org
Large language models have led to remarkable progress on many NLP tasks, and
researchers are turning to ever-larger text corpora to train them. Some of the largest corpora …

The interplay of variant, size, and task type in Arabic pre-trained language models

G Inoue, B Alhafni, N Baimukan, H Bouamor… - arXiv preprint arXiv …, 2021 - arxiv.org
In this paper, we explore the effects of language variants, data sizes, and fine-tuning task
types in Arabic pre-trained language models. To do so, we build three pre-trained language …

CulturaX: A cleaned, enormous, and multilingual dataset for large language models in 167 languages

T Nguyen, C Van Nguyen, VD Lai, H Man… - arXiv preprint arXiv …, 2023 - arxiv.org
The driving factors behind the development of large language models (LLMs) with
impressive learning capabilities are their colossal model sizes and extensive training …

KUISAIL at SemEval-2020 Task 12: BERT-CNN for offensive speech identification in social media

A Safaya, M Abdullatif, D Yuret - arXiv preprint arXiv:2007.13184, 2020 - arxiv.org
In this paper, we describe our approach to utilize pre-trained BERT models with
Convolutional Neural Networks for sub-task A of the Multilingual Offensive Language …

Small data? No problem! Exploring the viability of pretrained multilingual language models for low-resourced languages

K Ogueji, Y Zhu, J Lin - Proceedings of the 1st workshop on …, 2021 - aclanthology.org
Pretrained multilingual language models have been shown to work well on many languages
for a variety of downstream NLP tasks. However, these models are known to require a lot of …

AI psychometrics: Assessing the psychological profiles of large language models through psychometric inventories

M Pellert, CM Lechner, C Wagner… - Perspectives on …, 2024 - journals.sagepub.com
We illustrate how standard psychometric inventories originally designed for assessing
noncognitive human traits can be repurposed as diagnostic tools to evaluate analogous …