CulturaX: A cleaned, enormous, and multilingual dataset for large language models in 167 languages

T Nguyen, C Van Nguyen, VD Lai, H Man… - arXiv preprint arXiv …, 2023 - arxiv.org
The driving factors behind the development of large language models (LLMs) with
impressive learning capabilities are their colossal model sizes and extensive training …

What language model to train if you have one million GPU hours?

TL Scao, T Wang, D Hesslow, L Saulnier… - arXiv preprint arXiv …, 2022 - arxiv.org
The crystallization of modeling methods around the Transformer architecture has been a
boon for practitioners. Simple, well-motivated architectural variations can transfer across …

BLOOM+1: Adding language support to BLOOM for zero-shot prompting

ZX Yong, H Schoelkopf, N Muennighoff, AF Aji… - arXiv preprint arXiv …, 2022 - arxiv.org
The BLOOM model is a large publicly available multilingual language model, but its
pretraining was limited to 46 languages. To extend the benefits of BLOOM to other …

A critical analysis of the largest source for generative ai training data: Common crawl

S Baack - Proceedings of the 2024 ACM Conference on Fairness …, 2024 - dl.acm.org
Common Crawl is the largest freely available collection of web crawl data and one of the
most important sources of pre-training data for large language models (LLMs). It is used so …

Representation in AI evaluations

AS Bergman, LA Hendricks, M Rauh, B Wu… - Proceedings of the …, 2023 - dl.acm.org
Calls for representation in artificial intelligence (AI) and machine learning (ML) are
widespread, with "representation" or "representativeness" generally understood to be both …

LoNAS: Elastic low-rank adapters for efficient large language models

JP Munoz, J Yuan, Y Zheng, N Jain - Proceedings of the 2024 …, 2024 - aclanthology.org
Large Language Models (LLMs) continue to grow, reaching hundreds of billions of
parameters and making it challenging for Deep Learning practitioners with resource …

PIVOINE: Instruction tuning for open-world entity profiling

K Lu, X Pan, K Song, H Zhang, D Yu… - Findings of the …, 2023 - aclanthology.org
This work considers the problem of Open-world Entity Profiling, a sub-domain of Open-world
Information Extraction (Open-world IE). Unlike the conventional closed-world IE, Open-world …

PIVOINE: Instruction tuning for open-world information extraction

K Lu, X Pan, K Song, H Zhang, D Yu, J Chen - arXiv preprint arXiv …, 2023 - arxiv.org
We consider the problem of Open-world Information Extraction (Open-world IE), which
extracts comprehensive entity profiles from unstructured texts. Different from the …

Spacerini: Plug-and-play search engines with Pyserini and Hugging Face

C Akiki, O Ogundepo, A Piktus, X Zhang… - arXiv preprint arXiv …, 2023 - arxiv.org
We present Spacerini, a modular framework for seamless building and deployment of
interactive search applications, designed to facilitate the qualitative analysis of large scale …

The Nordic Pile: A 1.2 TB Nordic dataset for language modeling

J Öhman, S Verlinden, A Ekgren, AC Gyllensten… - arXiv preprint arXiv …, 2023 - arxiv.org
Pre-training Large Language Models (LLMs) requires massive amounts of text data, and the
performance of the LLMs typically correlates with the scale and quality of the datasets. This …