The RefinedWeb dataset for Falcon LLM: Outperforming curated corpora with web data only

G Penedo, Q Malartic, D Hesslow… - Advances in …, 2023 - proceedings.neurips.cc
Large language models are commonly trained on a mixture of filtered web data and
curated "high-quality" corpora, such as social media conversations, books, or technical …

D4: Improving LLM pretraining via document de-duplication and diversification

K Tirumala, D Simig, A Aghajanyan… - Advances in Neural …, 2023 - proceedings.neurips.cc
Over recent years, an increasing amount of compute and data has been poured into training
large language models (LLMs), usually by doing one-pass learning on as many tokens as …

Dolma: An open corpus of three trillion tokens for language model pretraining research

L Soldaini, R Kinney, A Bhagia, D Schwenk… - arXiv preprint arXiv …, 2024 - arxiv.org
Information about pretraining corpora used to train the current best-performing language
models is seldom discussed: commercial models rarely detail their data, and even open …

VerilogEval: Evaluating large language models for Verilog code generation

M Liu, N Pinckney, B Khailany… - 2023 IEEE/ACM …, 2023 - ieeexplore.ieee.org
The increasing popularity of large language models (LLMs) has paved the way for their
application in diverse domains. This paper proposes a benchmarking framework tailored …