The RefinedWeb dataset for Falcon LLM: outperforming curated corpora with web data, and web data only G Penedo, Q Malartic, D Hesslow, R Cojocaru, A Cappelli, H Alobeidli, ... arXiv preprint arXiv:2306.01116, 2023 | 740 | 2023 |
The falcon series of open language models E Almazrouei, H Alobeidli, A Alshamsi, A Cappelli, R Cojocaru, M Debbah, ... arXiv preprint arXiv:2311.16867, 2023 | 435 | 2023 |
Falcon-40B: an open large language model with state-of-the-art performance E Almazrouei, H Alobeidli, A Alshamsi, A Cappelli, R Cojocaru, M Debbah, ... | 249 | 2023 |
The refinedweb dataset for falcon llm: Outperforming curated corpora with web data only G Penedo, Q Malartic, D Hesslow, R Cojocaru, H Alobeidli, A Cappelli, ... Advances in Neural Information Processing Systems 36, 79155-79172, 2023 | 114 | 2023 |
The fineweb datasets: Decanting the web for the finest text data at scale G Penedo, H Kydlíček, A Lozhkov, M Mitchell, C Raffel, L Von Werra, ... The Thirty-eight Conference on Neural Information Processing Systems …, 2024 | 57 | 2024 |
The fineweb datasets: Decanting the web for the finest text data at scale G Penedo, H Kydlíček, L Ben Allal, A Lozhkov, M Mitchell, C Raffel, L Von Werra, T Wolf 2024 | 52 | 2024 |
The falcon series of language models: Towards open frontier models E Almazrouei, H Alobeidli, A Alshamsi, A Cappelli, R Cojocaru, ... Hugging Face repository, 2023 | 38 | 2023 |
The refinedweb dataset for falcon llm: Outperforming curated corpora with web data, and web data only G Penedo, Q Malartic, D Hesslow, R Cojocaru, A Cappelli, H Alobeidli, ... arXiv preprint arXiv:2306.01116, 2023 | 34 | 2023 |
The falcon series of open language models E Almazrouei, H Alobeidli, A Alshamsi, A Cappelli, R Cojocaru, M Debbah, ... arXiv preprint arXiv:2311.16867, 2023 | 21 | 2023 |
The fineweb datasets: Decanting the web for the finest text data at … G Penedo, H Kydlíček, L Ben Allal, A Lozhkov, M Mitchell, C Raffel, L Von Werra, T Wolf arXiv preprint arXiv:2406.17557, 2024 | 18 | 2024 |
AlGhafa evaluation benchmark for Arabic language models E Almazrouei, R Cojocaru, M Baldo, Q Malartic, H Alobeidli, D Mazzotta, ... Proceedings of ArabicNLP 2023, 244-275, 2023 | 13 | 2023 |
SmolLM2: When Smol Goes Big--Data-Centric Training of a Small Language Model LB Allal, A Lozhkov, E Bakouch, GM Blázquez, G Penedo, L Tunstall, ... arXiv preprint arXiv:2502.02737, 2025 | 3 | 2025 |
Towards Best Practices for Open Datasets for LLM Training S Baack, S Biderman, K Odrozek, A Skowron, A Bdeir, J Bommarito, ... arXiv preprint arXiv:2501.08365, 2025 | | 2025 |
Artery in Microgravity (AIM): Assembly, integration, and testing for a student payload for the ISS L García Mozos, D Saroya, Y Roelvink, N Santos D'Amore, S Gabetti, ... 4th Symposium on Space Educational Activities, 2022 | | 2022 |