Guilherme Penedo
ML Research Engineer at 🤗 HuggingFace
Verified email at huggingface.co
Title
Cited by
Year
The RefinedWeb dataset for Falcon LLM: outperforming curated corpora with web data, and web data only
G Penedo, Q Malartic, D Hesslow, R Cojocaru, A Cappelli, H Alobeidli, ...
arXiv preprint arXiv:2306.01116, 2023
Cited by: 740; Year: 2023
The falcon series of open language models
E Almazrouei, H Alobeidli, A Alshamsi, A Cappelli, R Cojocaru, M Debbah, ...
arXiv preprint arXiv:2311.16867, 2023
Cited by: 435; Year: 2023
Falcon-40B: an open large language model with state-of-the-art performance
E Almazrouei, H Alobeidli, A Alshamsi, A Cappelli, R Cojocaru, M Debbah, ...
Cited by: 249; Year: 2023
The refinedweb dataset for falcon llm: Outperforming curated corpora with web data only
G Penedo, Q Malartic, D Hesslow, R Cojocaru, H Alobeidli, A Cappelli, ...
Advances in Neural Information Processing Systems 36, 79155-79172, 2023
Cited by: 114; Year: 2023
The fineweb datasets: Decanting the web for the finest text data at scale
G Penedo, H Kydlíček, A Lozhkov, M Mitchell, C Raffel, L Von Werra, ...
The Thirty-eight Conference on Neural Information Processing Systems …, 2024
Cited by: 57; Year: 2024
Loubna Ben allal, Anton Lozhkov, Margaret Mitchell, Colin Raffel, Leandro Von Werra, and Thomas Wolf
G Penedo, H Kydlíček
The fineweb datasets: Decanting the web for the finest text data at scale 6, 2024
Cited by: 52; Year: 2024
The falcon series of language models: Towards open frontier models
E Almazrouei, H Alobeidli, A Alshamsi, A Cappelli, R Cojocaru, ...
Hugging Face repository, 2023
Cited by: 38; Year: 2023
The refinedweb dataset for falcon llm: Outperforming curated corpora with web data, and web data only. arXiv 2023
G Penedo, Q Malartic, D Hesslow, R Cojocaru, A Cappelli, H Alobeidli, ...
arXiv preprint arXiv:2306.01116
Cited by: 34
The falcon series of open language models, 2023
E Almazrouei, H Alobeidli, A Alshamsi, A Cappelli, R Cojocaru, M Debbah, ...
arXiv preprint arXiv:2311.16867, 2023
Cited by: 21; Year: 2023
Loubna Ben allal, Anton Lozhkov, Margaret Mitchell, Colin Raffel, Leandro Von Werra, and Thomas Wolf. 2024. The fineweb datasets: Decanting the web for the finest text data at …
G Penedo, H Kydlíček
arXiv preprint arXiv:2406.17557
Cited by: 18
AlGhafa evaluation benchmark for Arabic language models
E Almazrouei, R Cojocaru, M Baldo, Q Malartic, H Alobeidli, D Mazzotta, ...
Proceedings of ArabicNLP 2023, 244-275, 2023
Cited by: 13; Year: 2023
SmolLM2: When Smol Goes Big--Data-Centric Training of a Small Language Model
LB Allal, A Lozhkov, E Bakouch, GM Blázquez, G Penedo, L Tunstall, ...
arXiv preprint arXiv:2502.02737, 2025
Cited by: 3; Year: 2025
Towards Best Practices for Open Datasets for LLM Training
S Baack, S Biderman, K Odrozek, A Skowron, A Bdeir, J Bommarito, ...
arXiv preprint arXiv:2501.08365, 2025
Year: 2025
Artery in Microgravity (AIM): Assembly, integration, and testing for a student payload for the ISS
L García Mozos, D Saroya, Y Roelvink, N Santos D'Amore, S Gabetti, ...
4th Symposium on Space Educational Activities, 2022
Year: 2022
Articles 1–14