Guilherme Penedo - Google Scholar

Cited by

	All	Since 2020
Citations	1774	1774
h-index	11	11
i10-index	11	11

Citations per year: 2022: 5, 2023: 308, 2024: 1284, 2025: 167

Guilherme Penedo

ML Research Engineer at 🤗 HuggingFace

Verified email at huggingface.co


Title	Cited by	Year
The RefinedWeb dataset for Falcon LLM: outperforming curated corpora with web data, and web data only G Penedo, Q Malartic, D Hesslow, R Cojocaru, A Cappelli, H Alobeidli, ... arXiv preprint arXiv:2306.01116, 2023	740	2023
The falcon series of open language models E Almazrouei, H Alobeidli, A Alshamsi, A Cappelli, R Cojocaru, M Debbah, ... arXiv preprint arXiv:2311.16867, 2023	435	2023
Falcon-40B: an open large language model with state-of-the-art performance E Almazrouei, H Alobeidli, A Alshamsi, A Cappelli, R Cojocaru, M Debbah, ...	249	2023
The refinedweb dataset for falcon llm: Outperforming curated corpora with web data only G Penedo, Q Malartic, D Hesslow, R Cojocaru, H Alobeidli, A Cappelli, ... Advances in Neural Information Processing Systems 36, 79155-79172, 2023	114	2023
The fineweb datasets: Decanting the web for the finest text data at scale G Penedo, H Kydlíček, A Lozhkov, M Mitchell, C Raffel, L Von Werra, ... The Thirty-eight Conference on Neural Information Processing Systems …, 2024	57	2024
Loubna Ben allal, Anton Lozhkov, Margaret Mitchell, Colin Raffel, Leandro Von Werra, and Thomas Wolf G Penedo, H Kydlıcek The fineweb datasets: Decanting the web for the finest text data at scale 6, 2024	52	2024
The falcon series of language models: Towards open frontier models E Almazrouei, H Alobeidli, A Alshamsi, A Cappelli, R Cojocaru, ... Hugging Face repository, 2023	38	2023
The refinedweb dataset for falcon llm: Outperforming curated corpora with web data, and web data only. arXiv 2023 G Penedo, Q Malartic, D Hesslow, R Cojocaru, A Cappelli, H Alobeidli, ... arXiv preprint arXiv:2306.01116, 0	34
The falcon series of open language models, 2023 E Almazrouei, H Alobeidli, A Alshamsi, A Cappelli, R Cojocaru, M Debbah, ... arXiv preprint arXiv:2311.16867, 2023	21	2023
Loubna Ben allal, Anton Lozhkov, Margaret Mitchell, Colin Raffel, Leandro Von Werra, and Thomas Wolf. 2024. The fineweb datasets: Decanting the web for the finest text data at … G Penedo, H Kydlícek arXiv preprint arXiv:2406.17557, 0	18
AlGhafa evaluation benchmark for Arabic language models E Almazrouei, R Cojocaru, M Baldo, Q Malartic, H Alobeidli, D Mazzotta, ... Proceedings of ArabicNLP 2023, 244-275, 2023	13	2023
SmolLM2: When Smol Goes Big--Data-Centric Training of a Small Language Model LB Allal, A Lozhkov, E Bakouch, GM Blázquez, G Penedo, L Tunstall, ... arXiv preprint arXiv:2502.02737, 2025	3	2025
Towards Best Practices for Open Datasets for LLM Training S Baack, S Biderman, K Odrozek, A Skowron, A Bdeir, J Bommarito, ... arXiv preprint arXiv:2501.08365, 2025		2025
Artery in Microgravity (AIM): Assembly, integration, and testing for a student payload for the ISS L García Mozos, D Saroya, Y Roelvink, N Santos D'Amore, S Gabetti, ... 4th Symposium on Space Educational Activities, 2022		2022

Articles 1–14