Pedro Ortiz Suarez
Other names: Pedro Javier Ortiz Suárez
Senior Research Scientist, Common Crawl Foundation
Verified email at commoncrawl.org
Title
Cited by
Year
Bloom: A 176b-parameter open-access multilingual language model
T Le Scao, A Fan, C Akiki, E Pavlick, S Ilić, D Hesslow, R Castagné, ...
Cited by: 1801; Year: 2023
CamemBERT: a Tasty French Language Model
L Martin, B Muller, PJ Ortiz Suárez, Y Dupont, L Romary, ...
Proceedings of the 58th Annual Meeting of the Association for Computational …, 2020
Cited by: 1329; Year: 2020
Asynchronous Pipeline for Processing Huge Corpora on Medium to Low Resource Infrastructures
PJ Ortiz Suárez, B Sagot, L Romary
7th Workshop on the Challenges in the Management of Large Corpora, 2019
Cited by: 511*; Year: 2019
Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets
J Kreutzer, I Caswell, L Wang, A Wahab, D van Esch, N Ulzii-Orshikh, ...
Transactions of the Association for Computational Linguistics 10, 50-72, 2022
Cited by: 299*; Year: 2022
A Monolingual Approach to Contextualized Word Embeddings for Mid-Resource Languages
PJ Ortiz Suárez, L Romary, B Sagot
Proceedings of the 58th Annual Meeting of the Association for Computational …, 2020
Cited by: 253*; Year: 2020
Towards a Cleaner Document-Oriented Multilingual Crawled Corpus
J Abadji, P Ortiz Suarez, L Romary, B Sagot
arXiv preprint arXiv:2201.06642, 2022
Cited by: 193; Year: 2022
The bigscience roots corpus: A 1.6 tb composite multilingual dataset
H Laurençon, L Saulnier, T Wang, C Akiki, A Villanova del Moral, ...
Advances in Neural Information Processing Systems 35, 31809-31826, 2022
Cited by: 192; Year: 2022
Ungoliant: An optimized pipeline for the generation of a very large-scale multilingual web corpus
J Abadji, PJO Suárez, L Romary, B Sagot
CMLC 2021-9th Workshop on Challenges in the Management of Large Corpora, 2021
Cited by: 69; Year: 2021
Building a user-generated content north-african arabizi treebank: Tackling hell
D Seddah, F Essaidi, A Fethi, M Futeral, B Muller, PJ Ortiz Suárez, ...
Proceedings of the 58th Annual Meeting of the Association for Computational …, 2020
Cited by: 54; Year: 2020
Tokenizer choice for llm training: Negligible or crucial?
M Ali, M Fromm, K Thellmann, R Rutmann, M Lübbering, J Leveling, ...
Findings of the Association for Computational Linguistics: NAACL 2024, 3907-3924, 2024
Cited by: 28; Year: 2024
Establishing a New State-of-the-Art for French Named Entity Recognition
PJ Ortiz Suárez, Y Dupont, B Muller, L Romary, B Sagot
Proceedings of The 12th Language Resources and Evaluation Conference, 4631–4638, 2020
Cited by: 26*; Year: 2020
Documenting Geographically and Contextually Diverse Data Sources: The BigScience Catalogue of Language Data and Resources
A McMillan-Major, Z Alyafeai, S Biderman, K Chen, F De Toni, G Dupont, ...
arXiv preprint arXiv:2201.10066, 2022
Cited by: 19; Year: 2022
From FreEM to D'AlemBERT: a Large Corpus and a Language Model for Early Modern French
S Gabay, P Ortiz Suarez, A Bartz, A Chagué, R Bawden, P Gambette, ...
arXiv preprint arXiv:2202.09452, 2022
Cited by: 16; Year: 2022
Automatic extraction of materials and properties from superconductors scientific literature
L Foppiano, PB Castro, P Ortiz Suarez, K Terashima, Y Takano, M Ishii
Science and Technology of Advanced Materials: Methods 3 (1), 2153633, 2023
Cited by: 15; Year: 2023
Perplexed by quality: A perplexity-based method for adult and harmful content detection in multilingual heterogeneous web data
T Jansen, Y Tong, V Zevallos, PO Suarez
arXiv preprint arXiv:2212.10440, 2022
Cited by: 14; Year: 2022
Bertrade: Using contextual embeddings to parse old french
L Grobol, M Regnault, PO Suarez, B Sagot, L Romary, B Crabbé
13th Language Resources and Evaluation Conference, 2022
Cited by: 11; Year: 2022
Les modèles de langue contextuels Camembert pour le français: impact de la taille et de l'hétérogénéité des données d'entrainement
L Martin, B Muller, PJ Ortiz Suárez, Y Dupont, L Romary, E Clergerie, ...
Actes de la 6e conférence conjointe Journées d'Études sur la Parole (JEP …, 2020
Cited by: 11; Year: 2020
SinNer@CLEF-HIPE2020: Sinful Adaptation of SotA models for Named Entity Recognition in Historical French and German Newspapers
PJ Ortiz Suárez, Y Dupont, G Lejeune, T Tian
CLEF 2020 Working Notes 2696, 2020
Cited by: 8*; Year: 2020
Semi-automatic staging area for high-quality structured data extraction from scientific literature
L Foppiano, T Mato, K Terashima, P Ortiz Suarez, T Tou, C Sakai, ...
Science and Technology of Advanced Materials: Methods 3 (1), 2286219, 2023
Cited by: 3; Year: 2023
French Contextualized Word-Embeddings with a sip of CaBeRnet: a New French Balanced Reference Corpus
M Popa-Fabre, PJ Ortiz Suárez, B Sagot, ÉV de la Clergerie
Proceedings of the 8th Workshop on Challenges in the Management of Large …, 2020
Cited by: 3; Year: 2020