Pedro Ortiz Suarez
Other names: Pedro Javier Ortiz Suárez
Senior Research Scientist, Common Crawl Foundation
Verified email at commoncrawl.org
Title
Cited by
Year
Bloom: A 176b-parameter open-access multilingual language model
T Le Scao, A Fan, C Akiki, E Pavlick, S Ilić, D Hesslow, R Castagné, ...
Cited by: 1787, Year: 2023
CamemBERT: a Tasty French Language Model
L Martin, B Muller, PJ Ortiz Suárez, Y Dupont, L Romary, ...
Proceedings of the 58th Annual Meeting of the Association for Computational …, 2020
Cited by: 1327, Year: 2020
Asynchronous Pipeline for Processing Huge Corpora on Medium to Low Resource Infrastructures
PJ Ortiz Suárez, B Sagot, L Romary
7th Workshop on the Challenges in the Management of Large Corpora, 2019
Cited by: 510*, Year: 2019
Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets
J Kreutzer, I Caswell, L Wang, A Wahab, D van Esch, N Ulzii-Orshikh, ...
Transactions of the Association for Computational Linguistics 10, 50-72, 2022
Cited by: 296*, Year: 2022
A Monolingual Approach to Contextualized Word Embeddings for Mid-Resource Languages
PJ Ortiz Suárez, L Romary, B Sagot
Proceedings of the 58th Annual Meeting of the Association for Computational …, 2020
Cited by: 253*, Year: 2020
Towards a Cleaner Document-Oriented Multilingual Crawled Corpus
J Abadji, P Ortiz Suarez, L Romary, B Sagot
arXiv preprint arXiv:2201.06642, 2022
Cited by: 192, Year: 2022
The bigscience roots corpus: A 1.6 tb composite multilingual dataset
H Laurençon, L Saulnier, T Wang, C Akiki, A Villanova del Moral, ...
Advances in Neural Information Processing Systems 35, 31809-31826, 2022
Cited by: 189, Year: 2022
Ungoliant: An optimized pipeline for the generation of a very large-scale multilingual web corpus
J Abadji, PJO Suárez, L Romary, B Sagot
CMLC 2021-9th Workshop on Challenges in the Management of Large Corpora, 2021
Cited by: 68, Year: 2021
Building a user-generated content north-african arabizi treebank: Tackling hell
D Seddah, F Essaidi, A Fethi, M Futeral, B Muller, PJ Ortiz Suárez, ...
Proceedings of the 58th Annual Meeting of the Association for Computational …, 2020
Cited by: 54, Year: 2020
Tokenizer choice for llm training: Negligible or crucial?
M Ali, M Fromm, K Thellmann, R Rutmann, M Lübbering, J Leveling, ...
Findings of the Association for Computational Linguistics: NAACL 2024, 3907-3924, 2024
Cited by: 28, Year: 2024
Establishing a New State-of-the-Art for French Named Entity Recognition
PJ Ortiz Suárez, Y Dupont, B Muller, L Romary, B Sagot
Proceedings of The 12th Language Resources and Evaluation Conference, 4631–4638, 2020
Cited by: 26*, Year: 2020
Documenting Geographically and Contextually Diverse Data Sources: The BigScience Catalogue of Language Data and Resources
A McMillan-Major, Z Alyafeai, S Biderman, K Chen, F De Toni, G Dupont, ...
arXiv preprint arXiv:2201.10066, 2022
Cited by: 19, Year: 2022
From FreEM to D'AlemBERT: a Large Corpus and a Language Model for Early Modern French
S Gabay, P Ortiz Suarez, A Bartz, A Chagué, R Bawden, P Gambette, ...
arXiv preprint arXiv:2202.09452, 2022
Cited by: 16, Year: 2022
Automatic extraction of materials and properties from superconductors scientific literature
L Foppiano, PB Castro, P Ortiz Suarez, K Terashima, Y Takano, M Ishii
Science and Technology of Advanced Materials: Methods 3 (1), 2153633, 2023
Cited by: 15, Year: 2023
Perplexed by quality: A perplexity-based method for adult and harmful content detection in multilingual heterogeneous web data
T Jansen, Y Tong, V Zevallos, PO Suarez
arXiv preprint arXiv:2212.10440, 2022
Cited by: 14, Year: 2022
Bertrade: Using contextual embeddings to parse old french
L Grobol, M Regnault, PO Suarez, B Sagot, L Romary, B Crabbé
13th Language Resources and Evaluation Conference, 2022
Cited by: 11, Year: 2022
Les modèles de langue contextuels Camembert pour le français: impact de la taille et de l'hétérogénéité des données d'entrainement
L Martin, B Muller, PJ Ortiz Suárez, Y Dupont, L Romary, E Clergerie, ...
Actes de la 6e conférence conjointe Journées d'Études sur la Parole (JEP …, 2020
Cited by: 11, Year: 2020
SinNer@CLEF-HIPE2020: Sinful Adaptation of SotA models for Named Entity Recognition in Historical French and German Newspapers
PJ Ortiz Suárez, Y Dupont, G Lejeune, T Tian
CLEF 2020 Working Notes 2696, 2020
Cited by: 8*, Year: 2020
Semi-automatic staging area for high-quality structured data extraction from scientific literature
L Foppiano, T Mato, K Terashima, P Ortiz Suarez, T Tou, C Sakai, ...
Science and Technology of Advanced Materials: Methods 3 (1), 2286219, 2023
Cited by: 3, Year: 2023
French Contextualized Word-Embeddings with a sip of CaBeRnet: a New French Balanced Reference Corpus
M Popa-Fabre, PJ Ortiz Suárez, B Sagot, ÉV de la Clergerie
Proceedings of the 8th Workshop on Challenges in the Management of Large …, 2020
Cited by: 3, Year: 2020