Between words and characters: A brief history of open-vocabulary modeling and tokenization in NLP

SJ Mielke, Z Alyafeai, E Salesky, C Raffel… - arXiv preprint arXiv …, 2021 - arxiv.org
What are the units of text that we want to model? From bytes to multi-word expressions, text
can be analyzed and generated at many granularities. Until recently, most natural language …

Languages through the looking glass of BPE compression

X Gutierrez-Vasques, C Bentz… - Computational …, 2023 - direct.mit.edu
Byte-pair encoding (BPE) is widely used in NLP for performing subword tokenization. It
uncovers redundant patterns for compressing the data, and hence alleviates the sparsity …

How can NLP help revitalize endangered languages? A case study and roadmap for the Cherokee language

S Zhang, B Frey, M Bansal - arXiv preprint arXiv:2204.11909, 2022 - arxiv.org
More than 43% of the languages spoken in the world are endangered, and language loss
currently occurs at an accelerated rate because of globalization and neocolonialism. Saving …

DivEMT: Neural machine translation post-editing effort across typologically diverse languages

G Sarti, A Bisazza, AG Arenas, A Toral - arXiv preprint arXiv:2205.12215, 2022 - arxiv.org
We introduce DivEMT, the first publicly available post-editing study of Neural Machine
Translation (NMT) over a typologically diverse set of target languages. Using a strictly …

A survey on text classification: Practical perspectives on the Italian language

A Gasparetto, A Zangari, M Marcuzzo, A Albarelli - PLOS ONE, 2022 - journals.plos.org
Text Classification methods have been improving at an unparalleled speed in the last
decade thanks to the success brought about by deep learning. Historically, state-of-the-art …

Beyond characters: Subword-level morpheme segmentation

B Peters, AFT Martins - … of the 19th SIGMORPHON Workshop on …, 2022 - aclanthology.org
This paper presents DeepSPIN's submissions to the SIGMORPHON 2022 Shared Task on
Morpheme Segmentation. We make three submissions, all to the word-level subtask. First …

Quantifying synthesis and fusion and their impact on machine translation

A Oncevay, D Ataman, N Van Berkel, B Haddow… - arXiv preprint arXiv …, 2022 - arxiv.org
Theoretical work in morphological typology offers the possibility of measuring morphological
diversity on a continuous scale. However, literature in Natural Language Processing (NLP) …

DeB3RTa: A Transformer-Based Model for the Portuguese Financial Domain

H Pires, L Paucar, JP Carvalho - Big Data and Cognitive Computing, 2025 - mdpi.com
The complex and specialized terminology of financial language in Portuguese-speaking
markets creates significant challenges for natural language processing (NLP) applications …

Impact of subword pooling strategy on cross-lingual event detection

S Agarwal, S Fincke, C Jenkins, S Miller… - arXiv preprint arXiv …, 2023 - arxiv.org
Pre-trained multilingual language models (e.g., mBERT, XLM-RoBERTa) have significantly
advanced the state-of-the-art for zero-shot cross-lingual information extraction. These …