Survey of low-resource machine translation

B Haddow, R Bawden, AVM Barone, J Helcl… - Computational …, 2022 - direct.mit.edu
We present a survey covering the state of the art in low-resource machine translation (MT)
research. There are currently around 7,000 languages spoken in the world and almost all …

Cross-lingual name tagging and linking for 282 languages

X Pan, B Zhang, J May, J Nothman… - Proceedings of the …, 2017 - aclanthology.org
The ambitious goal of this work is to develop a cross-lingual name tagging and linking
framework for 282 languages that exist in Wikipedia. Given a document in any of these …

Between words and characters: A brief history of open-vocabulary modeling and tokenization in NLP

SJ Mielke, Z Alyafeai, E Salesky, C Raffel… - arXiv preprint arXiv …, 2021 - arxiv.org
What are the units of text that we want to model? From bytes to multi-word expressions, text
can be analyzed and generated at many granularities. Until recently, most natural language …

Languages through the looking glass of BPE compression

X Gutierrez-Vasques, C Bentz… - Computational …, 2023 - direct.mit.edu
Byte-pair encoding (BPE) is widely used in NLP for performing subword tokenization. It
uncovers redundant patterns for compressing the data, and hence alleviates the sparsity …

The SIGMORPHON 2022 shared task on morpheme segmentation

K Batsuren, G Bella, A Arora, V Martinović… - arXiv preprint arXiv …, 2022 - arxiv.org
The SIGMORPHON 2022 shared task on morpheme segmentation challenged systems to
decompose a word into a sequence of morphemes and covered most types of morphology …

Linguistically Motivated Vocabulary Reduction for Neural Machine Translation from Turkish to English

D Ataman, M Negri, M Turchi… - The Prague Bulletin of …, 2017 - archive.sciendo.com
The necessity of using a fixed-size word vocabulary in order to control the model complexity
in state-of-the-art neural machine translation (NMT) systems is an important bottleneck on …

Unity and disunity in evolutionary sciences: process-based analogies open common research avenues for biology and linguistics

JM List, JS Pathmanathan, P Lopez, E Bapteste - Biology Direct, 2016 - Springer
Background: For a long time biologists and linguists have been noticing surprising
similarities between the evolution of life forms and languages. Most of the proposed …

Fortification of neural morphological segmentation models for polysynthetic minimal-resource languages

K Kann, M Mager, I Meza-Ruiz, H Schütze - arXiv preprint arXiv …, 2018 - arxiv.org
Morphological segmentation for polysynthetic languages is challenging, because a word
may consist of many individual morphemes and training data can be extremely scarce …

BPE vs. morphological segmentation: A case study on machine translation of four polysynthetic languages

M Mager, A Oncevay, E Mager, K Kann… - arXiv preprint arXiv …, 2022 - arxiv.org
Morphologically-rich polysynthetic languages present a challenge for NLP systems due to
data sparsity, and a common strategy to handle this issue is to apply subword segmentation …

A corpus investigation of syntactic embedding in Pirahã

R Futrell, L Stearns, DL Everett, ST Piantadosi… - PLoS …, 2016 - journals.plos.org
The Pirahã language has been at the center of recent debates in linguistics, in large part
because it is claimed not to exhibit recursion, a purported universal of human language …