Beyond English-centric multilingual machine translation

A Fan, S Bhosale, H Schwenk, Z Ma, A El-Kishky… - Journal of Machine …, 2021 - jmlr.org
Existing work in translation demonstrated the potential of massively multilingual machine
translation by training a single model able to translate between any pair of languages …

WikiMatrix: Mining 135M parallel sentences in 1620 language pairs from Wikipedia

H Schwenk, V Chaudhary, S Sun, H Gong… - arXiv preprint arXiv …, 2019 - arxiv.org
We present an approach based on multilingual sentence embeddings to automatically
extract parallel sentences from the content of Wikipedia articles in 85 languages, including …

ParaCrawl: Web-scale acquisition of parallel corpora

M Bañón, P Chen, B Haddow, K Heafield, H Hoang… - 2020 - strathprints.strath.ac.uk
We report on methods to create the largest publicly available parallel corpora by crawling
the web, using open source software. We empirically compare alternative methods and …

Detecting hallucinated content in conditional neural sequence generation

C Zhou, G Neubig, J Gu, M Diab, P Guzman… - arXiv preprint arXiv …, 2020 - arxiv.org
Neural sequence models can generate highly fluent sentences, but recent studies have
shown that they are also prone to hallucinate additional content not supported by the input …

The FLoRes evaluation datasets for low-resource machine translation: Nepali-English and Sinhala-English

F Guzmán, PJ Chen, M Ott, J Pino, G Lample… - arXiv preprint arXiv …, 2019 - arxiv.org
For machine translation, a vast majority of language pairs in the world are considered
low-resource because they have little parallel data available. Besides the technical challenges …

CCMatrix: Mining billions of high-quality parallel sentences on the web

H Schwenk, G Wenzek, S Edunov, E Grave… - arXiv preprint arXiv …, 2019 - arxiv.org
We show that margin-based bitext mining in a multilingual sentence space can be applied to
monolingual corpora of billions of sentences. We are using ten snapshots of a curated …

Automatic machine translation evaluation in many languages via zero-shot paraphrasing

B Thompson, M Post - arXiv preprint arXiv:2004.14564, 2020 - arxiv.org
We frame the task of machine translation evaluation as one of scoring machine translation
output with a sequence-to-sequence paraphraser, conditioned on a human reference. We …

CCAligned: A massive collection of cross-lingual web-document pairs

A El-Kishky, V Chaudhary, F Guzmán… - arXiv preprint arXiv …, 2019 - arxiv.org
Cross-lingual document alignment aims to identify pairs of documents in two distinct
languages that are of comparable content or translations of each other. In this paper, we …

Vecalign: Improved sentence alignment in linear time and space

B Thompson, P Koehn - Proceedings of the 2019 conference on …, 2019 - aclanthology.org
We introduce Vecalign, a novel bilingual sentence alignment method which is linear in time
and space with respect to the number of sentences being aligned and which requires only …

Not low-resource anymore: Aligner ensembling, batch filtering, and new datasets for Bengali-English machine translation

T Hasan, A Bhattacharjee, K Samin, M Hasan… - arXiv preprint arXiv …, 2020 - arxiv.org
Despite being the seventh most widely spoken language in the world, Bengali has received
much less attention in machine translation literature due to being low in resources. Most …