Beyond English-centric multilingual machine translation
Existing work in translation demonstrated the potential of massively multilingual machine
translation by training a single model able to translate between any pair of languages …
WikiMatrix: Mining 135M parallel sentences in 1620 language pairs from Wikipedia
We present an approach based on multilingual sentence embeddings to automatically
extract parallel sentences from the content of Wikipedia articles in 85 languages, including …
ParaCrawl: Web-scale acquisition of parallel corpora
M Bañón, P Chen, B Haddow, K Heafield, H Hoang… - 2020 - strathprints.strath.ac.uk
We report on methods to create the largest publicly available parallel corpora by crawling
the web, using open source software. We empirically compare alternative methods and …
Detecting hallucinated content in conditional neural sequence generation
Neural sequence models can generate highly fluent sentences, but recent studies have also
shown that they are also prone to hallucinate additional content not supported by the input …
The FLORES evaluation datasets for low-resource machine translation: Nepali-English and Sinhala-English
For machine translation, a vast majority of language pairs in the world are considered low-
resource because they have little parallel data available. Besides the technical challenges …
CCMatrix: Mining billions of high-quality parallel sentences on the web
We show that margin-based bitext mining in a multilingual sentence space can be applied to
monolingual corpora of billions of sentences. We are using ten snapshots of a curated …
Automatic machine translation evaluation in many languages via zero-shot paraphrasing
We frame the task of machine translation evaluation as one of scoring machine translation
output with a sequence-to-sequence paraphraser, conditioned on a human reference. We …
CCAligned: A massive collection of cross-lingual web-document pairs
Cross-lingual document alignment aims to identify pairs of documents in two distinct
languages that are of comparable content or translations of each other. In this paper, we …
Vecalign: Improved sentence alignment in linear time and space
We introduce Vecalign, a novel bilingual sentence alignment method which is linear in time
and space with respect to the number of sentences being aligned and which requires only …
Not low-resource anymore: Aligner ensembling, batch filtering, and new datasets for Bengali-English machine translation
Despite being the seventh most widely spoken language in the world, Bengali has received
much less attention in machine translation literature due to being low in resources. Most …