Beyond English-centric multilingual machine translation

A Fan, S Bhosale, H Schwenk, Z Ma, A El-Kishky… - Journal of Machine …, 2021 - jmlr.org
Existing work in translation demonstrated the potential of massively multilingual machine
translation by training a single model able to translate between any pair of languages …

Probabilistic topic modeling in multilingual settings: An overview of its methodology and applications

I Vulić, W De Smet, J Tang, MF Moens - Information Processing & …, 2015 - Elsevier
Probabilistic topic models are unsupervised generative models which model document
content as a two-step generation process, that is, documents are observed as mixtures of …

WikiMatrix: Mining 135M parallel sentences in 1620 language pairs from Wikipedia

H Schwenk, V Chaudhary, S Sun, H Gong… - arXiv preprint arXiv …, 2019 - arxiv.org
We present an approach based on multilingual sentence embeddings to automatically
extract parallel sentences from the content of Wikipedia articles in 85 languages, including …

ParaCrawl: Web-scale acquisition of parallel corpora

M Bañón, P Chen, B Haddow, K Heafield, H Hoang… - 2020 - strathprints.strath.ac.uk
We report on methods to create the largest publicly available parallel corpora by crawling
the web, using open source software. We empirically compare alternative methods and …

Bitext alignment

J Tiedemann - 2011 - books.google.com
This book provides an overview of various techniques for the alignment of bitexts. It
describes general concepts and strategies that can be applied to map corresponding parts …

A survey of domain adaptation for machine translation

C Chu, R Wang - Journal of Information Processing, 2020 - jstage.jst.go.jp
Neural machine translation (NMT) is a deep learning based approach for machine
translation, which outperforms traditional statistical machine translation (SMT) and yields the …

CCMatrix: Mining billions of high-quality parallel sentences on the web

H Schwenk, G Wenzek, S Edunov, E Grave… - arXiv preprint arXiv …, 2019 - arxiv.org
We show that margin-based bitext mining in a multilingual sentence space can be applied to
monolingual corpora of billions of sentences. We are using ten snapshots of a curated …

Margin-based parallel corpus mining with multilingual sentence embeddings

M Artetxe, H Schwenk - arXiv preprint arXiv:1811.01136, 2018 - arxiv.org
Machine translation is highly sensitive to the size and quality of the training data, which has
led to an increasing interest in collecting and filtering large parallel corpora. In this paper, we …

Crowdsourcing and online collaborative translations

MA Jiménez-Crespo - 2017 - torrossa.com
We control the world basically because we are the only animals that can cooperate flexibly
in very large numbers […] This is something very unique to us, perhaps the most unique …

Improving multilingual sentence embedding using bi-directional dual encoder with additive margin softmax

Y Yang, GH Abrego, S Yuan, M Guo, Q Shen… - arXiv preprint arXiv …, 2019 - arxiv.org
In this paper, we present an approach to learn multilingual sentence embeddings using a
bi-directional dual-encoder with additive margin softmax. The embeddings are able to achieve …