MADLAD-400: A multilingual and document-level large audited dataset

S Kudugunta, I Caswell, B Zhang… - Advances in …, 2023 - proceedings.neurips.cc
We introduce MADLAD-400, a manually audited, general domain 3T token monolingual
dataset based on CommonCrawl, spanning 419 languages. We discuss the limitations …

Findings of the 2019 conference on machine translation (WMT19)

L Barrault, O Bojar, MR Costa-Jussa, C Federmann… - 2019 - zora.uzh.ch
This paper presents the results of the premier shared task organized alongside the
Conference on Machine Translation (WMT) 2019. Participants were asked to build machine …

Findings of the 2021 conference on machine translation (WMT21)

F Akhbardeh, A Arkhangorodsky, M Biesialska… - Proceedings of the sixth …, 2021 - cris.fbk.eu
This paper presents the results of the news translation task, the multilingual low-resource
translation for Indo-European languages, the triangular translation task, and the automatic …

Neural machine translation with byte-level subwords

C Wang, K Cho, J Gu - Proceedings of the AAAI conference on artificial …, 2020 - aaai.org
Almost all existing machine translation models are built on top of character-based
vocabularies: characters, subwords or words. Rare characters from noisy text or character …

Data augmentation using back-translation for context-aware neural machine translation

A Sugiyama, N Yoshinaga - … of the fourth workshop on discourse …, 2019 - aclanthology.org
A single sentence does not always convey information that is enough to translate it into other
languages. Some target languages need to add or specialize words that are omitted or …

MTNT: A testbed for machine translation of noisy text

P Michel, G Neubig - arXiv preprint arXiv:1809.00388, 2018 - arxiv.org
Noisy or non-standard input text can cause disastrous mistranslations in most modern
Machine Translation (MT) systems, and there has been growing research interest in creating …

Machine translation and its evaluation: a study

SK Mondal, H Zhang, HMD Kabir, K Ni… - Artificial Intelligence …, 2023 - Springer
Machine translation (namely MT) has been one of the most popular fields in
computational linguistics and Artificial Intelligence (AI). As one of the most promising …

MURAL: multimodal, multitask retrieval across languages

A Jain, M Guo, K Srinivasan, T Chen… - arXiv preprint arXiv …, 2021 - arxiv.org
Both image-caption pairs and translation pairs provide the means to learn deep
representations of and connections between languages. We use both types of pairs in …

Findings of the first shared task on machine translation robustness

X Li, P Michel, A Anastasopoulos, Y Belinkov… - arXiv preprint arXiv …, 2019 - arxiv.org
We share the findings of the first shared task on improving robustness of Machine
Translation (MT). The task provides a testbed representing challenges facing MT models …

JParaCrawl: A large scale web-based English-Japanese parallel corpus

M Morishita, J Suzuki, M Nagata - arXiv preprint arXiv:1911.10668, 2019 - arxiv.org
Recent machine translation algorithms mainly rely on parallel corpora. However, since the
availability of parallel corpora remains limited, only some resource-rich language pairs can …