MADLAD-400: A multilingual and document-level large audited dataset

S Kudugunta, I Caswell, B Zhang… - Advances in …, 2024 - proceedings.neurips.cc
We introduce MADLAD-400, a manually audited, general domain 3T token monolingual
dataset based on CommonCrawl, spanning 419 languages. We discuss the limitations …

Findings of the 2019 conference on machine translation (WMT19)

L Barrault, O Bojar, MR Costa-Jussa, C Federmann… - 2019 - zora.uzh.ch
This paper presents the results of the premier shared task organized alongside the
Conference on Machine Translation (WMT) 2019. Participants were asked to build machine …

Findings of the 2021 conference on machine translation (WMT21)

F Akhbardeh, A Arkhangorodsky, M Biesialska… - Proceedings of the sixth …, 2021 - cris.fbk.eu
This paper presents the results of the news translation task, the multilingual low-resource
translation for Indo-European languages, the triangular translation task, and the automatic …

Building machine translation systems for the next thousand languages

A Bapna, I Caswell, J Kreutzer, O Firat… - arXiv preprint arXiv …, 2022 - arxiv.org
In this paper we share findings from our effort to build practical machine translation (MT)
systems capable of translating across over one thousand languages. We describe results in …

When is multilinguality a curse? Language modeling for 250 high- and low-resource languages

TA Chang, C Arnett, Z Tu, BK Bergen - arXiv preprint arXiv:2311.09205, 2023 - arxiv.org
Multilingual language models are widely used to extend NLP systems to low-resource
languages. However, concrete evidence for the effects of multilinguality on language …

A unified approach to sentence segmentation of punctuated text in many languages

R Wicks, M Post - Proceedings of the 59th Annual Meeting of the …, 2021 - aclanthology.org
The sentence is a fundamental unit of text processing. Yet sentences in the wild are
commonly encountered not in isolation, but unsegmented within larger paragraphs and …

“Sewing Is Part of Our Tradition”: a case study of sewing as a strategy for arts-based inquiry in health research with Inuit women

LJ Brubacher, CE Dewey, N Tatty… - Qualitative Health …, 2021 - journals.sagepub.com
In this article, we present a case study of sewing as a strategy for arts-based inquiry in health
research, situated within a broader project that highlighted Nunavut Inuit women's childbirth …

Neural polysynthetic language modelling

L Schwartz, F Tyers, L Levin, C Kirov, P Littell… - arXiv preprint arXiv …, 2020 - arxiv.org
Research in natural language processing commonly assumes that approaches that work
well for English and other widely-used languages are "language agnostic". In high …

Neural machine translation for the Indigenous languages of the Americas: An introduction

M Mager, R Bhatnagar, G Neubig, NT Vu… - arXiv preprint arXiv …, 2023 - arxiv.org
Neural models have drastically advanced state of the art for machine translation (MT)
between high-resource languages. Traditionally, these models rely on large amounts of …

Challenges and perspectives for Innu-Aimun within Indigenous language technologies

A Cadotte, T Le Ngoc, M Boivin… - Proceedings of the Fifth …, 2022 - aclanthology.org
Innu-Aimun is an Algonquian language spoken in Eastern Canada. It is the
language of the Innu, an Indigenous people that now lives for the most part in a dozen …