MADLAD-400: A multilingual and document-level large audited dataset
We introduce MADLAD-400, a manually audited, general domain 3T token monolingual
dataset based on CommonCrawl, spanning 419 languages. We discuss the limitations …
Findings of the 2019 conference on machine translation (WMT19)
This paper presents the results of the premier shared task organized alongside the
Conference on Machine Translation (WMT) 2019. Participants were asked to build machine …
Findings of the 2021 conference on machine translation (WMT21)
F Akhbardeh, A Arkhangorodsky, M Biesialska… - Proceedings of the sixth …, 2021 - cris.fbk.eu
This paper presents the results of the news translation task, the multilingual low-resource
translation for Indo-European languages, the triangular translation task, and the automatic …
Building machine translation systems for the next thousand languages
In this paper we share findings from our effort to build practical machine translation (MT)
systems capable of translating across over one thousand languages. We describe results in …
When is multilinguality a curse? Language modeling for 250 high- and low-resource languages
Multilingual language models are widely used to extend NLP systems to low-resource
languages. However, concrete evidence for the effects of multilinguality on language …
A unified approach to sentence segmentation of punctuated text in many languages
The sentence is a fundamental unit of text processing. Yet sentences in the wild are
commonly encountered not in isolation, but unsegmented within larger paragraphs and …
“Sewing Is Part of Our Tradition”: a case study of sewing as a strategy for arts-based inquiry in health research with Inuit women
LJ Brubacher, CE Dewey, N Tatty… - Qualitative Health …, 2021 - journals.sagepub.com
In this article, we present a case study of sewing as a strategy for arts-based inquiry in health
research, situated within a broader project that highlighted Nunavut Inuit women's childbirth …
Neural polysynthetic language modelling
Research in natural language processing commonly assumes that approaches that work
well for English and other widely-used languages are "language agnostic". In high …
Neural machine translation for the indigenous languages of the Americas: An introduction
Neural models have drastically advanced state of the art for machine translation (MT)
between high-resource languages. Traditionally, these models rely on large amounts of …
Challenges and perspectives for Innu-Aimun within Indigenous language technologies
Innu-Aimun is an Algonquian language spoken in Eastern Canada. It is the
language of the Innu, an Indigenous people that now lives for the most part in a dozen …