Four approaches to low-resource multilingual NMT: The Helsinki submission to the AmericasNLP 2023 shared task

O De Gibert, R Vázquez, M Aulamo… - Proceedings of the …, 2023 - aclanthology.org
The Helsinki-NLP team participated in the AmericasNLP 2023 Shared Task with 6
submissions for all 11 language pairs arising from 4 different multilingual systems. We …

FastSpell: the LangId Magic Spell

M Bañón, J Zaragoza-Bernabeu… - arXiv preprint arXiv …, 2024 - arxiv.org
Language identification is a crucial component in the automated production of language
resources, particularly in multilingual and big data contexts. However, commonly used …

LIMIT: Language identification, misidentification, and translation using hierarchical models in 350+ languages

M Agarwal, MMI Alam, A Anastasopoulos - arXiv preprint arXiv …, 2023 - arxiv.org
Knowing the language of an input text/audio is a necessary first step for using almost every
NLP tool such as taggers, parsers, or translation systems. Language identification is a well …

Geographically-informed language identification

J Dunn, L Edwards-Brown - arXiv preprint arXiv:2403.09892, 2024 - arxiv.org
This paper develops an approach to language identification in which the set of languages
considered by the model depends on the geographic origin of the text in question. Given that …

GlotCC: An Open Broad-Coverage CommonCrawl Corpus and Pipeline for Minority Languages

AH Kargaran, F Yvon, H Schütze - arXiv preprint arXiv:2410.23825, 2024 - arxiv.org
The need for large text corpora has increased with the advent of pretrained language
models and, in particular, the discovery of scaling laws for these models. Most available …

Transliteration Model for Egyptian Words

H Jauhiainen, T Jauhiainen - Digital Humanities in the …, 2023 - researchportal.helsinki.fi
In this paper, we describe token-based transliteration models for Egyptian words. We
explain how we created them using an automatic alignment method we devised based on …

Tuning HeLI-OTS for Guarani-Spanish code switching analysis

T Jauhiainen, H Jauhiainen, K Lindén - … Evaluation Forum: IberLEF …, 2023 - helda.helsinki.fi
This article describes a system created for the first subtask of the GUA-SPA-Guarani-
Spanish Code Switching Analysis shared task held as part of the IberLEF 2023 evaluation …

Script-Agnostic Language Identification

M Agarwal, J Otten, A Anastasopoulos - arXiv preprint arXiv:2406.17901, 2024 - arxiv.org
Language identification is used as the first step in many data collection and crawling efforts
because it allows us to sort online text into language-specific buckets. However, many …

Multi-label Scandinavian Language Identification (SLIDE)

M Fedorova, JS Frydenberg, V Handford… - arXiv preprint arXiv …, 2025 - arxiv.org
Identifying closely related languages at sentence level is difficult, in particular because it is
often impossible to assign a sentence to a single language. In this paper, we focus on multi …

Murre24: Dialect Identification of Finnish Internet Forum Messages

O Kuparinen - Proceedings of the 2024 Joint International …, 2024 - researchportal.helsinki.fi
This paper presents Murre24, a collection of dialectal messages posted on the largest
Finnish internet forum, Suomi24. The messages posted in Finnish on the forum between …