Google Scholar

Four approaches to low-resource multilingual NMT: The Helsinki submission to the AmericasNLP 2023 shared task

O De Gibert, R Vázquez, M Aulamo… - Proceedings of the …, 2023 - aclanthology.org

The Helsinki-NLP team participated in the AmericasNLP 2023 Shared Task with 6
submissions for all 11 language pairs arising from 4 different multilingual systems. We …

Cited by 7 - All 3 versions

[PDF] arxiv.org

FastSpell: the LangId Magic Spell

M Bañón, J Zaragoza-Bernabeu… - arXiv preprint arXiv …, 2024 - arxiv.org

Language identification is a crucial component in the automated production of language
resources, particularly in multilingual and big data contexts. However, commonly used …

Cited by 3 - All 5 versions

[PDF] arxiv.org

LIMIT: Language identification, misidentification, and translation using hierarchical models in 350+ languages

M Agarwal, MMI Alam, A Anastasopoulos - arXiv preprint arXiv …, 2023 - arxiv.org

Knowing the language of an input text/audio is a necessary first step for using almost every
NLP tool such as taggers, parsers, or translation systems. Language identification is a well …

Cited by 4 - All 4 versions

[PDF] arxiv.org

Geographically-informed language identification

J Dunn, L Edwards-Brown - arXiv preprint arXiv:2403.09892, 2024 - arxiv.org

This paper develops an approach to language identification in which the set of languages
considered by the model depends on the geographic origin of the text in question. Given that …

Cited by 3 - All 7 versions

[PDF] arxiv.org

GlotCC: An Open Broad-Coverage CommonCrawl Corpus and Pipeline for Minority Languages

AH Kargaran, F Yvon, H Schütze - arXiv preprint arXiv:2410.23825, 2024 - arxiv.org

The need for large text corpora has increased with the advent of pretrained language
models and, in particular, the discovery of scaling laws for these models. Most available …

Cited by 1 - All 6 versions

[PDF] helsinki.fi

Transliteration Model for Egyptian Words

H Jauhiainen, T Jauhiainen - Digital Humanities in the …, 2023 - researchportal.helsinki.fi

In this paper, we describe token-based transliteration models for Egyptian words. We
explain how we created them using an automatic alignment method we devised based on …

Cited by 3 - All 9 versions

[PDF] helsinki.fi

Tuning HeLI-OTS for Guarani-Spanish code switching analysis

T Jauhiainen, H Jauhiainen, K Lindén - … Evaluation Forum: IberLEF …, 2023 - helda.helsinki.fi

This article describes a system created for the first subtask of the
GUA-SPA-Guarani-Spanish Code Switching Analysis shared task held as part of the IberLEF 2023 evaluation …

Cited by 2 - All 4 versions

[PDF] arxiv.org

Script-Agnostic Language Identification

M Agarwal, J Otten, A Anastasopoulos - arXiv preprint arXiv:2406.17901, 2024 - arxiv.org

Language identification is used as the first step in many data collection and crawling efforts
because it allows us to sort online text into language-specific buckets. However, many …

All 3 versions

[PDF] arxiv.org

Multi-label Scandinavian Language Identification (SLIDE)

M Fedorova, JS Frydenberg, V Handford… - arXiv preprint arXiv …, 2025 - arxiv.org

Identifying closely related languages at sentence level is difficult, in particular because it is
often impossible to assign a sentence to a single language. In this paper, we focus on multi …

[PDF] helsinki.fi

Murre24: Dialect Identification of Finnish Internet Forum Messages

O Kuparinen - Proceedings of the 2024 Joint International …, 2024 - researchportal.helsinki.fi

This paper presents Murre24, a collection of dialectal messages posted on the largest
Finnish internet forum, Suomi24. The messages posted in Finnish on the forum between …

Cited by 1 - All 8 versions

HeLI-OTS, off-the-shelf language identifier for text