Automatic language identification in texts: A survey

T Jauhiainen, M Lui, M Zampieri, T Baldwin… - Journal of Artificial …, 2019 - jair.org
Language identification ("LI") is the problem of determining the natural language that a
document or part thereof is written in. Automatic LI has been extensively researched for over …

Deep models for Arabic dialect identification on benchmarked data

M Elaraby, M Abdul-Mageed - … of the Fifth Workshop on NLP for …, 2018 - aclanthology.org
The Arabic Online Commentary (AOC) (Zaidan and Callison-Burch, 2011) is a large-scale
repository of Arabic dialects with manual labels for 4 varieties of the language. Existing …

When sparse traditional models outperform dense neural networks: the curious case of discriminating between similar languages

M Medvedeva, M Kroon, B Plank - … of the Fourth Workshop on NLP …, 2017 - aclanthology.org
We present the results of our participation in the VarDial 4 shared task on discriminating
closely related languages. Our submission includes simple traditional models using linear …

A fast, compact, accurate model for language identification of codemixed text

Y Zhang, J Riesa, D Gillick, A Bakalov… - arXiv preprint arXiv …, 2018 - arxiv.org
We address fine-grained multilingual language identification: providing a language code for
every token in a sentence, including codemixed text containing multiple languages. Such …

Comparing approaches to Dravidian language identification

T Jauhiainen, T Ranasinghe, M Zampieri - arXiv preprint arXiv:2103.05552, 2021 - arxiv.org
This paper describes the submissions by team HWR to the Dravidian Language
Identification (DLI) shared task organized at VarDial 2021 workshop. The DLI training set …

A reproduction of Apple's bi-directional LSTM models for language identification in short strings

M Toftrup, SA Sørensen, MR Ciosici… - arXiv preprint arXiv …, 2021 - arxiv.org
Language Identification is the task of identifying a document's language. For applications
like automatic spell checker selection, language identification must use very short strings …

A dataset and classifier for recognizing social media English

SL Blodgett, J Wei, B O'Connor - … of the 3rd Workshop on Noisy …, 2017 - aclanthology.org
While language identification works well on standard texts, it performs much worse on social
media language, in particular dialectal language—even for English. First, to support work on …

Code-switched language identification is harder than you think

L Burchell, A Birch, RP Thompson… - arXiv preprint arXiv …, 2024 - arxiv.org
Code switching (CS) is a very common phenomenon in written and spoken communication
but one that is handled poorly by many natural language processing applications. Looking …

Geographically-informed language identification

J Dunn, L Edwards-Brown - arXiv preprint arXiv:2403.09892, 2024 - arxiv.org
This paper develops an approach to language identification in which the set of languages
considered by the model depends on the geographic origin of the text in question. Given that …

Language identification for Austronesian languages

J Dunn, W Nijhof - arXiv preprint arXiv:2206.04327, 2022 - arxiv.org
This paper provides language identification models for low- and under-resourced languages
in the Pacific region with a focus on previously unavailable Austronesian languages …