MuRIL: Multilingual representations for Indian languages

S Khanuja, D Bansal, S Mehtani, S Khosla… - arXiv preprint arXiv …, 2021 - arxiv.org
India is a multilingual society with 1369 rationalized languages and dialects being spoken
across the country (INDIA, 2011). Of these, the 22 scheduled languages have a staggering …

A systematic review on language identification of code-mixed text: techniques, data availability, challenges, and framework development

AF Hidayatullah, A Qazi, DTC Lai, RA Apong - IEEE Access, 2022 - ieeexplore.ieee.org
The mix of native language with other languages (code-mixing) in social media has posed a
severe challenge for language identification (LID) systems. It has encouraged research on …

Automatic language identification in texts: A survey

T Jauhiainen, M Lui, M Zampieri, T Baldwin… - Journal of Artificial …, 2019 - jair.org
Language identification ("LI") is the problem of determining the natural language that a
document or part thereof is written in. Automatic LI has been extensively researched for over …

GLUECoS: An evaluation benchmark for code-switched NLP

S Khanuja, S Dandapat, A Srinivasan… - arXiv preprint arXiv …, 2020 - arxiv.org
Code-switching is the use of more than one language in the same conversation or utterance.
Recently, multilingual contextual embedding models, trained on multiple monolingual …

Language modeling for code-mixing: The role of linguistic theory based synthetic data

A Pratapa, G Bhat, M Choudhury… - Proceedings of the …, 2018 - aclanthology.org
Training language models for Code-mixed (CM) language is known to be a difficult problem
because of lack of data compounded by the increased confusability due to the presence of …

A survey of code-switched speech and language processing

S Sitaram, KR Chandu, SK Rallabandi… - arXiv preprint arXiv …, 2019 - arxiv.org
Code-switching, the alternation of languages within a conversation or utterance, is a
common communicative phenomenon that occurs in multilingual communities across the …

Learning from tweets: opportunities and challenges to inform policy making during dengue epidemic

F Shahid, SH Ony, TR Albi, S Chellappan… - Proceedings of the …, 2020 - dl.acm.org
Social media platforms are widely used by people to report, access, and share information
during outbreaks and epidemics. Although government agencies and healthcare institutions …

Incorporating dialectal variability for socially equitable language identification

D Jurgens, Y Tsvetkov, D Jurafsky - … of the 55th Annual Meeting of …, 2017 - aclanthology.org
Language identification (LID) is a critical first step for processing multilingual text.
Yet most LID systems are not designed to handle the linguistic diversity of global platforms …

BERTologiCoMix: How does code-mixing interact with multilingual BERT?

S Santy, A Srinivasan, M Choudhury - Proceedings of the Second …, 2021 - aclanthology.org
Models such as mBERT and XLMR have shown success in solving Code-Mixed
NLP tasks even though they were not exposed to such text during pretraining. Code-Mixed …

HinGE: A dataset for generation and evaluation of code-mixed Hinglish text

V Srivastava, M Singh - arXiv preprint arXiv:2107.03760, 2021 - arxiv.org
Text generation is a highly active area of research in the computational linguistic community.
The evaluation of the generated text is a challenging task and multiple theories and metrics …