MuRIL: Multilingual representations for Indian languages
India is a multilingual society with 1369 rationalized languages and dialects being spoken
across the country (INDIA, 2011). Of these, the 22 scheduled languages have a staggering …
A systematic review on language identification of code-mixed text: techniques, data availability, challenges, and framework development
The mix of native language with other languages (code-mixing) in social media has posed a
severe challenge for language identification (LID) systems. It has encouraged research on …
Automatic language identification in texts: A survey
Language identification ("LI") is the problem of determining the natural language that a
document or part thereof is written in. Automatic LI has been extensively researched for over …
GLUECoS: An evaluation benchmark for code-switched NLP
Code-switching is the use of more than one language in the same conversation or utterance.
Recently, multilingual contextual embedding models, trained on multiple monolingual …
Language modeling for code-mixing: The role of linguistic theory based synthetic data
Training language models for Code-mixed (CM) language is known to be a difficult problem
because of lack of data compounded by the increased confusability due to the presence of …
A survey of code-switched speech and language processing
Code-switching, the alternation of languages within a conversation or utterance, is a
common communicative phenomenon that occurs in multilingual communities across the …
Learning from tweets: opportunities and challenges to inform policy making during dengue epidemic
Social media platforms are widely used by people to report, access, and share information
during outbreaks and epidemics. Although government agencies and healthcare institutions …
Incorporating dialectal variability for socially equitable language identification
Language identification (LID) is a critical first step for processing multilingual text.
Yet most LID systems are not designed to handle the linguistic diversity of global platforms …
BERTologiCoMix: How does code-mixing interact with multilingual BERT?
Models such as mBERT and XLMR have shown success in solving Code-Mixed
NLP tasks even though they were not exposed to such text during pretraining. Code-Mixed …
HinGE: A dataset for generation and evaluation of code-mixed Hinglish text
Text generation is a highly active area of research in the computational linguistic community.
The evaluation of the generated text is a challenging task and multiple theories and metrics …