IndicTrans2: Towards high-quality and accessible machine translation models for all 22 scheduled Indian languages

J Gala, PA Chitale, R AK, V Gumma… - arXiv preprint arXiv …, 2023 - arxiv.org
India has a rich linguistic landscape with languages from 4 major language families spoken
by over a billion people. 22 of these languages are listed in the Constitution of India …

Naamapadam: A large-scale named entity annotated data for Indic languages

A Mhaske, H Kedia, S Doddapaneni… - arXiv preprint arXiv …, 2022 - arxiv.org
We present Naamapadam, the largest publicly available Named Entity Recognition (NER)
dataset for the 11 major Indian languages from two language families. The dataset contains …

A survey on NLP resources, tools, and techniques for Marathi language processing

P Lahoti, N Mittal, G Singh - ACM Transactions on Asian and Low …, 2022 - dl.acm.org
Natural Language Processing (NLP) has been in practice for the past couple of decades,
and extensive work has been done for the Western languages, particularly the English …

User-aware multilingual abusive content detection in social media

MZU Rehman, S Mehta, K Singh, K Kaushik… - Information Processing & …, 2023 - Elsevier
Despite growing efforts to halt distasteful content on social media, multilingualism has added
a new dimension to this problem. The scarcity of resources makes the challenge even …

IndicLLMSuite: a blueprint for creating pre-training and fine-tuning datasets for Indian languages

MSUR Khan, P Mehta, A Sankar… - arXiv preprint arXiv …, 2024 - arxiv.org
Despite the considerable advancements in English LLMs, the progress in building
comparable models for other languages has been hindered due to the scarcity of tailored …

Towards building text-to-speech systems for the next billion users

GK Kumar, SV Praveen, P Kumar… - ICASSP 2023-2023 …, 2023 - ieeexplore.ieee.org
Deep learning based text-to-speech (TTS) systems have been evolving rapidly with
advances in model architectures, training methodologies, and generalization across …

Bhasha-Abhijnaanam: Native-script and romanized language identification for 22 Indic languages

Y Madhani, MM Khapra, A Kunchukuttan - arXiv preprint arXiv:2305.15814, 2023 - arxiv.org
We create publicly available language identification (LID) datasets and models in all 22
Indian languages listed in the Indian constitution in both native-script and romanized text …

Romanization-based large-scale adaptation of multilingual language models

S Purkayastha, S Ruder, J Pfeiffer, I Gurevych… - arXiv preprint arXiv …, 2023 - arxiv.org
Large multilingual pretrained language models (mPLMs) have become the de facto state of
the art for cross-lingual transfer in NLP. However, their large-scale deployment to many …

Context-aware transliteration of romanized South Asian languages

C Kirov, C Johny, A Katanova, A Gutkin… - Computational …, 2024 - direct.mit.edu
While most transliteration research is focused on single tokens such as named entities—for
example, transliteration of અમદાવાદ from the Gujarati script to the Latin script “Ahmedabad” …

Improving pretraining techniques for code-switched NLP

R Das, S Ranjan, S Pathak, P Jyothi - Proceedings of the 61st …, 2023 - aclanthology.org
Pretrained models are a mainstay in modern NLP applications. Pretraining requires access
to large volumes of unlabeled text. While monolingual text is readily available for many of …