Aya model: An instruction finetuned open-access multilingual language model

A Üstün, V Aryabumi, ZX Yong, WY Ko… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent breakthroughs in large language models (LLMs) have centered around a handful of
data-rich languages. What does it take to broaden access to breakthroughs beyond first …

Natural language understanding of Devanagari script languages: Language identification, hate speech and its target detection

S Thapa, K Rauniyar, FA Jafri, S Adhikari… - Proceedings of the …, 2025 - aclanthology.org
The growing use of Devanagari-script languages such as Hindi, Nepali, Marathi, Sanskrit,
and Bhojpuri on social media presents unique challenges for natural language …

Aya dataset: An open-access collection for multilingual instruction tuning

S Singh, F Vargus, D Dsouza, BF Karlsson… - arXiv preprint arXiv …, 2024 - arxiv.org
Datasets are foundational to many breakthroughs in modern artificial intelligence. Many
recent achievements in the space of natural language processing (NLP) can be attributed to …

MC2: Towards transparent and culturally-aware NLP for minority languages in China

C Zhang, M Tao, Q Huang, J Lin, Z Chen… - Proceedings of the …, 2024 - aclanthology.org
Current large language models demonstrate deficiencies in understanding low-resource
languages, particularly the minority languages in China. This limitation stems from the …

Bhasa: A holistic Southeast Asian linguistic and cultural evaluation suite for large language models

WQ Leong, JG Ngui, Y Susanto, H Rengarajan… - arXiv preprint arXiv …, 2023 - arxiv.org
The rapid development of Large Language Models (LLMs) and the emergence of novel
abilities with scale have necessitated the construction of holistic, diverse and challenging …

Akal Badi ya Bias: An Exploratory Study of Gender Bias in Hindi Language Technology

R Hada, S Husain, V Gumma, H Diddee… - The 2024 ACM …, 2024 - dl.acm.org
Existing research in measuring and mitigating gender bias predominantly centers on
English, overlooking the intricate challenges posed by non-English languages and the …

Airavata: Introducing Hindi instruction-tuned LLM

J Gala, T Jayakumar, JA Husain, MSUR Khan… - arXiv preprint arXiv …, 2024 - arxiv.org
We announce the initial release of "Airavata," an instruction-tuned LLM for Hindi. Airavata
was created by fine-tuning OpenHathi with diverse, instruction-tuning Hindi datasets to make …

OffensEval 2023: Offensive language identification in the age of Large Language Models

M Zampieri, S Rosenthal, P Nakov… - Natural Language …, 2023 - cambridge.org
The OffensEval shared tasks organized as part of SemEval-2019–2020 were very popular,
attracting over 1300 participating teams. The two editions of the shared task helped advance …

Too late to train, too early to use? A study on necessity and viability of low-resource Bengali LLMs

T Mahfuz, SK Dey, R Naswan, H Adil… - arXiv preprint arXiv …, 2024 - arxiv.org
Each new generation of English-oriented Large Language Models (LLMs) exhibits
enhanced cross-lingual transfer capabilities and significantly outperforms older LLMs on low …

Vacaspati: A diverse corpus of Bangla literature

P Bhattacharyya, J Mondal, S Maji… - arXiv preprint arXiv …, 2023 - arxiv.org
Bangla (or Bengali) is the fifth most spoken language globally; yet, the state-of-the-art NLP in
Bangla is lagging for even simple tasks such as lemmatization, POS tagging, etc. This is …