Pre-trained language models for text generation: A survey

J Li, T Tang, WX Zhao, JY Nie, JR Wen - ACM Computing Surveys, 2024 - dl.acm.org
Text Generation aims to produce plausible and readable text in human language from input
data. The resurgence of deep learning has greatly advanced this field, in particular, with the …

NusaCrowd: Open source initiative for Indonesian NLP resources

S Cahyawijaya, H Lovenia, AF Aji, GI Winata… - arxiv preprint arxiv …, 2022 - arxiv.org
We present NusaCrowd, a collaborative initiative to collect and unify existing resources for
Indonesian languages, including opening access to previously non-public resources …

Towards robust automated math problem solving: a survey of statistical and deep learning approaches

A Saraf, P Kamat, S Gite, S Kumar, K Kotecha - Evolutionary Intelligence, 2024 - Springer
Automated mathematical problem-solving represents a unique intersection of natural
language processing (NLP) and mathematical reasoning, posing significant challenges in …

Naamapadam: A large-scale named entity annotated data for Indic languages

A Mhaske, H Kedia, S Doddapaneni… - arxiv preprint arxiv …, 2022 - arxiv.org
We present Naamapadam, the largest publicly available Named Entity Recognition (NER)
dataset for the 11 major Indian languages from two language families. The dataset contains …

Airavata: Introducing Hindi instruction-tuned LLM

J Gala, T Jayakumar, JA Husain, MSUR Khan… - arxiv preprint arxiv …, 2024 - arxiv.org
We announce the initial release of "Airavata," an instruction-tuned LLM for Hindi. Airavata
was created by fine-tuning OpenHathi with diverse, instruction-tuning Hindi datasets to make …

mEdIT: Multilingual text editing via instruction tuning

V Raheja, D Alikaniotis, V Kulkarni, B Alhafni… - arxiv preprint arxiv …, 2024 - arxiv.org
We introduce mEdIT, a multi-lingual extension to CoEdIT--the recent state-of-the-art text
editing models for writing assistance. mEdIT models are trained by fine-tuning multi-lingual …

Dolphin: A challenging and diverse benchmark for Arabic NLG

A Elmadany, A El-Shangiti… - Findings of the …, 2023 - aclanthology.org
We present Dolphin, a novel benchmark that addresses the need for a natural language
generation (NLG) evaluation framework dedicated to the wide collection of Arabic …

PMIndiaSum: Multilingual and cross-lingual headline summarization for languages in India

A Urlana, P Chen, Z Zhao, SB Cohen… - arxiv preprint arxiv …, 2023 - arxiv.org
This paper introduces PMIndiaSum, a multilingual and massively parallel summarization
corpus focused on languages in India. Our corpus provides a training and testing ground for …

Vārta: A Large-Scale Headline-Generation Dataset for Indic Languages

R Aralikatte, Z Cheng, S Doddapaneni… - arxiv preprint arxiv …, 2023 - arxiv.org
We present Vārta, a large-scale multilingual dataset for headline generation in Indic
languages. This dataset includes 41.8 million news articles in 14 different Indic languages …

Building pre-train LLM dataset for the Indic languages: a case study on Hindi

S Parida, S Panwar, K Lata, S Mishra… - arxiv preprint arxiv …, 2024 - arxiv.org
Large language models (LLMs) demonstrated transformative capabilities in many
applications that require automatically generating responses based on human instruction …