Natural language processing for dialects of a language: A survey

A Joshi, R Dabre, D Kanojia, Z Li, H Zhan… - ACM Computing …, 2025 - dl.acm.org
State-of-the-art natural language processing (NLP) models are trained on massive training
corpora, and report a superlative performance on evaluation datasets. This survey delves …

When being unseen from mBERT is just the beginning: Handling new languages with multilingual language models

B Muller, A Anastasopoulos, B Sagot… - arXiv preprint arXiv …, 2020 - arxiv.org
Transfer learning based on pretraining language models on a large amount of raw data has
become a new norm to reach state-of-the-art performance in NLP. Still, it remains unclear …

DziriBERT: A pre-trained language model for the Algerian dialect

A Abdaoui, M Berrimi, M Oussalah… - arXiv preprint arXiv …, 2021 - arxiv.org
Pre-trained transformers are now the de facto models in Natural Language Processing given
their state-of-the-art results in many tasks and languages. However, most of the current …

A Golden Age: Conspiracy Theories' Relationship with Misinformation Outlets, News Media, and the Wider Internet

HWA Hanley, D Kumar, Z Durumeric - … of the ACM on Human-Computer …, 2023 - dl.acm.org
Do we live in a" Golden Age of Conspiracy Theories?" In the last few decades, conspiracy
theories have proliferated on the Internet with some having dangerous real-world …

Dolphin: A challenging and diverse benchmark for Arabic NLG

A Elmadany, A El-Shangiti… - Findings of the …, 2023 - aclanthology.org
We present Dolphin, a novel benchmark that addresses the need for a natural language
generation (NLG) evaluation framework dedicated to the wide collection of Arabic …

Treebanking user-generated content: a UD based overview of guidelines, corpora and unified recommendations

M Sanguinetti, C Bosco, L Cassidy, Ö Çetinoğlu… - Language Resources …, 2023 - Springer
This article presents a discussion on the main linguistic phenomena which cause difficulties
in the analysis of user-generated texts found on the web and in social media, and proposes …

Benchmarking LLaMA-3 on Arabic language generation tasks

MTI Khondaker, N Naeem, F Khan… - Proceedings of The …, 2024 - aclanthology.org
Open-sourced large language models (LLMs) have exhibited remarkable performance in a
variety of NLP tasks, often catching up with the closed-sourced LLMs like ChatGPT. Among …

Multilingual irony detection with dependency syntax and neural models

AT Cignarella, V Basile, M Sanguinetti, C Bosco… - arXiv preprint arXiv …, 2020 - arxiv.org
This paper presents an in-depth investigation of the effectiveness of dependency-based
syntactic features on the irony detection task in a multilingual perspective (English, Spanish …

Identifying code-switching in Arabizi

S Shehadi, S Wintner - Proceedings of the Seventh Arabic Natural …, 2022 - aclanthology.org
We describe a corpus of social media posts that include utterances in Arabizi, a Roman-
script rendering of Arabic, mixed with other languages, notably English, French, and Arabic …

Can Character-based Language Models Improve Downstream Task Performance in Low-Resource and Noisy Language Scenarios?

A Riabi, B Sagot, D Seddah - arXiv preprint arXiv:2110.13658, 2021 - arxiv.org
Recent impressive improvements in NLP, largely based on the success of contextual neural
language models, have been mostly demonstrated on at most a couple dozen high-resource …