ChatGPT beyond English: Towards a comprehensive evaluation of large language models in multilingual learning

VD Lai, NT Ngo, APB Veyseh, H Man… - arXiv preprint arXiv …, 2023 - arxiv.org
Over the last few years, large language models (LLMs) have emerged as the most important
breakthroughs in natural language processing (NLP) that fundamentally transform research …

Universal Dependencies

MC De Marneffe, CD Manning, J Nivre… - Computational …, 2021 - direct.mit.edu
Universal Dependencies (UD) is a framework for morphosyntactic annotation of human
language, which to date has been used to create treebanks for more than 100 languages. In …

COMET: A neural framework for MT evaluation

R Rei, C Stewart, AC Farinha, A Lavie - arXiv preprint arXiv:2009.09025, 2020 - arxiv.org
We present COMET, a neural framework for training multilingual machine translation
evaluation models, which obtains new state-of-the-art levels of correlation with human …

A primer in BERTology: What we know about how BERT works

A Rogers, O Kovaleva, A Rumshisky - Transactions of the Association …, 2021 - direct.mit.edu
Transformer-based models have pushed the state of the art in many areas of NLP, but our
understanding of what is behind their success is still limited. This paper is the first survey of …

CamemBERT: a tasty French language model

L Martin, B Muller, PJO Suárez, Y Dupont… - arXiv preprint arXiv …, 2019 - arxiv.org
Pretrained language models are now ubiquitous in Natural Language Processing. Despite
their success, most available models have either been trained on English data or on the …

From zero to hero: On the limitations of zero-shot cross-lingual transfer with multilingual transformers

A Lauscher, V Ravishankar, I Vulić… - arXiv preprint arXiv …, 2020 - arxiv.org
Massively multilingual transformers pretrained with language modeling objectives (e.g.,
mBERT, XLM-R) have become the de facto default transfer paradigm for zero-shot cross …

IndoLEM and IndoBERT: A benchmark dataset and pre-trained language model for Indonesian NLP

F Koto, A Rahimi, JH Lau, T Baldwin - arXiv preprint arXiv:2011.00677, 2020 - arxiv.org
Although the Indonesian language is spoken by almost 200 million people and is the 10th
most spoken language in the world, it is under-represented in NLP research. Previous work …

Multilingual is not enough: BERT for Finnish

A Virtanen, J Kanerva, R Ilo, J Luoma… - arXiv preprint arXiv …, 2019 - arxiv.org
Deep learning-based language models pretrained on large unannotated text corpora have
been demonstrated to allow efficient transfer learning for natural language processing, with …

Okapi: Instruction-tuned large language models in multiple languages with reinforcement learning from human feedback

VD Lai, C Van Nguyen, NT Ngo, T Nguyen… - arXiv preprint arXiv …, 2023 - arxiv.org
A key technology in the development of large language models (LLMs) is instruction
tuning, which helps align the models' responses with human expectations to realize impressive …

Systematic inequalities in language technology performance across the world's languages

D Blasi, A Anastasopoulos, G Neubig - arXiv preprint arXiv:2110.06733, 2021 - arxiv.org
Natural language processing (NLP) systems have become a central technology in
communication, education, medicine, artificial intelligence, and many other domains of …