Survey of low-resource machine translation

B Haddow, R Bawden, AVM Barone, J Helcl… - Computational …, 2022 - direct.mit.edu
We present a survey covering the state of the art in low-resource machine translation (MT)
research. There are currently around 7,000 languages spoken in the world and almost all …

Having beer after prayer? Measuring cultural bias in large language models

T Naous, MJ Ryan, A Ritter, W Xu - arXiv preprint arXiv:2305.14456, 2023 - arxiv.org
As the reach of large language models (LMs) expands globally, their ability to cater to
diverse cultural contexts becomes crucial. Despite advancements in multilingual …

BLOOM+1: Adding language support to BLOOM for zero-shot prompting

ZX Yong, H Schoelkopf, N Muennighoff, AF Aji… - arXiv preprint arXiv …, 2022 - arxiv.org
The BLOOM model is a large publicly available multilingual language model, but its
pretraining was limited to 46 languages. To extend the benefits of BLOOM to other …

A primer on pretrained multilingual language models

S Doddapaneni, G Ramesh, MM Khapra… - arXiv preprint arXiv …, 2021 - arxiv.org
Multilingual Language Models (MLLMs) such as mBERT, XLM, XLM-R, etc. have
emerged as a viable option for bringing the power of pretraining to a large number of …

AmericasNLI: Evaluating zero-shot natural language understanding of pretrained multilingual models in truly low-resource languages

A Ebrahimi, M Mager, A Oncevay, V Chaudhary… - arXiv preprint arXiv …, 2021 - arxiv.org
Pretrained multilingual models are able to perform cross-lingual transfer in a zero-shot
setting, even for languages unseen during pretraining. However, prior work evaluating …

Expanding pretrained models to thousands more languages via lexicon-based adaptation

X Wang, S Ruder, G Neubig - arXiv preprint arXiv:2203.09435, 2022 - arxiv.org
The performance of multilingual pretrained models is highly dependent on the availability of
monolingual or parallel text present in a target language. Thus, the majority of the world's …

Do all languages cost the same? Tokenization in the era of commercial language models

O Ahia, S Kumar, H Gonen, J Kasai… - arXiv preprint arXiv …, 2023 - arxiv.org
Language models have graduated from being research prototypes to commercialized
products offered as web APIs, and recent works have highlighted the multilingual …

How to adapt your pretrained multilingual model to 1600 languages

A Ebrahimi, K Kann - arXiv preprint arXiv:2106.02124, 2021 - arxiv.org
Pretrained multilingual models (PMMs) enable zero-shot learning via cross-lingual transfer,
performing best for languages seen during pretraining. While methods exist to improve …

Some languages are more equal than others: Probing deeper into the linguistic disparity in the NLP world

S Ranathunga, N De Silva - arXiv preprint arXiv:2210.08523, 2022 - arxiv.org
Linguistic disparity in the NLP world is a problem that has been widely acknowledged
recently. However, different facets of this problem, or the reasons behind this disparity are …

MaLA-500: Massive language adaptation of large language models

P Lin, S Ji, J Tiedemann, AFT Martins… - arXiv preprint arXiv …, 2024 - arxiv.org
Large language models have advanced the state of the art in natural language processing.
However, their predominant design for English or a limited set of languages creates a …