MTEB: Massive text embedding benchmark

N Muennighoff, N Tazi, L Magne, N Reimers - arXiv preprint arXiv …, 2022 - arxiv.org
Text embeddings are commonly evaluated on a small set of datasets from a single task, not
covering their possible applications to other tasks. It is unclear whether state-of-the-art …

Language-agnostic BERT sentence embedding

F Feng, Y Yang, D Cer, N Arivazhagan… - arXiv preprint arXiv …, 2020 - arxiv.org
While BERT is an effective method for learning monolingual sentence embeddings for
semantic similarity and embedding-based transfer learning (Reimers and Gurevych, 2019) …
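
As an illustration of embedding-based semantic similarity with a language-agnostic sentence encoder, here is a minimal sketch using the sentence-transformers library and the publicly released "sentence-transformers/LaBSE" checkpoint; the library, model identifier, and example sentences are assumptions for illustration, not taken from the entry above.

```python
# Sketch: cross-lingual semantic similarity with a multilingual sentence encoder.
# Assumes the sentence-transformers package and the public "sentence-transformers/LaBSE"
# checkpoint; model name and sentences are illustrative.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/LaBSE")

sentences = [
    "The cat sits on the mat.",          # English
    "Die Katze sitzt auf der Matte.",    # German translation of the first sentence
    "Stock prices fell sharply today.",  # unrelated English sentence
]
embeddings = model.encode(sentences, normalize_embeddings=True)

# Cosine similarity of the translation pair should exceed that of the unrelated pair.
print(util.cos_sim(embeddings[0], embeddings[1]))  # expected: high (translations)
print(util.cos_sim(embeddings[0], embeddings[2]))  # expected: low (different meaning)
```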

XTREME: A massively multilingual multi-task benchmark for evaluating cross-lingual generalisation

J Hu, S Ruder, A Siddhant, G Neubig… - International …, 2020 - proceedings.mlr.press
Much recent progress in applications of machine learning models to NLP has been driven
by benchmarks that evaluate models across a wide variety of tasks. However, these broad …

Massively multilingual sentence embeddings for zero-shot cross-lingual transfer and beyond

M Artetxe, H Schwenk - … of the Association for Computational Linguistics, 2019 - direct.mit.edu
We introduce an architecture to learn joint multilingual sentence representations for 93
languages, belonging to more than 30 different families and written in 28 different scripts …
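
The zero-shot cross-lingual transfer setting named in the title can be sketched as follows: fit a classifier on embeddings of labelled English text only, then apply it unchanged to another language. The encoder, toy data, and scikit-learn classifier below are illustrative assumptions, not the original LASER toolkit.

```python
# Sketch: zero-shot cross-lingual transfer with language-agnostic sentence embeddings.
# A classifier is fit on English examples only and applied directly to another language.
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

encoder = SentenceTransformer("sentence-transformers/LaBSE")  # any multilingual encoder

train_texts = ["I loved this movie.", "Terrible film, total waste of time."]
train_labels = [1, 0]  # 1 = positive, 0 = negative

clf = LogisticRegression().fit(encoder.encode(train_texts), train_labels)

# No Spanish training data was used; transfer works to the extent that translations
# land close together in the shared embedding space.
print(clf.predict(encoder.encode(["Una película maravillosa."])))  # expected: [1]
```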

Pre-training via paraphrasing

M Lewis, M Ghazvininejad, G Ghosh… - Advances in …, 2020 - proceedings.neurips.cc
We introduce MARGE, a pre-trained sequence-to-sequence model learned with an
unsupervised multi-lingual multi-document paraphrasing objective. MARGE provides an …

CCMatrix: Mining billions of high-quality parallel sentences on the web

H Schwenk, G Wenzek, S Edunov, E Grave… - arXiv preprint arXiv …, 2019 - arxiv.org
We show that margin-based bitext mining in a multilingual sentence space can be applied to
monolingual corpora of billions of sentences. We use ten snapshots of a curated …

Rethinking embedding coupling in pre-trained language models

HW Chung, T Fevry, H Tsai, M Johnson… - arXiv preprint arXiv …, 2020 - arxiv.org
We re-evaluate the standard practice of sharing weights between input and output
embeddings in state-of-the-art pre-trained language models. We show that decoupled …
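
A minimal PyTorch sketch of what coupled versus decoupled input/output embeddings mean in practice; the toy model and hyperparameters are chosen purely for illustration and do not reproduce the architectures studied in the paper.

```python
# Sketch: coupled vs. decoupled input/output embeddings in a toy language-model head.
import torch
import torch.nn as nn

vocab_size, d_model = 32000, 512

class TinyLM(nn.Module):
    def __init__(self, tie_embeddings: bool):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)              # input embedding
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)   # output projection
        if tie_embeddings:
            # Coupled: input and output embeddings share one weight matrix.
            self.lm_head.weight = self.embed.weight

    def forward(self, token_ids):
        hidden = self.embed(token_ids)   # stand-in for a full transformer body
        return self.lm_head(hidden)      # logits over the vocabulary

coupled = TinyLM(tie_embeddings=True)
decoupled = TinyLM(tie_embeddings=False)
count = lambda m: sum(p.numel() for p in m.parameters())
print(count(coupled), count(decoupled))  # decoupling adds vocab_size * d_model parameters
```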

Margin-based parallel corpus mining with multilingual sentence embeddings

M Artetxe, H Schwenk - arXiv preprint arXiv:1811.01136, 2018 - arxiv.org
Machine translation is highly sensitive to the size and quality of the training data, which has
led to an increasing interest in collecting and filtering large parallel corpora. In this paper, we …
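
The margin criterion at the heart of this line of work scores a candidate pair by its cosine similarity relative to the average similarity of each sentence to its nearest neighbours in the other language. Below is a small NumPy sketch of such a ratio-margin score; the brute-force similarity matrix and toy vectors are illustrative, whereas production mining over large corpora relies on approximate nearest-neighbour indexes (e.g. FAISS).

```python
# Sketch of a ratio-margin score for bitext mining: cosine similarity of a candidate pair,
# normalized by the average similarity of each sentence to its k nearest neighbours.
import numpy as np

def margin_scores(src, tgt, k=4):
    """src, tgt: L2-normalized embedding matrices (n_src x d, n_tgt x d)."""
    sim = src @ tgt.T                                      # cosine similarities
    # Average similarity to the k nearest neighbours, in both directions.
    knn_src = np.sort(sim, axis=1)[:, -k:].mean(axis=1)    # per source sentence
    knn_tgt = np.sort(sim, axis=0)[-k:, :].mean(axis=0)    # per target sentence
    denom = (knn_src[:, None] + knn_tgt[None, :]) / 2.0
    return sim / denom                                     # higher margin -> more likely a translation pair

rng = np.random.default_rng(0)
src = rng.normal(size=(8, 16))
tgt = rng.normal(size=(10, 16))
src /= np.linalg.norm(src, axis=1, keepdims=True)
tgt /= np.linalg.norm(tgt, axis=1, keepdims=True)
print(margin_scores(src, tgt).shape)  # (8, 10) score matrix; keep pairs above a threshold
```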

X-FACTR: Multilingual factual knowledge retrieval from pretrained language models

Z Jiang, A Anastasopoulos, J Araki… - Proceedings of the …, 2020 - aclanthology.org
Language models (LMs) have proven surprisingly successful at capturing factual
knowledge by completing cloze-style fill-in-the-blank questions such as “Punta Cana is …
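
A cloze-style probe of this kind can be reproduced with a stock fill-mask pipeline; the model choice and the completed prompt below are assumptions for illustration, not the prompts or decoding strategies used in X-FACTR itself.

```python
# Sketch: probing a pretrained multilingual LM with a cloze-style prompt.
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-multilingual-cased")

# This model's mask token is "[MASK]"; multi-token answers (common for many languages
# and entity names) need the more elaborate decoding strategies studied in X-FACTR.
for pred in fill("Punta Cana is located in the [MASK] Republic."):
    print(f'{pred["token_str"]:>15}  {pred["score"]:.3f}')
```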

A primer on pretrained multilingual language models

S Doddapaneni, G Ramesh, MM Khapra… - arXiv preprint arXiv …, 2021 - arxiv.org
Multilingual Language Models (MLLMs) such as mBERT, XLM, XLM-R, etc. have
emerged as a viable option for bringing the power of pretraining to a large number of …