MTEB: Massive text embedding benchmark
Text embeddings are commonly evaluated on a small set of datasets from a single task not
covering their possible applications to other tasks. It is unclear whether state-of-the-art …
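If a model needs to be scored on this benchmark, the authors distribute an `mteb` Python package; the sketch below assumes that package plus `sentence-transformers` are installed, and the exact interface may differ between releases — the task choice and output folder are illustrative.

```python
# Minimal MTEB evaluation sketch (interface assumed from the mteb package).
from mteb import MTEB
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
evaluation = MTEB(tasks=["Banking77Classification"])  # any MTEB task names
results = evaluation.run(model, output_folder="results/all-MiniLM-L6-v2")
```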
Language-agnostic BERT sentence embedding
While BERT is an effective method for learning monolingual sentence embeddings for
semantic similarity and embedding based transfer learning (Reimers and Gurevych, 2019) …
Xtreme: A massively multilingual multi-task benchmark for evaluating cross-lingual generalisation
Much recent progress in applications of machine learning models to NLP has been driven
by benchmarks that evaluate models across a wide variety of tasks. However, these broad …
Massively multilingual sentence embeddings for zero-shot cross-lingual transfer and beyond
We introduce an architecture to learn joint multilingual sentence representations for 93
languages, belonging to more than 30 different families and written in 28 different scripts …
Pre-training via paraphrasing
We introduce MARGE, a pre-trained sequence-to-sequence model learned with an
unsupervised multi-lingual multi-document paraphrasing objective. MARGE provides an …
CCMatrix: Mining billions of high-quality parallel sentences on the web
We show that margin-based bitext mining in a multilingual sentence space can be applied to
monolingual corpora of billions of sentences. We are using ten snapshots of a curated …
Rethinking embedding coupling in pre-trained language models
We re-evaluate the standard practice of sharing weights between input and output
embeddings in state-of-the-art pre-trained language models. We show that decoupled …
Margin-based parallel corpus mining with multilingual sentence embeddings
Machine translation is highly sensitive to the size and quality of the training data, which has
led to an increasing interest in collecting and filtering large parallel corpora. In this paper, we …
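This entry and CCMatrix above both rate candidate sentence pairs with a ratio-margin criterion over nearest neighbours in a shared multilingual embedding space. Below is a small NumPy sketch of that scoring rule, assuming pre-computed L2-normalised sentence embeddings; the function name and the value of k are illustrative, not the papers' code.

```python
import numpy as np

def margin_scores(src_emb, tgt_emb, k=4):
    """Ratio-margin scores for bitext mining (illustrative sketch).

    src_emb: (n, d) and tgt_emb: (m, d), rows L2-normalised.
    Returns an (n, m) matrix; higher scores suggest translation pairs.
    """
    sims = src_emb @ tgt_emb.T  # cosine similarities
    # Average similarity of each sentence to its k nearest neighbours in the
    # other language, used to normalise away "hub" sentences.
    knn_src = np.sort(sims, axis=1)[:, -k:].mean(axis=1)  # per source sentence
    knn_tgt = np.sort(sims, axis=0)[-k:, :].mean(axis=0)  # per target sentence
    denom = (knn_src[:, None] + knn_tgt[None, :]) / 2.0
    return sims / denom
```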
X-FACTR: Multilingual factual knowledge retrieval from pretrained language models
Language models (LMs) have proven surprisingly successful at capturing factual
knowledge by completing cloze-style fill-in-the-blank questions such as “Punta Cana is …
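The cloze-style probing the snippet refers to can be reproduced generically with a masked-language-model pipeline; the sketch below is not the X-FACTR code, merely an illustration with Hugging Face Transformers and mBERT, where the model choice and prompt are assumptions.

```python
# Generic cloze-style factual probe (illustrative; not the X-FACTR pipeline).
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-multilingual-cased")
for pred in fill_mask("Punta Cana is located in [MASK]."):
    # Each prediction carries the filled-in token and the model's probability.
    print(f"{pred['token_str']:>15}  {pred['score']:.3f}")
```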
A primer on pretrained multilingual language models
Multilingual Language Models (MLLMs) such as mBERT, XLM, XLM-R, etc. have
emerged as a viable option for bringing the power of pretraining to a large number of …