Improving text embeddings with large language models

L Wang, N Yang, X Huang, L Yang… - arXiv preprint arXiv …, 2023 - arxiv.org
In this paper, we introduce a novel and simple method for obtaining high-quality text
embeddings using only synthetic data and fewer than 1k training steps. Unlike existing …
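
The abstract's recipe has two steps: prompt an LLM to synthesize training data, then briefly fine-tune an embedding model on it. Below is a minimal sketch of the fine-tuning step only, assuming (query, positive) pairs have already been synthesized; the loss shown is a standard InfoNCE with in-batch negatives, a common choice for embedding fine-tuning rather than this paper's confirmed setup, and all names are illustrative.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(query_emb, pos_emb, temperature=0.05):
    """Contrastive loss over a batch of synthetic (query, positive) pairs.

    query_emb, pos_emb: (batch, dim) L2-normalized embeddings from the
    encoder being fine-tuned; every other in-batch positive acts as a negative.
    """
    logits = query_emb @ pos_emb.T / temperature          # (batch, batch) cosine / temperature
    labels = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, labels)                # match each query to its own positive

# Toy usage with random vectors standing in for encoder outputs.
q = F.normalize(torch.randn(8, 768), dim=-1)
p = F.normalize(torch.randn(8, 768), dim=-1)
print(info_nce_loss(q, p).item())
```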

MTEB: Massive text embedding benchmark

N Muennighoff, N Tazi, L Magne, N Reimers - arXiv preprint arXiv …, 2022 - arxiv.org
Text embeddings are commonly evaluated on a small set of datasets from a single task, which
does not cover their possible applications to other tasks. It is unclear whether state-of-the-art …
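
MTEB ships with an open-source Python harness, so any model exposing an encode() method can be scored. A hedged sketch of evaluating one task follows; the exact API has shifted across mteb versions, so treat the calls as illustrative:

```python
from mteb import MTEB
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")           # any encoder with .encode()
evaluation = MTEB(tasks=["Banking77Classification"])      # pick one benchmark task
results = evaluation.run(model, output_folder="results")  # per-task scores written as JSON
```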

Multilingual E5 text embeddings: A technical report

L Wang, N Yang, X Huang, L Yang… - arXiv preprint arXiv …, 2024 - arxiv.org
This technical report presents the training methodology and evaluation results of the open-
source multilingual E5 text embedding models, released in mid-2023. Three embedding …
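
The released checkpoints are on the Hugging Face hub and follow E5's convention of prefixing inputs with "query: " or "passage: ". A hedged sketch, with the checkpoint name and prefixes taken from the public model card rather than this report's text:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/multilingual-e5-base")
emb = model.encode(
    ["query: how do text embeddings work",
     "passage: Text embeddings map sentences to dense vectors."],
    normalize_embeddings=True,
)
print(emb @ emb.T)  # cosine similarities, since the vectors are unit-normalized
```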

Language-agnostic BERT sentence embedding

F Feng, Y Yang, D Cer, N Arivazhagan… - arXiv preprint arXiv …, 2020 - arxiv.org
While BERT is an effective method for learning monolingual sentence embeddings for
semantic similarity and embedding-based transfer learning (Reimers and Gurevych, 2019) …
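
A hedged sketch of the typical use of LaBSE, scoring a translation pair with the public checkpoint via sentence-transformers (the checkpoint name comes from the hub, not from this paper's text):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/LaBSE")
emb = model.encode(
    ["The cat sits on the mat.",
     "Le chat est assis sur le tapis."],  # French translation of the first sentence
    normalize_embeddings=True,
)
print(float(emb[0] @ emb[1]))  # translation pairs score near the top of the cosine range
```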

XTREME: A massively multilingual multi-task benchmark for evaluating cross-lingual generalisation

J Hu, S Ruder, A Siddhant, G Neubig… - International …, 2020 - proceedings.mlr.press
Much recent progress in applications of machine learning models to NLP has been driven
by benchmarks that evaluate models across a wide variety of tasks. However, these broad …
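
The benchmark's tasks are mirrored on the Hugging Face hub, so individual portions can be pulled with the datasets library. A hedged sketch; the config, split, and field names follow the hub dataset card, and loading assumes a datasets version that still serves this dataset:

```python
from datasets import load_dataset

xnli = load_dataset("xtreme", "XNLI")  # the cross-lingual NLI portion of XTREME
print(xnli["test"][0])                 # one premise/hypothesis/label row with its language code
```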

Massively multilingual sentence embeddings for zero-shot cross-lingual transfer and beyond

M Artetxe, H Schwenk - … of the Association for Computational Linguistics, 2019 - direct.mit.edu
We introduce an architecture to learn joint multilingual sentence representations for 93
languages, belonging to more than 30 different families and written in 28 different scripts …
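
The LASER encoder behind this work is distributed with pretrained models; the community laserembeddings wrapper offers the shortest path to its 1024-dimensional, language-agnostic sentence vectors. A hedged sketch (the wrapper is third-party, and its models must first be fetched with `python -m laserembeddings download-models`):

```python
from laserembeddings import Laser

laser = Laser()
vecs = laser.embed_sentences(
    ["Hello, world!", "Hallo, Welt!"],  # the same sentence in English and German
    lang=["en", "de"],
)
print(vecs.shape)  # (2, 1024); translations land close together in this space
```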

CCMatrix: Mining billions of high-quality parallel sentences on the web

H Schwenk, G Wenzek, S Edunov, E Grave… - arXiv preprint arXiv …, 2019 - arxiv.org
We show that margin-based bitext mining in a multilingual sentence space can be applied to
monolingual corpora of billions of sentences. We use ten snapshots of a curated …
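
A toy sketch of the mining loop at the heart of such a pipeline: index one side's sentence embeddings with FAISS, search neighbors from the other side, and score candidates with a margin. Embeddings are assumed L2-normalized so inner product equals cosine; the one-directional margin below is a simplification of the full criterion (see the last entry in this list), and corpus-scale mining additionally shards the index:

```python
import numpy as np
import faiss

def mine_pairs(src, tgt, k=4):
    index = faiss.IndexFlatIP(src.shape[1])   # inner product == cosine on unit vectors
    index.add(tgt.astype("float32"))
    sims, ids = index.search(src.astype("float32"), k)  # k nearest targets per source
    # Ratio margin: best cosine over the mean cosine of its neighborhood.
    # (The full criterion averages neighborhoods in both directions.)
    margin = sims[:, 0] / sims.mean(axis=1)
    return ids[:, 0], margin

rng = np.random.default_rng(0)
src = rng.normal(size=(100, 64)); src /= np.linalg.norm(src, axis=1, keepdims=True)
tgt = rng.normal(size=(100, 64)); tgt /= np.linalg.norm(tgt, axis=1, keepdims=True)
cand, score = mine_pairs(src, tgt)
print(cand[:3], score[:3])  # keep only pairs whose margin clears a threshold
```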

Pre-training via paraphrasing

M Lewis, M Ghazvininejad, G Ghosh… - Advances in …, 2020 - proceedings.neurips.cc
We introduce MARGE, a pre-trained sequence-to-sequence model learned with an
unsupervised multi-lingual multi-document paraphrasing objective. MARGE provides an …
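
A heavily simplified, toy rendering of the paraphrasing objective: reconstruct a target document from retrieved evidence documents, with each piece of evidence weighted by its embedding relevance. In MARGE itself the relevance scores modulate cross-attention inside the seq2seq model; the mixture over per-evidence likelihoods below is only an approximation of that idea:

```python
import torch
import torch.nn.functional as F

def paraphrase_loss(logprob_per_evidence, relevance):
    """logprob_per_evidence: (E,) log p(target | evidence_j) from a seq2seq model.
    relevance: (E,) similarity between target and evidence embeddings."""
    log_w = F.log_softmax(relevance, dim=0)        # normalized relevance weights
    # -log sum_j w_j * p(target | evidence_j), computed stably in log space
    return -torch.logsumexp(logprob_per_evidence + log_w, dim=0)

loss = paraphrase_loss(torch.tensor([-12.3, -9.8, -15.1]),
                       torch.tensor([0.2, 0.7, 0.1]))
print(loss.item())
```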

Rethinking embedding coupling in pre-trained language models

HW Chung, T Fevry, H Tsai, M Johnson… - arXiv preprint arXiv …, 2020 - arxiv.org
We re-evaluate the standard practice of sharing weights between input and output
embeddings in state-of-the-art pre-trained language models. We show that decoupled …
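
What "decoupling" means is easiest to see in a toy language-model head: coupled models share one matrix between the input embedding and the output softmax projection, while decoupled models keep two independent matrices. A minimal sketch of the two wirings in PyTorch:

```python
import torch.nn as nn

class TinyLM(nn.Module):
    def __init__(self, vocab=1000, dim=64, tied=True):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)           # input embeddings
        self.proj = nn.Linear(dim, vocab, bias=False)   # output (softmax) embeddings
        if tied:
            self.proj.weight = self.embed.weight        # coupled: one shared matrix
        # tied=False leaves the matrices independent: the decoupled setup

    def forward(self, ids):
        return self.proj(self.embed(ids))               # logits over the vocabulary
```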

Margin-based parallel corpus mining with multilingual sentence embeddings

M Artetxe, H Schwenk - arXiv preprint arXiv:1811.01136, 2018 - arxiv.org
Machine translation is highly sensitive to the size and quality of the training data, which has
led to an increasing interest in collecting and filtering large parallel corpora. In this paper, we …
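
The paper's ratio-margin criterion scores a candidate pair by its cosine relative to the average cosine of each side's k-nearest-neighbor neighborhood, so a pair only wins when it is markedly closer than the local competition. Written out for a single candidate in NumPy:

```python
import numpy as np

def ratio_margin(cos_xy, nn_x, nn_y):
    """cos_xy: cosine of the candidate pair (x, y).
    nn_x: cosines of x to its k nearest neighbors on the target side.
    nn_y: cosines of y to its k nearest neighbors on the source side."""
    k = len(nn_x)
    denom = nn_x.sum() / (2 * k) + nn_y.sum() / (2 * k)
    return cos_xy / denom

# A pair scores well when it stands out from both neighborhoods.
print(ratio_margin(0.82, np.array([0.50, 0.45, 0.40, 0.38]),
                         np.array([0.48, 0.44, 0.41, 0.37])))
```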