Improving text embeddings with large language models

L Wang, N Yang, X Huang, L Yang… - arXiv preprint arXiv …, 2023 - arxiv.org
In this paper, we introduce a novel and simple method for obtaining high-quality text
embeddings using only synthetic data and fewer than 1k training steps. Unlike existing …
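
The abstract's recipe has two steps: prompt an LLM to synthesize training data, then briefly fine-tune an embedding model on it. Below is a minimal sketch of the fine-tuning step only, assuming (query, positive) pairs have already been synthesized; the loss shown is a standard InfoNCE with in-batch negatives, a common choice for embedding fine-tuning rather than this paper's confirmed setup, and all names are illustrative.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(query_emb, pos_emb, temperature=0.05):
    """Contrastive loss over a batch of synthetic (query, positive) pairs.

    query_emb, pos_emb: (batch, dim) L2-normalized embeddings from the
    encoder being fine-tuned; every other in-batch positive acts as a negative.
    """
    logits = query_emb @ pos_emb.T / temperature          # (batch, batch) cosine / temperature
    labels = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, labels)                # match each query to its own positive

# Toy usage with random vectors standing in for encoder outputs.
q = F.normalize(torch.randn(8, 768), dim=-1)
p = F.normalize(torch.randn(8, 768), dim=-1)
print(info_nce_loss(q, p).item())
```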

MTEB: Massive text embedding benchmark

N Muennighoff, N Tazi, L Magne, N Reimers - arXiv preprint arXiv …, 2022 - arxiv.org
Text embeddings are commonly evaluated on a small set of datasets from a single task, which
does not cover their possible applications to other tasks. It is unclear whether state-of-the-art …
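
MTEB ships with an open-source Python harness, so any model exposing an encode() method can be scored. A hedged sketch of evaluating one task follows; the exact API has shifted across mteb versions, so treat the calls as illustrative:

```python
from mteb import MTEB
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")           # any encoder with .encode()
evaluation = MTEB(tasks=["Banking77Classification"])      # pick one benchmark task
results = evaluation.run(model, output_folder="results")  # per-task scores written as JSON
```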

Multilingual E5 text embeddings: A technical report

L Wang, N Yang, X Huang, L Yang… - arXiv preprint arXiv …, 2024 - arxiv.org
This technical report presents the training methodology and evaluation results of the open-
source multilingual E5 text embedding models, released in mid-2023. Three embedding …
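
The released checkpoints are on the Hugging Face hub and follow E5's convention of prefixing inputs with "query: " or "passage: ". A hedged sketch, with the checkpoint name and prefixes taken from the public model card rather than this report's text:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/multilingual-e5-base")
emb = model.encode(
    ["query: how do text embeddings work",
     "passage: Text embeddings map sentences to dense vectors."],
    normalize_embeddings=True,
)
print(emb @ emb.T)  # cosine similarities, since the vectors are unit-normalized
```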

Language-agnostic BERT sentence embedding

F Feng, Y Yang, D Cer, N Arivazhagan… - arXiv preprint arXiv …, 2020 - arxiv.org
While BERT is an effective method for learning monolingual sentence embeddings for
semantic similarity and embedding-based transfer learning (Reimers and Gurevych, 2019) …
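
A hedged sketch of the typical use of LaBSE, scoring a translation pair with the public checkpoint via sentence-transformers (the checkpoint name comes from the hub, not from this paper's text):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/LaBSE")
emb = model.encode(
    ["The cat sits on the mat.",
     "Le chat est assis sur le tapis."],  # French translation of the first sentence
    normalize_embeddings=True,
)
print(float(emb[0] @ emb[1]))  # translation pairs score near the top of the cosine range
```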

XTREME: A massively multilingual multi-task benchmark for evaluating cross-lingual generalisation

J Hu, S Ruder, A Siddhant, G Neubig… - International …, 2020 - proceedings.mlr.press
Much recent progress in applications of machine learning models to NLP has been driven
by benchmarks that evaluate models across a wide variety of tasks. However, these broad …
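
The benchmark's tasks are mirrored on the Hugging Face hub, so individual portions can be pulled with the datasets library. A hedged sketch; the config, split, and field names follow the hub dataset card, and loading assumes a datasets version that still serves this dataset:

```python
from datasets import load_dataset

xnli = load_dataset("xtreme", "XNLI")  # the cross-lingual NLI portion of XTREME
print(xnli["test"][0])                 # one premise/hypothesis/label row with its language code
```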

Massively multilingual sentence embeddings for zero-shot cross-lingual transfer and beyond

M Artetxe, H Schwenk - … of the Association for Computational Linguistics, 2019 - direct.mit.edu
We introduce an architecture to learn joint multilingual sentence representations for 93
languages, belonging to more than 30 different families and written in 28 different scripts …
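
The LASER encoder behind this work is distributed with pretrained models; the community laserembeddings wrapper offers the shortest path to its 1024-dimensional, language-agnostic sentence vectors. A hedged sketch (the wrapper is third-party, and its models must first be fetched with `python -m laserembeddings download-models`):

```python
from laserembeddings import Laser

laser = Laser()
vecs = laser.embed_sentences(
    ["Hello, world!", "Hallo, Welt!"],  # the same sentence in English and German
    lang=["en", "de"],
)
print(vecs.shape)  # (2, 1024); translations land close together in this space
```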

CCMatrix: Mining billions of high-quality parallel sentences on the web

H Schwenk, G Wenzek, S Edunov, E Grave… - arXiv preprint arXiv …, 2019 - arxiv.org
We show that margin-based bitext mining in a multilingual sentence space can be applied to
monolingual corpora of billions of sentences. We use ten snapshots of a curated …
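
A toy sketch of the mining loop at the heart of such a pipeline: index one side's sentence embeddings with FAISS, search neighbors from the other side, and score candidates with a margin. Embeddings are assumed L2-normalized so inner product equals cosine; the one-directional margin below is a simplification of the full criterion (see the last entry in this list), and corpus-scale mining additionally shards the index:

```python
import numpy as np
import faiss

def mine_pairs(src, tgt, k=4):
    index = faiss.IndexFlatIP(src.shape[1])   # inner product == cosine on unit vectors
    index.add(tgt.astype("float32"))
    sims, ids = index.search(src.astype("float32"), k)  # k nearest targets per source
    # Ratio margin: best cosine over the mean cosine of its neighborhood.
    # (The full criterion averages neighborhoods in both directions.)
    margin = sims[:, 0] / sims.mean(axis=1)
    return ids[:, 0], margin

rng = np.random.default_rng(0)
src = rng.normal(size=(100, 64)); src /= np.linalg.norm(src, axis=1, keepdims=True)
tgt = rng.normal(size=(100, 64)); tgt /= np.linalg.norm(tgt, axis=1, keepdims=True)
cand, score = mine_pairs(src, tgt)
print(cand[:3], score[:3])  # keep only pairs whose margin clears a threshold
```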

Pre-training via paraphrasing

M Lewis, M Ghazvininejad, G Ghosh… - Advances in …, 2020 - proceedings.neurips.cc
We introduce MARGE, a pre-trained sequence-to-sequence model learned with an
unsupervised multi-lingual multi-document paraphrasing objective. MARGE provides an …
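
A heavily simplified, toy rendering of the paraphrasing objective: reconstruct a target document from retrieved evidence documents, with each piece of evidence weighted by its embedding relevance. In MARGE itself the relevance scores modulate cross-attention inside the seq2seq model; the mixture over per-evidence likelihoods below is only an approximation of that idea:

```python
import torch
import torch.nn.functional as F

def paraphrase_loss(logprob_per_evidence, relevance):
    """logprob_per_evidence: (E,) log p(target | evidence_j) from a seq2seq model.
    relevance: (E,) similarity between target and evidence embeddings."""
    log_w = F.log_softmax(relevance, dim=0)        # normalized relevance weights
    # -log sum_j w_j * p(target | evidence_j), computed stably in log space
    return -torch.logsumexp(logprob_per_evidence + log_w, dim=0)

loss = paraphrase_loss(torch.tensor([-12.3, -9.8, -15.1]),
                       torch.tensor([0.2, 0.7, 0.1]))
print(loss.item())
```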

Rethinking embedding coupling in pre-trained language models

HW Chung, T Fevry, H Tsai, M Johnson… - arXiv preprint arXiv …, 2020 - arxiv.org
We re-evaluate the standard practice of sharing weights between input and output
embeddings in state-of-the-art pre-trained language models. We show that decoupled …
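
What "decoupling" means is easiest to see in a toy language-model head: coupled models share one matrix between the input embedding and the output softmax projection, while decoupled models keep two independent matrices. A minimal sketch of the two wirings in PyTorch:

```python
import torch.nn as nn

class TinyLM(nn.Module):
    def __init__(self, vocab=1000, dim=64, tied=True):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)           # input embeddings
        self.proj = nn.Linear(dim, vocab, bias=False)   # output (softmax) embeddings
        if tied:
            self.proj.weight = self.embed.weight        # coupled: one shared matrix
        # tied=False leaves the matrices independent: the decoupled setup

    def forward(self, ids):
        return self.proj(self.embed(ids))               # logits over the vocabulary
```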

Margin-based parallel corpus mining with multilingual sentence embeddings

M Artetxe, H Schwenk - arXiv preprint arXiv:1811.01136, 2018 - arxiv.org
Machine translation is highly sensitive to the size and quality of the training data, which has
led to an increasing interest in collecting and filtering large parallel corpora. In this paper, we …
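
The paper's ratio-margin criterion scores a candidate pair by its cosine relative to the average cosine of each side's k-nearest-neighbor neighborhood, so a pair only wins when it is markedly closer than the local competition. Written out for a single candidate in NumPy:

```python
import numpy as np

def ratio_margin(cos_xy, nn_x, nn_y):
    """cos_xy: cosine of the candidate pair (x, y).
    nn_x: cosines of x to its k nearest neighbors on the target side.
    nn_y: cosines of y to its k nearest neighbors on the source side."""
    k = len(nn_x)
    denom = nn_x.sum() / (2 * k) + nn_y.sum() / (2 * k)
    return cos_xy / denom

# A pair scores well when it stands out from both neighborhoods.
print(ratio_margin(0.82, np.array([0.50, 0.45, 0.40, 0.38]),
                         np.array([0.48, 0.44, 0.41, 0.37])))
```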