Google Scholar

Beyond English-centric multilingual machine translation

A Fan, S Bhosale, H Schwenk, Z Ma, A El-Kishky… - Journal of Machine …, 2021 - jmlr.org

Existing work in translation demonstrated the potential of massively multilingual machine
translation by training a single model able to translate between any pair of languages …

Cited by 859 (9 versions)

[PDF] arxiv.org

WikiMatrix: Mining 135M parallel sentences in 1620 language pairs from Wikipedia

H Schwenk, V Chaudhary, S Sun, H Gong… - arXiv preprint arXiv …, 2019 - arxiv.org

We present an approach based on multilingual sentence embeddings to automatically
extract parallel sentences from the content of Wikipedia articles in 85 languages, including …

Cited by 367 (5 versions)

[PDF] strath.ac.uk

ParaCrawl: Web-scale acquisition of parallel corpora

M Bañón, P Chen, B Haddow, K Heafield, H Hoang… - 2020 - strathprints.strath.ac.uk

We report on methods to create the largest publicly available parallel corpora by crawling
the web, using open source software. We empirically compare alternative methods and …

Cited by 274 (17 versions)

[PDF] arxiv.org

Detecting hallucinated content in conditional neural sequence generation

C Zhou, G Neubig, J Gu, M Diab, P Guzman… - arXiv preprint arXiv …, 2020 - arxiv.org

Neural sequence models can generate highly fluent sentences, but recent studies have also
shown that they are also prone to hallucinate additional content not supported by the input …

Cited by 206 (6 versions)

[PDF] arxiv.org

The FLORES evaluation datasets for low-resource machine translation: Nepali-English and Sinhala-English

F Guzmán, PJ Chen, M Ott, J Pino, G Lample… - arXiv preprint arXiv …, 2019 - arxiv.org

For machine translation, a vast majority of language pairs in the world are considered
low-resource because they have little parallel data available. Besides the technical challenges …

Cited by 320 (6 versions)

[PDF] arxiv.org

CCMatrix: Mining billions of high-quality parallel sentences on the web

H Schwenk, G Wenzek, S Edunov, E Grave… - arXiv preprint arXiv …, 2019 - arxiv.org

We show that margin-based bitext mining in a multilingual sentence space can be applied to
monolingual corpora of billions of sentences. We are using ten snapshots of a curated …

Cited by 241 (5 versions)

[PDF] arxiv.org

Automatic machine translation evaluation in many languages via zero-shot paraphrasing

B Thompson, M Post - arXiv preprint arXiv:2004.14564, 2020 - arxiv.org

We frame the task of machine translation evaluation as one of scoring machine translation
output with a sequence-to-sequence paraphraser, conditioned on a human reference. We …

Cited by 193 (5 versions)

[PDF] arxiv.org

CCAligned: A massive collection of cross-lingual web-document pairs

A El-Kishky, V Chaudhary, F Guzmán… - arXiv preprint arXiv …, 2019 - arxiv.org

Cross-lingual document alignment aims to identify pairs of documents in two distinct
languages that are of comparable content or translations of each other. In this paper, we …

Cited by 185 (9 versions)

[PDF] aclanthology.org

Vecalign: Improved sentence alignment in linear time and space

B Thompson, P Koehn - Proceedings of the 2019 conference on …, 2019 - aclanthology.org

We introduce Vecalign, a novel bilingual sentence alignment method which is linear in time
and space with respect to the number of sentences being aligned and which requires only …

Cited by 121 (2 versions)

[PDF] arxiv.org

Not low-resource anymore: Aligner ensembling, batch filtering, and new datasets for Bengali-English machine translation

T Hasan, A Bhattacharjee, K Samin, M Hasan… - arXiv preprint arXiv …, 2020 - arxiv.org

Despite being the seventh most widely spoken language in the world, Bengali has received
much less attention in machine translation literature due to being low in resources. Most …

Cited by 76 (5 versions)

Low-resource corpus filtering using multilingual sentence embeddings

Beyond english-centric multilingual machine translation‏
