WikiMatrix: Mining 135M parallel sentences in 1620 language pairs from Wikipedia

H Schwenk, V Chaudhary, S Sun, H Gong… - arXiv preprint arXiv …, 2019 - arxiv.org
We present an approach based on multilingual sentence embeddings to automatically
extract parallel sentences from the content of Wikipedia articles in 85 languages, including …

CCMatrix: Mining billions of high-quality parallel sentences on the web

H Schwenk, G Wenzek, S Edunov, E Grave… - arXiv preprint arXiv …, 2019 - arxiv.org
We show that margin-based bitext mining in a multilingual sentence space can be applied to
monolingual corpora of billions of sentences. We are using ten snapshots of a curated …

CsFEVER and CTKFacts: acquiring Czech data for fact verification

H Ullrich, J Drchal, M Rýpar, H Vincourová… - Language Resources …, 2023 - Springer
In this paper, we examine several methods of acquiring Czech data for automated fact-
checking, which is a task commonly modeled as a classification of textual claim veracity wrt …

TEP: Tehran English-Persian parallel corpus

MT Pilevar, H Faili, AH Pilevar - International conference on intelligent text …, 2011 - Springer
Parallel corpora are one of the key resources in natural language processing. In spite of
their importance in many multi-lingual applications, no large-scale English-Persian corpus …

Semantic orientation of crosslingual sentiments: Employment of lexicon and dictionaries

AA Raza, A Habib, J Ashraf, B Shah, F Moreira - IEEE Access, 2023 - ieeexplore.ieee.org
Sentiment Analysis is a modern discipline at the crossroads of data mining and natural
language processing. It is concerned with the computational treatment of public moods …

On the mono- and cross-language detection of text reuse and plagiarism

A Barrón-Cedeño - Proceedings of the 33rd international ACM SIGIR …, 2010 - dl.acm.org
Plagiarism, the unacknowledged reuse of text, has increased in recent years due to the
large amount of texts readily available. For instance, recent studies claim that nowadays a …

JMaxAlign: A maximum entropy parallel sentence alignment tool

M Kaufmann - Proceedings of COLING 2012: Demonstration …, 2012 - aclanthology.org
Parallel corpora are an extremely useful tool in many natural language processing tasks,
particularly statistical machine translation. Parallel corpora for certain language pairs, such …

Hybrid distance-statistical-based phrase alignment for analyzing parallel texts in standard Malay and Malay dialects

JKY Min, TP Tan… - Malaysian Journal of …, 2024 - mjes.um.edu.my
Parallel texts corpora are essential resources in linguistics and natural language
processing, especially in translation and multilingual information retrieval. The publicly …

MultiWiki: Interlingual text passage alignment in Wikipedia

S Gottschalk, E Demidova - ACM Transactions on the Web (TWEB), 2017 - dl.acm.org
In this article, we address the problem of text passage alignment across interlingual article
pairs in Wikipedia. We develop methods that enable the identification and interlinking of text …

Parallel-Wiki: A collection of parallel sentences extracted from Wikipedia

D Ştefănescu, R Ion - Proceedings of the 14th International Conference …, 2013 - cicling.org
Parallel corpora are essential resources for certain Natural Language Processing tasks such
as Statistical Machine Translation. However, the existing publicly available parallel …