Word sense disambiguation: A survey

R Navigli - ACM Computing Surveys (CSUR), 2009 - dl.acm.org
Word sense disambiguation (WSD) is the ability to identify the meaning of words in context
in a computational manner. WSD is considered an AI-complete problem, that is, a task …

Statistical machine translation

A Lopez - ACM Computing Surveys (CSUR), 2008 - dl.acm.org
Statistical machine translation (SMT) treats the translation of natural language as a machine
learning problem. By examining many samples of human-produced translation, SMT …

[BOOK][B] Pretrained transformers for text ranking: BERT and beyond

J Lin, R Nogueira, A Yates - 2022 - books.google.com
The goal of text ranking is to generate an ordered list of texts retrieved from a corpus in
response to a query. Although the most common formulation of text ranking is search …

WikiMatrix: Mining 135M parallel sentences in 1620 language pairs from Wikipedia

H Schwenk, V Chaudhary, S Sun, H Gong… - arXiv preprint arXiv …, 2019 - arxiv.org
We present an approach based on multilingual sentence embeddings to automatically
extract parallel sentences from the content of Wikipedia articles in 85 languages, including …

ParaCrawl: Web-scale acquisition of parallel corpora

M Bañón, P Chen, B Haddow, K Heafield, H Hoang… - 2020 - strathprints.strath.ac.uk
We report on methods to create the largest publicly available parallel corpora by crawling
the web, using open source software. We empirically compare alternative methods and …

CCMatrix: Mining billions of high-quality parallel sentences on the web

H Schwenk, G Wenzek, S Edunov, E Grave… - arXiv preprint arXiv …, 2019 - arxiv.org
We show that margin-based bitext mining in a multilingual sentence space can be applied to
monolingual corpora of billions of sentences. We are using ten snapshots of a curated …

Revolt: Collaborative crowdsourcing for labeling machine learning datasets

JC Chang, S Amershi, E Kamar - … of the 2017 CHI conference on human …, 2017 - dl.acm.org
Crowdsourcing provides a scalable and efficient way to construct labeled datasets for
training machine learning systems. However, creating comprehensive label guidelines for …

CCAligned: A massive collection of cross-lingual web-document pairs

A El-Kishky, V Chaudhary, F Guzmán… - arXiv preprint arXiv …, 2019 - arxiv.org
Cross-lingual document alignment aims to identify pairs of documents in two distinct
languages that are of comparable content or translations of each other. In this paper, we …

[BOOK][B] Translation-driven corpora: Corpus resources for descriptive and applied translation studies

F Zanettin - 2014 - taylorfrancis.com
Electronic texts and text analysis tools have opened up a wealth of opportunities to higher
education and language service providers, but learning to use these resources continues to …

[BOOK][B] Handbook of natural language processing

N Indurkhya, FJ Damerau - 2010 - taylorfrancis.com
The Handbook of Natural Language Processing, Second Edition presents practical tools
and techniques for implementing natural language processing in computer systems. Along …