Word sense disambiguation: A survey
R Navigli - ACM Computing Surveys (CSUR), 2009 - dl.acm.org
Word sense disambiguation (WSD) is the ability to identify the meaning of words in context
in a computational manner. WSD is considered an AI-complete problem, that is, a task …
Statistical machine translation
A Lopez - ACM Computing Surveys (CSUR), 2008 - dl.acm.org
Statistical machine translation (SMT) treats the translation of natural language as a machine
learning problem. By examining many samples of human-produced translation, SMT …
[BOOK][B] Pretrained transformers for text ranking: BERT and beyond
The goal of text ranking is to generate an ordered list of texts retrieved from a corpus in
response to a query. Although the most common formulation of text ranking is search …
WikiMatrix: Mining 135M parallel sentences in 1620 language pairs from Wikipedia
We present an approach based on multilingual sentence embeddings to automatically
extract parallel sentences from the content of Wikipedia articles in 85 languages, including …
ParaCrawl: Web-scale acquisition of parallel corpora
We report on methods to create the largest publicly available parallel corpora by crawling
the web, using open source software. We empirically compare alternative methods and …
CCMatrix: Mining billions of high-quality parallel sentences on the web
We show that margin-based bitext mining in a multilingual sentence space can be applied to
monolingual corpora of billions of sentences. We are using ten snapshots of a curated …
Revolt: Collaborative crowdsourcing for labeling machine learning datasets
Crowdsourcing provides a scalable and efficient way to construct labeled datasets for
training machine learning systems. However, creating comprehensive label guidelines for …
CCAligned: A massive collection of cross-lingual web-document pairs
Cross-lingual document alignment aims to identify pairs of documents in two distinct
languages that are of comparable content or translations of each other. In this paper, we …
[BOOK][B] Translation-driven corpora: Corpus resources for descriptive and applied translation studies
F Zanettin - 2014 - taylorfrancis.com
Electronic texts and text analysis tools have opened up a wealth of opportunities to higher
education and language service providers, but learning to use these resources continues to …
[BOOK][B] Handbook of natural language processing
N Indurkhya, FJ Damerau - 2010 - taylorfrancis.com
The Handbook of Natural Language Processing, Second Edition presents practical tools
and techniques for implementing natural language processing in computer systems. Along …