Text algorithms in economics

E Ash, S Hansen - Annual Review of Economics, 2023 - annualreviews.org
This article provides an overview of the methods used for algorithmic text analysis in
economics, with a focus on three key contributions. First, we introduce methods for …

A survey of multilingual neural machine translation

R Dabre, C Chu, A Kunchukuttan - ACM Computing Surveys (CSUR), 2020 - dl.acm.org
We present a survey on multilingual neural machine translation (MNMT), which has gained
a lot of traction in recent years. MNMT has been useful in improving translation quality as a …

Text embeddings by weakly-supervised contrastive pre-training

L Wang, N Yang, X Huang, B Jiao, L Yang… - arXiv preprint arXiv …, 2022 - arxiv.org
This paper presents E5, a family of state-of-the-art text embeddings that transfer well to a
wide range of tasks. The model is trained in a contrastive manner with weak supervision …

Beyond English-centric multilingual machine translation

A Fan, S Bhosale, H Schwenk, Z Ma, A El-Kishky… - Journal of Machine …, 2021 - jmlr.org
Existing work in translation demonstrated the potential of massively multilingual machine
translation by training a single model able to translate between any pair of languages …

COMET: A neural framework for MT evaluation

R Rei, C Stewart, AC Farinha, A Lavie - arXiv preprint arXiv:2009.09025, 2020 - arxiv.org
We present COMET, a neural framework for training multilingual machine translation
evaluation models which obtains new state-of-the-art levels of correlation with human …

Language-agnostic BERT sentence embedding

F Feng, Y Yang, D Cer, N Arivazhagan… - arXiv preprint arXiv …, 2020 - arxiv.org
While BERT is an effective method for learning monolingual sentence embeddings for
semantic similarity and embedding based transfer learning (Reimers and Gurevych, 2019) …

VoxPopuli: A large-scale multilingual speech corpus for representation learning, semi-supervised learning and interpretation

C Wang, M Riviere, A Lee, A Wu, C Talnikar… - arXiv preprint arXiv …, 2021 - arxiv.org
We introduce VoxPopuli, a large-scale multilingual corpus providing 100K hours of
unlabelled speech data in 23 languages. It is the largest open data to date for unsupervised …

Making monolingual sentence embeddings multilingual using knowledge distillation

N Reimers, I Gurevych - arXiv preprint arXiv:2004.09813, 2020 - arxiv.org
We present an easy and efficient method to extend existing sentence embedding models to
new languages. This allows to create multilingual versions from previously monolingual …

XTREME: A massively multilingual multi-task benchmark for evaluating cross-lingual generalisation

J Hu, S Ruder, A Siddhant, G Neubig… - International …, 2020 - proceedings.mlr.press
Much recent progress in applications of machine learning models to NLP has been driven
by benchmarks that evaluate models across a wide variety of tasks. However, these broad …

The state and fate of linguistic diversity and inclusion in the NLP world

P Joshi, S Santy, A Budhiraja, K Bali… - arXiv preprint arXiv …, 2020 - arxiv.org
Language technologies contribute to promoting multilingualism and linguistic diversity
around the world. However, only a very small number of the over 7000 languages of the …