Massively multilingual sentence embeddings for zero-shot cross-lingual transfer and beyond

M Artetxe, H Schwenk - … of the Association for Computational Linguistics, 2019 - direct.mit.edu
We introduce an architecture to learn joint multilingual sentence representations for 93
languages, belonging to more than 30 different families and written in 28 different scripts …

SimAlign: High quality word alignments without parallel training data using static and contextualized embeddings

MJ Sabet, P Dufter, F Yvon, H Schütze - arXiv preprint arXiv:2004.08728, 2020 - arxiv.org
Word alignments are useful for tasks like statistical and neural machine translation (NMT)
and cross-lingual annotation projection. Statistical word aligners perform well, as do …

A crosslingual investigation of conceptualization in 1335 languages

Y Liu, H Ye, L Weissweiler, P Wicke, R Pei… - arXiv preprint arXiv …, 2023 - arxiv.org
Languages differ in how they divide up the world into concepts and words; e.g., in contrast to
English, Swahili has a single concept for 'belly' and 'womb'. We investigate these differences in …

GPU-based private information retrieval for on-device machine learning inference

M Lam, J Johnson, W Xiong, K Maeng, U Gupta… - arXiv preprint arXiv …, 2023 - arxiv.org
On-device machine learning (ML) inference can enable the use of private user data on user
devices without revealing them to remote servers. However, a pure on-device solution to …

Crosslingual transfer learning for low-resource languages based on multilingual colexification graphs

Y Liu, H Ye, L Weissweiler, R Pei… - arXiv preprint arXiv …, 2023 - arxiv.org
In comparative linguistics, colexification refers to the phenomenon of a lexical form
conveying two or more distinct meanings. Existing work on colexification patterns relies on …

Graph-based multilingual label propagation for low-resource part-of-speech tagging

A Imani, S Severini, MJ Sabet, F Yvon… - arXiv preprint arXiv …, 2022 - arxiv.org
Part-of-Speech (POS) tagging is an important component of the NLP pipeline, but many
low-resource languages lack labeled data for training. An established method for training a POS …

Topology of word embeddings: Singularities reflect polysemy

A Jakubowski, M Gašić, M Zibrowius - arXiv preprint arXiv:2011.09413, 2020 - arxiv.org
The manifold hypothesis suggests that word vectors live on a submanifold within their
ambient vector space. We argue that we should, more accurately, expect them to live on a …

A multilingual BPE embedding space for universal sentiment lexicon induction

M Zhao, H Schütze - 2019 - epub.ub.uni-muenchen.de
We present a new method for sentiment lexicon induction that is designed to be applicable
to the entire range of typological diversity of the world's languages. We evaluate our method …

Graph algorithms for multiparallel word alignment

A Imani, MJ Sabet, LK Şenel, P Dufter, F Yvon… - arXiv preprint arXiv …, 2021 - arxiv.org
With the advent of end-to-end deep learning approaches in machine translation, interest in
word alignments initially decreased; however, they have again become a focus of research …

Learning contextualised cross-lingual word embeddings and alignments for extremely low-resource languages using parallel corpora

T Wada, T Iwata, Y Matsumoto, T Baldwin… - arXiv preprint arXiv …, 2020 - arxiv.org
We propose a new approach for learning contextualised cross-lingual word embeddings
based on a small parallel corpus (e.g. a few hundred sentence pairs). Our method obtains …