Cross-lingual language model pretraining

A Conneau, G Lample - Advances in Neural Information …, 2019 - proceedings.neurips.cc
Recent studies have demonstrated the efficiency of generative pretraining for English
natural language understanding. In this work, we extend this approach to multiple …

XNLI: Evaluating cross-lingual sentence representations

A Conneau, G Lample, R Rinott, A Williams… - arXiv preprint arXiv …, 2018 - arxiv.org
State-of-the-art natural language processing systems rely on supervision in the form of
annotated data to learn competent models. These models are generally trained on data in a …

Learning word vectors for 157 languages

E Grave, P Bojanowski, P Gupta, A Joulin… - arXiv preprint arXiv …, 2018 - arxiv.org
Distributed word representations, or word vectors, have recently been applied to many tasks
in natural language processing, leading to state-of-the-art performance. A key ingredient to …
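Word vectors like those released with this paper are typically queried by cosine similarity to find semantically related words. The sketch below shows that lookup in pure Python; the toy 3-dimensional vectors are invented for illustration and are not taken from the 157-language fastText release.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest(word, vectors):
    """Return the most cosine-similar word to `word`, excluding itself."""
    query = vectors[word]
    return max((w for w in vectors if w != word),
               key=lambda w: cosine(query, vectors[w]))

# Toy vectors (assumed for illustration); real fastText vectors are 300-d.
vectors = {
    "king":  [0.8, 0.6, 0.1],
    "queen": [0.7, 0.7, 0.1],
    "apple": [0.1, 0.2, 0.9],
}
print(nearest("king", vectors))  # queen
```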

Unicoder: A universal language encoder by pre-training with multiple cross-lingual tasks

H Huang, Y Liang, N Duan, M Gong, L Shou… - arXiv preprint arXiv …, 2019 - arxiv.org
We present Unicoder, a universal language encoder that is insensitive to different
languages. Given an arbitrary NLP task, a model can be trained with Unicoder using training …

Fast WordPiece tokenization

X Song, A Salcianu, Y Song, D Dopson… - arXiv preprint arXiv …, 2020 - arxiv.org
Tokenization is a fundamental preprocessing step for almost all NLP tasks. In this paper, we
propose efficient algorithms for the WordPiece tokenization used in BERT, from single-word …
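The baseline this paper accelerates is BERT's greedy longest-match-first WordPiece algorithm: repeatedly take the longest vocabulary entry that prefixes the remaining word, marking continuation pieces with "##". A minimal sketch, assuming a toy vocabulary (the paper's contribution is a faster trie-based variant, not shown here):

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Split one word into subword pieces by greedy longest-match-first."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        matched = None
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry the ## prefix
            if piece in vocab:
                matched = piece
                break
            end -= 1  # shrink the candidate until it is in the vocabulary
        if matched is None:
            return [unk]  # no piece matched: the whole word is unknown
        pieces.append(matched)
        start = end
    return pieces

vocab = {"un", "##aff", "##able", "play", "##ing"}  # toy vocabulary (assumed)
print(wordpiece_tokenize("unaffable", vocab))  # ['un', '##aff', '##able']
print(wordpiece_tokenize("playing", vocab))    # ['play', '##ing']
```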

[BOOK] Handbook of natural language processing

N Indurkhya, FJ Damerau - 2010 - taylorfrancis.com
The Handbook of Natural Language Processing, Second Edition presents practical tools
and techniques for implementing natural language processing in computer systems. Along …

Cross-lingual natural language generation via pre-training

Z Chi, L Dong, F Wei, W Wang, XL Mao… - Proceedings of the AAAI …, 2020 - ojs.aaai.org
In this work we focus on transferring supervision signals of natural language generation
(NLG) tasks between multiple languages. We propose to pretrain the encoder and the …

Mining quality phrases from massive text corpora

J Liu, J Shang, C Wang, X Ren, J Han - Proceedings of the 2015 ACM …, 2015 - dl.acm.org
Text data are ubiquitous and play an essential role in big data applications. However, text
data are mostly unstructured. Transforming unstructured text into structured units (e.g., …

MalBERTv2: Code aware BERT-based model for malware identification

A Rahali, MA Akhloufi - Big Data and Cognitive Computing, 2023 - mdpi.com
To proactively mitigate malware threats, cybersecurity tools, such as anti-virus and anti-
malware software, as well as firewalls, require frequent updates and proactive …

Bi-directional LSTM recurrent neural network for Chinese word segmentation

Y Yao, Z Huang - … : 23rd International Conference, ICONIP 2016, Kyoto …, 2016 - Springer
Recurrent neural network (RNN) has been broadly applied to natural language processing
(NLP) problems. This kind of neural network is designed for modeling sequential data and …
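Chinese word segmentation is commonly cast as per-character tagging with a B/M/E/S scheme (Begin/Middle/End of a word, or Single-character word), which the BiLSTM predicts character by character. The sketch below shows only the final decoding step, turning a tag sequence back into words; the example sentence and its tags are illustrative assumptions, not output from the paper's model.

```python
def decode_bmes(chars, tags):
    """Group characters into words according to their B/M/E/S tags."""
    words, current = [], []
    for ch, tag in zip(chars, tags):
        current.append(ch)
        if tag in ("E", "S"):  # a word ends on E (end) or S (single-char word)
            words.append("".join(current))
            current = []
    if current:  # flush a trailing word left open by a malformed tag sequence
        words.append("".join(current))
    return words

# Illustrative example (assumed): "我爱北京" segmented as 我 / 爱 / 北京
print(decode_bmes("我爱北京", ["S", "S", "B", "E"]))  # ['我', '爱', '北京']
```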