Cross-lingual language model pretraining
Recent studies have demonstrated the efficiency of generative pretraining for English
natural language understanding. In this work, we extend this approach to multiple …
XNLI: Evaluating cross-lingual sentence representations
State-of-the-art natural language processing systems rely on supervision in the form of
annotated data to learn competent models. These models are generally trained on data in a …
Learning word vectors for 157 languages
Distributed word representations, or word vectors, have recently been applied to many tasks
in natural language processing, leading to state-of-the-art performance. A key ingredient to …
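As a minimal illustration of using the distributed word representations this paper releases, the sketch below reads a fastText-style `.vec` text file (first line "count dim", then one "word v1 … vd" row per word) and compares two words by cosine similarity; the file name, vocabulary limit, and word pair are illustrative assumptions, not part of the paper.

```python
# Minimal sketch: load a fastText-style .vec text file and compute
# cosine similarity between two word vectors. The path is hypothetical;
# real pretrained files are distributed at fasttext.cc.
import numpy as np

def load_vectors(path, limit=50000):
    vecs = {}
    with open(path, encoding="utf-8") as f:
        next(f)  # skip the "count dim" header line
        for i, line in enumerate(f):
            if i >= limit:  # cap memory for this toy example
                break
            word, *nums = line.rstrip().split(" ")
            vecs[word] = np.asarray(nums, dtype=np.float32)
    return vecs

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

vecs = load_vectors("cc.en.300.vec")  # hypothetical local copy
print(cosine(vecs["king"], vecs["queen"]))
```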
Unicoder: A universal language encoder by pre-training with multiple cross-lingual tasks
We present Unicoder, a universal language encoder that is insensitive to different
languages. Given an arbitrary NLP task, a model can be trained with Unicoder using training …
Fast wordpiece tokenization
Tokenization is a fundamental preprocessing step for almost all NLP tasks. In this paper, we
propose efficient algorithms for the WordPiece tokenization used in BERT, from single-word …
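For context, the baseline this paper speeds up is BERT's greedy longest-match-first WordPiece procedure; below is a minimal sketch of that baseline, with a toy vocabulary and the standard "##" continuation prefix as illustrative assumptions (the paper proposes faster, linear-time algorithms; this naive quadratic loop only shows the reference behavior).

```python
# Minimal sketch of greedy longest-match-first WordPiece tokenization.
# The vocabulary below is a toy assumption; BERT's real vocab has ~30k entries.
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, shrinking until a vocab hit.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark a continuation piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matched: emit the unknown token
        tokens.append(piece)
        start = end
    return tokens

vocab = {"un", "##aff", "##able"}
print(wordpiece_tokenize("unaffable", vocab))  # ['un', '##aff', '##able']
```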
[BOOK][B] Handbook of natural language processing
N Indurkhya, FJ Damerau - 2010 - taylorfrancis.com
The Handbook of Natural Language Processing, Second Edition presents practical tools
and techniques for implementing natural language processing in computer systems. Along …
Cross-lingual natural language generation via pre-training
In this work we focus on transferring supervision signals of natural language generation
(NLG) tasks between multiple languages. We propose to pretrain the encoder and the …
Mining quality phrases from massive text corpora
Text data are ubiquitous and play an essential role in big data applications. However, text
data are mostly unstructured. Transforming unstructured text into structured units (e.g. …
MalBERTv2: Code-aware BERT-based model for malware identification
To proactively mitigate malware threats, cybersecurity tools, such as anti-virus and anti-
malware software, as well as firewalls, require frequent updates and proactive …
Bi-directional LSTM recurrent neural network for Chinese word segmentation
Y Yao, Z Huang - … : 23rd International Conference, ICONIP 2016, Kyoto …, 2016 - Springer
Recurrent neural networks (RNNs) have been broadly applied to natural language processing
(NLP) problems. This kind of neural network is designed for modeling sequential data and …
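To make the architecture concrete, here is a minimal sketch of a character-level bidirectional LSTM tagger for word segmentation in PyTorch; the layer sizes and the four-label (B/M/E/S) tagging scheme are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of a character-level Bi-LSTM segmentation tagger:
# each input character gets a label logit vector (e.g. B/M/E/S).
import torch
import torch.nn as nn

class BiLSTMSegmenter(nn.Module):
    def __init__(self, vocab_size, embed_dim=64, hidden=128, n_labels=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden, batch_first=True,
                            bidirectional=True)  # reads both directions
        self.out = nn.Linear(2 * hidden, n_labels)  # concat of both directions

    def forward(self, char_ids):           # (batch, seq_len) int64 char ids
        h, _ = self.lstm(self.embed(char_ids))
        return self.out(h)                 # (batch, seq_len, n_labels) logits

model = BiLSTMSegmenter(vocab_size=5000)
logits = model(torch.randint(0, 5000, (2, 10)))
print(logits.shape)  # torch.Size([2, 10, 4])
```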