Linguistically inspired roadmap for building biologically reliable protein language models
Deep neural-network-based language models (LMs) are increasingly applied to large-scale
protein sequence data to predict protein function. However, being largely black-box models …
Between words and characters: A brief history of open-vocabulary modeling and tokenization in NLP
What are the units of text that we want to model? From bytes to multi-word expressions, text
can be analyzed and generated at many granularities. Until recently, most natural language …
LLMs are good sign language translators
Sign Language Translation (SLT) is a challenging task that aims to translate sign
videos into spoken language. Inspired by the strong translation capabilities of large …
LBPE: Long-token-first tokenization to improve large language models
The prevalent use of Byte Pair Encoding (BPE) in Large Language Models (LLMs) facilitates
robust handling of subword units and avoids issues of out-of-vocabulary words. Despite its …
Subword evenness (SuE) as a predictor of cross-lingual transfer to low-resource languages
Pre-trained multilingual models, such as mBERT, XLM-R and mT5, are used to improve the
performance on various tasks in low-resource languages via cross-lingual transfer. In this …
Interpreting character embeddings with perceptual representations: The case of shape, sound, and color
S Boldsen, M Agirrezabal… - Proceedings of the 60th …, 2022 - aclanthology.org
Character-level information is included in many NLP models, but evaluating the information
encoded in character representations is an open issue. We leverage perceptual …
Languages through the looking glass of BPE compression
Byte-pair encoding (BPE) is widely used in NLP for performing subword tokenization. It
uncovers redundant patterns for compressing the data, and hence alleviates the sparsity …
Dialect representation learning with neural dialect-to-standard normalization
Language label tokens are often used in multilingual neural language modeling
and sequence-to-sequence learning to enhance the performance of such models. An …
TeDDi sample: Text data diversity sample for language comparison and multilingual NLP
We present the TeDDi sample, a diversity sample of text data for language comparison and
multilingual Natural Language Processing. The TeDDi sample currently features 89 …
Are you talking to ['xem'] or ['x','em']? On Tokenization and Addressing Misgendering in LLMs with Pronoun Tokenization Parity
A large body of NLP research has documented the ways gender biases manifest and
amplify within large language models (LLMs), though this research has predominantly …