Linguistically inspired roadmap for building biologically reliable protein language models

MH Vu, R Akbar, PA Robert, B Swiatczak… - Nature Machine …, 2023 - nature.com
Deep neural-network-based language models (LMs) are increasingly applied to large-scale
protein sequence data to predict protein function. However, being largely black-box models …

Between words and characters: A brief history of open-vocabulary modeling and tokenization in NLP

SJ Mielke, Z Alyafeai, E Salesky, C Raffel… - arXiv preprint arXiv …, 2021 - arxiv.org
What are the units of text that we want to model? From bytes to multi-word expressions, text
can be analyzed and generated at many granularities. Until recently, most natural language …

LLMs are good sign language translators

J Gong, LG Foo, Y He… - Proceedings of the …, 2024 - openaccess.thecvf.com
Abstract Sign Language Translation (SLT) is a challenging task that aims to translate sign
videos into spoken language. Inspired by the strong translation capabilities of large …

LBPE: Long-token-first tokenization to improve large language models

H Lian, Y Xiong, Z Lin, J Niu, S Mo, H Chen… - arXiv preprint arXiv …, 2024 - arxiv.org
The prevalent use of Byte Pair Encoding (BPE) in Large Language Models (LLMs) facilitates
robust handling of subword units and avoids issues of out-of-vocabulary words. Despite its …

Subword evenness (SuE) as a predictor of cross-lingual transfer to low-resource languages

O Pelloni, A Shaitarova… - Proceedings of the 2022 …, 2022 - aclanthology.org
Pre-trained multilingual models, such as mBERT, XLM-R and mT5, are used to improve the
performance on various tasks in low-resource languages via cross-lingual transfer. In this …

Interpreting character embeddings with perceptual representations: The case of shape, sound, and color

S Boldsen, M Agirrezabal… - Proceedings of the 60th …, 2022 - aclanthology.org
Character-level information is included in many NLP models, but evaluating the information
encoded in character representations is an open issue. We leverage perceptual …

Languages through the looking glass of BPE compression

X Gutierrez-Vasques, C Bentz… - Computational …, 2023 - direct.mit.edu
Byte-pair encoding (BPE) is widely used in NLP for performing subword tokenization. It
uncovers redundant patterns for compressing the data, and hence alleviates the sparsity …

Dialect representation learning with neural dialect-to-standard normalization

O Kuparinen, Y Scherrer - Tenth Workshop on NLP for Similar …, 2023 - aclanthology.org
Abstract Language label tokens are often used in multilingual neural language modeling
and sequence-to-sequence learning to enhance the performance of such models. An …

TeDDi sample: Text data diversity sample for language comparison and multilingual NLP

S Moran, C Bentz, X Gutierrez-Vasques… - Proceedings of the …, 2022 - aclanthology.org
We present the TeDDi sample, a diversity sample of text data for language comparison and
multilingual Natural Language Processing. The TeDDi sample currently features 89 …

Are you talking to ['xem'] or ['x','em']? On Tokenization and Addressing Misgendering in LLMs with Pronoun Tokenization Parity

A Ovalle, N Mehrabi, P Goyal, J Dhamala… - arXiv preprint arXiv …, 2023 - arxiv.org
A large body of NLP research has documented the ways gender biases manifest and
amplify within large language models (LLMs), though this research has predominantly …