SpaceByte: Towards deleting tokenization from large language modeling

K Slagle - Advances in Neural Information Processing …, 2025 - proceedings.neurips.cc
Tokenization is widely used in large language models because it significantly improves
performance. However, tokenization imposes several disadvantages, such as performance …

Learn your tokens: word-pooled tokenization for language modeling

A Thawani, S Ghanekar, X Zhu, J Pujara - arXiv preprint arXiv:2310.11628, 2023 - arxiv.org
Language models typically tokenize text into subwords, using a deterministic, hand-engineered
heuristic of combining characters into longer surface-level strings such as 'ing' or …

Enhancing Large Language Models through Adaptive Tokenizers

M Zheng, H Chen, T Guo, C Zhu… - Advances in …, 2025 - proceedings.neurips.cc
Tokenizers serve as crucial interfaces between models and linguistic data, substantially
influencing the efficacy and precision of large language models (LLMs). Traditional …

From characters to words: Hierarchical pre-trained language model for open-vocabulary language understanding

L Sun, F Luisier, K Batmanghelich, D Florencio… - arXiv preprint arXiv …, 2023 - arxiv.org
Current state-of-the-art models for natural language understanding require a preprocessing
step to convert raw text into discrete tokens. This process known as tokenization relies on a …

MANTa: Efficient gradient-based tokenization for robust end-to-end language modeling

N Godey, R Castagné, É de la Clergerie… - arXiv preprint arXiv …, 2022 - arxiv.org
Static subword tokenization algorithms have been an essential component of recent works
on language modeling. However, their static nature results in important flaws that degrade …

Beyond Literal Token Overlap: Token Alignability for Multilinguality

K Hämmerl, T Limisiewicz, J Libovický… - arXiv preprint arXiv …, 2025 - arxiv.org
Previous work has considered token overlap, or even similarity of token distributions, as
predictors for multilinguality and cross-lingual knowledge transfer in language models …

Hybrid Tokenization Strategy for Turkish Abstractive Text Summarization

NZ Kayalı, Sİ Omurca - 2024 8th International Artificial …, 2024 - ieeexplore.ieee.org
Text summarization is a significant topic in natural language processing. Tokenization
approaches are important in this regard as they underpin text recognition and processing …

NLP approaches for Cross Linguistic Information Retrieval from Tamil to English

G Rekha, D Malathi - AIP Conference Proceedings, 2024 - pubs.aip.org
With the use of Cross Linguistic Information Retrieval (CLIR) technology, users can ask a
question in one language and get answers to their request in a different language. For Tamil …

What's new in Computational Linguistics: an overview

I Salogni - 2024 - ilariasalogni.github.io
I wrote this text as a paper for the Digital Culture Seminar of the Digital Humanities faculty of
the University of Pisa, taking the opportunity to deepen and complete in a personal research …