Tokenization is more than compression

CW Schmidt, V Reddy, H Zhang, A Alameddine… - ar…

… llm-driven testsuite for compiler validation
C Munley, A Jarmusch, S Chandrasekaran - Future Generation Computer …, 2024 - Elsevier
Large language models (LLMs) are a new and powerful tool for a wide span of applications
involving natural language and demonstrate impressive code generation abilities. The goal …

An Analysis of Tokenization: Transformers under Markov Data

N Rajaraman, J Jiao… - Advances in Neural …, 2025 - proceedings.neurips.cc
While there has been a large body of research attempting to circumvent tokenization for
language modeling (Clark et al. 2022, Xue et al. 2022), the current consensus is that it is a …

Deep Learning and Machine Learning--Natural Language Processing: From Theory to Application

K Chen, C Fei, Z Bi, J Liu, B Peng, S Zhang… - arxiv preprint arxiv …, 2024 - arxiv.org
With a focus on natural language processing (NLP) and the role of large language models
(LLMs), we explore the intersection of machine learning, deep learning, and artificial …

The foundations of tokenization: Statistical and computational concerns

JL Gastaldi, J Terilla, L Malagutti, B DuSell… - arxiv preprint arxiv …, 2024 - arxiv.org
Tokenization, the practice of converting strings of characters from an alphabet into
sequences of tokens over a vocabulary, is a critical step in the NLP pipeline. The use of …

Towards objective and unbiased decision assessments with llm-enhanced hierarchical attention networks

J Liu, KH Lim, RKW Lee - arxiv preprint arxiv:2411.08504, 2024 - arxiv.org
How objective and unbiased are we while making decisions? This work investigates
cognitive bias identification in high-stake decision making process by human experts …

Theoretical Analysis of Byte-Pair Encoding

L Kozma, J Voderholzer - arxiv preprint arxiv:2411.08671, 2024 - arxiv.org
Byte-Pair Encoding (BPE) is a widely used method for subword tokenization, with origins in
grammar-based text compression. It is employed in a variety of language processing tasks …