Getting the most out of your tokenizer for pre-training and domain adaptation

G Dagan, G Synnaeve, B Roziere - arXiv preprint arXiv:2402.01035, 2024 - arxiv.org
Tokenization is an understudied and often neglected component of modern LLMs. Most
published works use a single tokenizer for all experiments, often borrowed from another …

Performance, energy consumption and costs: a comparative analysis of automatic text classification approaches in the Legal domain

L Rigutini, A Globo, M Stefanelli… - INTERNATIONAL …, 2024 - usiena-air.unisi.it
The common practice in Machine Learning research is to evaluate the top-performing
models based on their performance. However, this often leads to overlooking other crucial …

Generation with Dynamic Vocabulary

Y Liu, T Ji, C Sun, Y Wu, X Wang - arXiv preprint arXiv:2410.08481, 2024 - arxiv.org
We introduce a new dynamic vocabulary for language models. It can involve arbitrary text
spans during generation. These text spans act as basic generation bricks, akin to tokens in …

Efficient Online Inference of Vision Transformers by Training-Free Tokenization

L Gee, WY Li, V Sharmanska, N Quadrianto - arXiv preprint arXiv …, 2024 - arxiv.org
The cost of deploying vision transformers increasingly represents a barrier to wider industrial
adoption. Existing compression requires additional end-to-end fine-tuning or incurs a …

MorphBPE: A Morpho-Aware Tokenizer Bridging Linguistic Complexity for Efficient LLM Training Across Morphologies

E Asgari, YE Kheir, MAS Javaheri - arXiv preprint arXiv:2502.00894, 2025 - arxiv.org
Tokenization is fundamental to Natural Language Processing (NLP), directly impacting
model efficiency and linguistic fidelity. While Byte Pair Encoding (BPE) is widely used in …