Getting the most out of your tokenizer for pre-training and domain adaptation

G Dagan, G Synnaeve, B Roziere - arXiv preprint arXiv:2402.01035, 2024 - arxiv.org
Tokenization is an understudied and often neglected component of modern LLMs. Most
published works use a single tokenizer for all experiments, often borrowed from another …

Performance, energy consumption and costs: a comparative analysis of automatic text classification approaches in the Legal domain

L Rigutini, A Globo, M Stefanelli… - INTERNATIONAL …, 2024 - usiena-air.unisi.it
The common practice in Machine Learning research is to evaluate the top-performing
models based on their performance. However, this often leads to overlooking other crucial …

Generation with Dynamic Vocabulary

Y Liu, T Ji, C Sun, Y Wu, X Wang - arXiv preprint arXiv:2410.08481, 2024 - arxiv.org
We introduce a new dynamic vocabulary for language models. It can involve arbitrary text
spans during generation. These text spans act as basic generation bricks, akin to tokens in …

Efficient Online Inference of Vision Transformers by Training-Free Tokenization

L Gee, WY Li, V Sharmanska, N Quadrianto - arXiv preprint arXiv …, 2024 - arxiv.org
The cost of deploying vision transformers increasingly represents a barrier to wider industrial
adoption. Existing compression requires additional end-to-end fine-tuning or incurs a …

MorphBPE: A Morpho-Aware Tokenizer Bridging Linguistic Complexity for Efficient LLM Training Across Morphologies

E Asgari, YE Kheir, MAS Javaheri - arXiv preprint arXiv:2502.00894, 2025 - arxiv.org
Tokenization is fundamental to Natural Language Processing (NLP), directly impacting
model efficiency and linguistic fidelity. While Byte Pair Encoding (BPE) is widely used in …