Google Scholar

Spacebyte: Towards deleting tokenization from large language modeling

K Slagle - Advances in Neural Information Processing …, 2025 - proceedings.neurips.cc

Tokenization is widely used in large language models because it significantly improves
performance. However, tokenization imposes several disadvantages, such as performance …

Cited by 4

Learn your tokens: word-pooled tokenization for language modeling

A Thawani, S Ghanekar, X Zhu, J Pujara - arXiv preprint arXiv:2310.11628, 2023 - arxiv.org

Language models typically tokenize text into subwords, using a deterministic,
hand-engineered heuristic of combining characters into longer surface-level strings such as 'ing' or …

Cited by 7

Enhancing Large Language Models through Adaptive Tokenizers

M Zheng, H Chen, T Guo, C Zhu… - Advances in …, 2025 - proceedings.neurips.cc

Tokenizers serve as crucial interfaces between models and linguistic data, substantially
influencing the efficacy and precision of large language models (LLMs). Traditional …

From characters to words: Hierarchical pre-trained language model for open-vocabulary language understanding

L Sun, F Luisier, K Batmanghelich, D Florencio… - arXiv preprint arXiv …, 2023 - arxiv.org

Current state-of-the-art models for natural language understanding require a preprocessing
step to convert raw text into discrete tokens. This process known as tokenization relies on a …

Cited by 7

Manta: Efficient gradient-based tokenization for robust end-to-end language modeling

N Godey, R Castagné, É de la Clergerie… - arXiv preprint arXiv …, 2022 - arxiv.org

Static subword tokenization algorithms have been an essential component of recent works
on language modeling. However, their static nature results in important flaws that degrade …

Cited by 9

Beyond Literal Token Overlap: Token Alignability for Multilinguality

K Hämmerl, T Limisiewicz, J Libovický… - arXiv preprint arXiv …, 2025 - arxiv.org

Previous work has considered token overlap, or even similarity of token distributions, as
predictors for multilinguality and cross-lingual knowledge transfer in language models …

Hybrid Tokenization Strategy for Turkish Abstractive Text Summarization

NZ Kayalı, Sİ Omurca - 2024 8th International Artificial …, 2024 - ieeexplore.ieee.org

Text summarization is a significant topic in natural language processing. Tokenization
approaches are important in this regard as they underpin text recognition and processing …

NLP approaches for Cross Linguistic Information Retrieval from Tamil to English

G Rekha, D Malathi - AIP Conference Proceedings, 2024 - pubs.aip.org

With the use of Cross Linguistic Information Retrieval (CLIR) technology, users can ask a
question in one language and get answers to their request in a different language. For Tamil …

[PDF] What's new in Computational Linguistics: an overview

I Salogni - 2024 - ilariasalogni.github.io

I wrote this text as a paper for the Digital Culture Seminar of the Digital Humanities faculty of
the University of Pisa, taking the opportunity to deepen and complete in a personal research …

A vocabulary-free multilingual neural tokenizer for end-to-end task learning

Spacebyte: Towards deleting tokenization from large language modeling
