Spacebyte: Towards deleting tokenization from large language modeling
K Slagle - Advances in Neural Information Processing …, 2025 - proceedings.neurips.cc
Tokenization is widely used in large language models because it significantly improves
performance. However, tokenization imposes several disadvantages, such as performance …
Learn your tokens: word-pooled tokenization for language modeling
Language models typically tokenize text into subwords, using a deterministic, hand-
engineered heuristic of combining characters into longer surface-level strings such as 'ing' or …
Enhancing Large Language Models through Adaptive Tokenizers
Tokenizers serve as crucial interfaces between models and linguistic data, substantially
influencing the efficacy and precision of large language models (LLMs). Traditional …
From characters to words: Hierarchical pre-trained language model for open-vocabulary language understanding
Current state-of-the-art models for natural language understanding require a preprocessing
step to convert raw text into discrete tokens. This process known as tokenization relies on a …
Manta: Efficient gradient-based tokenization for robust end-to-end language modeling
Static subword tokenization algorithms have been an essential component of recent works
on language modeling. However, their static nature results in important flaws that degrade …
Beyond Literal Token Overlap: Token Alignability for Multilinguality
Previous work has considered token overlap, or even similarity of token distributions, as
predictors for multilinguality and cross-lingual knowledge transfer in language models …
Hybrid Tokenization Strategy for Turkish Abstractive Text Summarization
Text summarization is a significant topic in natural language processing. Tokenization
approaches are important in this regard as they underpin text recognition and processing …
Nlp approaches for Cross Linguistic Information Retrieval from Tamil to English
G Rekha, D Malathi - AIP Conference Proceedings, 2024 - pubs.aip.org
With the use of Cross Linguistic Information Retrieval (CLIR) technology, users can ask a
question in one language and get answers to their request in a different language. For Tamil …
What's new in Computational Linguistics: an overview
I Salogni - 2024 - ilariasalogni.github.io
I wrote this text as a paper for the Digital Culture Seminar of the Digital Humanities faculty of
the University of Pisa, taking the opportunity to deepen and complete in a personal research …