Strong prediction: Language model surprisal explains multiple N400 effects
Theoretical accounts of the N400 are divided as to whether the amplitude of the N400
response to a stimulus reflects the extent to which the stimulus was predicted, the extent to …
Tokenization is more than compression
Tokenization is a foundational step in natural language processing (NLP) tasks, bridging
raw text and language models. Existing tokenization approaches like Byte-Pair Encoding …
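Several of the entries above and below center on Byte-Pair Encoding. As a point of reference only, here is a minimal from-scratch sketch of the BPE merge-learning loop in Python; the toy corpus, merge count, and the "</w>" end-of-word marker are illustrative choices, not the setup used in any of the listed papers.

```python
from collections import Counter

def bpe_train(corpus, num_merges):
    """Learn BPE merges from a toy corpus (minimal illustrative sketch)."""
    # Represent each word as a tuple of symbols, with an end-of-word marker.
    vocab = Counter()
    for word in corpus.split():
        vocab[tuple(word) + ("</w>",)] += 1

    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the winning merge everywhere it occurs.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges

# Toy usage: frequent pairs like ('l', 'o') get merged first.
print(bpe_train("low lower lowest low low", num_merges=5))
```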
Tokenizer choice for llm training: Negligible or crucial?
The recent success of large language models (LLMs) has been predominantly driven by
curating the training dataset composition, scaling of model architectures and dataset sizes …
Unpacking tokenization: Evaluating text compression and its correlation with model performance
Despite it being the cornerstone of BPE, the most common tokenization algorithm, the
importance of compression in the tokenization process is still unclear. In this paper, we …
Greed is all you need: An evaluation of tokenizer inference methods
While subword tokenizers such as BPE and WordPiece are typically used to build
vocabularies for NLP models, the method of decoding text into a sequence of tokens from …
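The entry above concerns tokenizer inference: how text is segmented at run time given an already-trained vocabulary. A minimal sketch of greedy left-to-right longest-match segmentation, one kind of inference method such evaluations compare, follows; the example vocabulary and the "[UNK]" fallback are hypothetical, not taken from the paper.

```python
def greedy_longest_match(word, vocab, unk="[UNK]"):
    """Greedy left-to-right longest-match segmentation (illustrative sketch)."""
    tokens, i = [], 0
    while i < len(word):
        # Take the longest substring starting at i that is in the vocabulary.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:
            # No vocabulary entry matches: emit an unknown-token placeholder.
            tokens.append(unk)
            i += 1
    return tokens

# Hypothetical vocabulary; the greedy strategy picks the longest match at each step.
vocab = {"un", "happi", "happiness", "ness", "s"}
print(greedy_longest_match("unhappiness", vocab))  # ['un', 'happiness']
```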
Analyzing cognitive plausibility of subword tokenization
Subword tokenization has become the de-facto standard for tokenization, although
comparative evaluations of subword vocabulary quality across languages are scarce …
Improving tokenisation by alternative treatment of spaces
Tokenisation is the first step in almost all NLP tasks, and state-of-the-art transformer-based
language models all use subword tokenisation algorithms to process input text. Existing …
From Tokens to Words: On the Inner Lexicon of LLMs
Natural language is composed of words, but modern LLMs process sub-words as input. A
natural question raised by this discrepancy is whether LLMs encode words internally, and if …
Subword Segmentation in LLMs: Looking at Inflection and Consistency
M Di Marco, A Fraser - Proceedings of the 2024 Conference on …, 2024 - aclanthology.org
The role of subword segmentation in relation to capturing morphological patterns in LLMs is
currently not well explored. Ideally, one would train models like GPT using various …
Tokenization matters: Navigating data-scarce tokenization for gender inclusive language technologies
Gender-inclusive NLP research has documented the harmful limitations of gender binary-
centric large language models (LLMs), such as the inability to correctly use gender-diverse …