Strong prediction: Language model surprisal explains multiple N400 effects

JA Michaelov, MD Bardolph, CK Van Petten… - Neurobiology of …, 2024 - direct.mit.edu
Theoretical accounts of the N400 are divided as to whether the amplitude of the N400
response to a stimulus reflects the extent to which the stimulus was predicted, the extent to …

Tokenization is more than compression

CW Schmidt, V Reddy, H Zhang, A Alameddine… - arXiv preprint arXiv …, 2024 - arxiv.org
Tokenization is a foundational step in natural language processing (NLP) tasks, bridging
raw text and language models. Existing tokenization approaches like Byte-Pair Encoding …

Tokenizer choice for LLM training: Negligible or crucial?

M Ali, M Fromm, K Thellmann, R Rutmann… - Findings of the …, 2024 - aclanthology.org
The recent success of large language models (LLMs) has been predominantly driven by
curating the training dataset composition, scaling model architectures and dataset sizes …

Unpacking tokenization: Evaluating text compression and its correlation with model performance

O Goldman, A Caciularu, M Eyal, K Cao… - arXiv preprint arXiv …, 2024 - arxiv.org
Despite compression being the cornerstone of BPE, the most common tokenization algorithm,
its importance in the tokenization process is still unclear. In this paper, we …

Greed is all you need: An evaluation of tokenizer inference methods

O Uzan, CW Schmidt, C Tanner, Y Pinter - arXiv preprint arXiv:2403.01289, 2024 - arxiv.org
While subword tokenizers such as BPE and WordPiece are typically used to build
vocabularies for NLP models, the method of decoding text into a sequence of tokens from …

Analyzing cognitive plausibility of subword tokenization

L Beinborn, Y Pinter - arXiv preprint arXiv:2310.13348, 2023 - arxiv.org
Subword tokenization has become the de facto standard for tokenization, although
comparative evaluations of subword vocabulary quality across languages are scarce …

Improving tokenisation by alternative treatment of spaces

E Gow-Smith, HT Madabushi, C Scarton… - arXiv preprint arXiv …, 2022 - arxiv.org
Tokenisation is the first step in almost all NLP tasks, and state-of-the-art transformer-based
language models all use subword tokenisation algorithms to process input text. Existing …

From Tokens to Words: On the Inner Lexicon of LLMs

G Kaplan, M Oren, Y Reif, R Schwartz - arXiv preprint arXiv:2410.05864, 2024 - arxiv.org
Natural language is composed of words, but modern LLMs process sub-words as input. A
natural question raised by this discrepancy is whether LLMs encode words internally, and if …

Subword Segmentation in LLMs: Looking at Inflection and Consistency

M Di Marco, A Fraser - Proceedings of the 2024 Conference on …, 2024 - aclanthology.org
The role of subword segmentation in capturing morphological patterns in LLMs is
currently not well explored. Ideally, one would train models like GPT using various …

Tokenization matters: Navigating data-scarce tokenization for gender inclusive language technologies

A Ovalle, N Mehrabi, P Goyal, J Dhamala… - arXiv preprint arXiv …, 2023 - arxiv.org
Gender-inclusive NLP research has documented the harmful limitations of gender binary-
centric large language models (LLMs), such as the inability to correctly use gender-diverse …