Strong prediction: Language model surprisal explains multiple N400 effects

JA Michaelov, MD Bardolph, CK Van Petten… - Neurobiology of …, 2024 - direct.mit.edu
Theoretical accounts of the N400 are divided as to whether the amplitude of the N400
response to a stimulus reflects the extent to which the stimulus was predicted, the extent to …

Tokenization is more than compression

CW Schmidt, V Reddy, H Zhang, A Alameddine… - arXiv preprint arXiv …, 2024 - arxiv.org
Tokenization is a foundational step in natural language processing (NLP) tasks, bridging
raw text and language models. Existing tokenization approaches like Byte-Pair Encoding …

Tokenizer choice for LLM training: Negligible or crucial?

M Ali, M Fromm, K Thellmann, R Rutmann… - Findings of the …, 2024 - aclanthology.org
The recent success of large language models (LLMs) has been predominantly driven by
curating the training dataset composition, scaling model architectures and dataset sizes …

Unpacking tokenization: Evaluating text compression and its correlation with model performance

O Goldman, A Caciularu, M Eyal, K Cao… - arXiv preprint arXiv …, 2024 - arxiv.org
Despite compression being the cornerstone of BPE, the most common tokenization algorithm,
its importance in the tokenization process is still unclear. In this paper, we …

Greed is all you need: An evaluation of tokenizer inference methods

O Uzan, CW Schmidt, C Tanner, Y Pinter - arXiv preprint arXiv:2403.01289, 2024 - arxiv.org
While subword tokenizers such as BPE and WordPiece are typically used to build
vocabularies for NLP models, the method of decoding text into a sequence of tokens from …

Analyzing cognitive plausibility of subword tokenization

L Beinborn, Y Pinter - arXiv preprint arXiv:2310.13348, 2023 - arxiv.org
Subword tokenization has become the de facto standard for tokenization, although
comparative evaluations of subword vocabulary quality across languages are scarce …

Improving tokenisation by alternative treatment of spaces

E Gow-Smith, HT Madabushi, C Scarton… - arXiv preprint arXiv …, 2022 - arxiv.org
Tokenisation is the first step in almost all NLP tasks, and state-of-the-art transformer-based
language models all use subword tokenisation algorithms to process input text. Existing …

From Tokens to Words: On the Inner Lexicon of LLMs

G Kaplan, M Oren, Y Reif, R Schwartz - arXiv preprint arXiv:2410.05864, 2024 - arxiv.org
Natural language is composed of words, but modern LLMs process sub-words as input. A
natural question raised by this discrepancy is whether LLMs encode words internally, and if …

Subword Segmentation in LLMs: Looking at Inflection and Consistency

M Di Marco, A Fraser - Proceedings of the 2024 Conference on …, 2024 - aclanthology.org
The role of subword segmentation in capturing morphological patterns in LLMs is
currently not well explored. Ideally, one would train models like GPT using various …

Tokenization matters: Navigating data-scarce tokenization for gender inclusive language technologies

A Ovalle, N Mehrabi, P Goyal, J Dhamala… - arXiv preprint arXiv …, 2023 - arxiv.org
Gender-inclusive NLP research has documented the harmful limitations of gender binary-
centric large language models (LLMs), such as the inability to correctly use gender-diverse …