Google Scholar

Between words and characters: A brief history of open-vocabulary modeling and tokenization in NLP

SJ Mielke, Z Alyafeai, E Salesky, C Raffel… - arXiv preprint arXiv …, 2021 - arxiv.org

What are the units of text that we want to model? From bytes to multi-word expressions, text
can be analyzed and generated at many granularities. Until recently, most natural language …

Cited by 116 · 6 versions

[PDF] mit.edu

Languages through the looking glass of BPE compression

X Gutierrez-Vasques, C Bentz… - Computational …, 2023 - direct.mit.edu

Byte-pair encoding (BPE) is widely used in NLP for performing subword tokenization. It
uncovers redundant patterns for compressing the data, and hence alleviates the sparsity …

Cited by 21 · 8 versions

[PDF] ox.ac.uk

An embarrassingly simple method to mitigate undesirable properties of pretrained language model tokenizers

V Hofmann, H Schuetze, JB Pierrehumbert - 2022 - ora.ox.ac.uk

We introduce FLOTA (Few Longest Token Approximation), a simple yet effective method to
improve the tokenization of pretrained language models (PLMs). FLOTA uses the …

Cited by 44 · 4 versions

[PDF] arxiv.org

How can NLP help revitalize endangered languages? A case study and roadmap for the Cherokee language

S Zhang, B Frey, M Bansal - arXiv preprint arXiv:2204.11909, 2022 - arxiv.org

More than 43% of the languages spoken in the world are endangered, and language loss
currently occurs at an accelerated rate because of globalization and neocolonialism. Saving …

Cited by 31 · 5 versions

[PDF] arxiv.org

DivEMT: Neural machine translation post-editing effort across typologically diverse languages

G Sarti, A Bisazza, AG Arenas, A Toral - arXiv preprint arXiv:2205.12215, 2022 - arxiv.org

We introduce DivEMT, the first publicly available post-editing study of Neural Machine
Translation (NMT) over a typologically diverse set of target languages. Using a strictly …

Cited by 12 · 6 versions

[PDF] plos.org

A survey on text classification: Practical perspectives on the Italian language

A Gasparetto, A Zangari, M Marcuzzo, A Albarelli - PLOS ONE, 2022 - journals.plos.org

Text Classification methods have been improving at an unparalleled speed in the last
decade thanks to the success brought about by deep learning. Historically, state-of-the-art …

Cited by 9 · 9 versions

[PDF] aclanthology.org

Beyond characters: Subword-level morpheme segmentation

B Peters, AFT Martins - … of the 19th SIGMORPHON Workshop on …, 2022 - aclanthology.org

This paper presents DeepSPIN's submissions to the SIGMORPHON 2022 Shared Task on
Morpheme Segmentation. We make three submissions, all to the word-level subtask. First …

Cited by 11 · 4 versions

[PDF] arxiv.org

Quantifying synthesis and fusion and their impact on machine translation

A Oncevay, D Ataman, N Van Berkel, B Haddow… - arXiv preprint arXiv …, 2022 - arxiv.org

Theoretical work in morphological typology offers the possibility of measuring morphological
diversity on a continuous scale. However, literature in Natural Language Processing (NLP) …

Cited by 6 · 9 versions

[HTML] mdpi.com

DeB3RTa: A Transformer-Based Model for the Portuguese Financial Domain

H Pires, L Paucar, JP Carvalho - Big Data and Cognitive Computing, 2025 - mdpi.com

The complex and specialized terminology of financial language in Portuguese-speaking
markets creates significant challenges for natural language processing (NLP) applications …

[PDF] arxiv.org

Impact of subword pooling strategy on cross-lingual event detection

S Agarwal, S Fincke, C Jenkins, S Miller… - arXiv preprint arXiv …, 2023 - arxiv.org

Pre-trained multilingual language models (e.g., mBERT, XLM-RoBERTa) have significantly
advanced the state-of-the-art for zero-shot cross-lingual information extraction. These …

Cited by 2 · 3 versions

How suitable are subword segmentation strategies for translating non-concatenative morphology?

Between words and characters: A brief history of open-vocabulary modeling and tokenization in NLP