Turnitin
降AI改写
早检测系统
早降重系统
Turnitin-UK版
万方检测-期刊版
维普编辑部版
Grammarly检测
Paperpass检测
checkpass检测
PaperYY检测
Between words and characters: A brief history of open-vocabulary modeling and tokenization in NLP
What are the units of text that we want to model? From bytes to multi-word expressions, text
can be analyzed and generated at many granularities. Until recently, most natural language …
can be analyzed and generated at many granularities. Until recently, most natural language …
Languages through the looking glass of BPE compression
Byte-pair encoding (BPE) is widely used in NLP for performing subword tokenization. It
uncovers redundant patterns for compressing the data, and hence alleviates the sparsity …
uncovers redundant patterns for compressing the data, and hence alleviates the sparsity …
An embarrassingly simple method to mitigate undesirable properties of pretrained language model tokenizers
We introduce FLOTA (Few Longest Token Approximation), a simple yet effective method to
improve the tokenization of pretrained language models (PLMs). FLOTA uses the …
improve the tokenization of pretrained language models (PLMs). FLOTA uses the …
How can NLP help revitalize endangered languages? A case study and roadmap for the Cherokee language
More than 43% of the languages spoken in the world are endangered, and language loss
currently occurs at an accelerated rate because of globalization and neocolonialism. Saving …
currently occurs at an accelerated rate because of globalization and neocolonialism. Saving …
DivEMT: Neural machine translation post-editing effort across typologically diverse languages
We introduce DivEMT, the first publicly available post-editing study of Neural Machine
Translation (NMT) over a typologically diverse set of target languages. Using a strictly …
Translation (NMT) over a typologically diverse set of target languages. Using a strictly …
A survey on text classification: Practical perspectives on the Italian language
Text Classification methods have been improving at an unparalleled speed in the last
decade thanks to the success brought about by deep learning. Historically, state-of-the-art …
decade thanks to the success brought about by deep learning. Historically, state-of-the-art …
Beyond characters: Subword-level morpheme segmentation
This paper presents DeepSPIN's submissions to the SIGMORPHON 2022 Shared Task on
Morpheme Segmentation. We make three submissions, all to the word-level subtask. First …
Morpheme Segmentation. We make three submissions, all to the word-level subtask. First …
Quantifying synthesis and fusion and their impact on machine translation
Theoretical work in morphological typology offers the possibility of measuring morphological
diversity on a continuous scale. However, literature in Natural Language Processing (NLP) …
diversity on a continuous scale. However, literature in Natural Language Processing (NLP) …
[HTML][HTML] DeB3RTa: A Transformer-Based Model for the Portuguese Financial Domain
The complex and specialized terminology of financial language in Portuguese-speaking
markets create significant challenges for natural language processing (NLP) applications …
markets create significant challenges for natural language processing (NLP) applications …
Impact of subword pooling strategy on cross-lingual event detection
Pre-trained multilingual language models (eg, mBERT, XLM-RoBERTa) have significantly
advanced the state-of-the-art for zero-shot cross-lingual information extraction. These …
advanced the state-of-the-art for zero-shot cross-lingual information extraction. These …