Encoder-decoder methods for text normalization

M Lusetti, T Ruzsics, A Göhring, T Samardžić, E Stark - 2018 - zora.uzh.ch
Text normalization is the task of mapping non-canonical language, typical of speech
transcription and computer-mediated communication, to a standardized writing. It is an up …

Dialect-to-standard normalization: A large-scale multilingual evaluation

O Kuparinen, A Miletić, Y Scherrer - Findings of the Association for …, 2023 - aclanthology.org
Text normalization methods have been commonly applied to historical language or user-
generated content, but less often to dialectal transcriptions. In this paper, we introduce …

All Mixed Up? Finding the Optimal Feature Set for General Readability Prediction and Its Application to English and Dutch

O De Clercq, V Hoste - Computational Linguistics, 2016 - direct.mit.edu
Readability research has a long and rich tradition, but there has been too little focus on
general readability prediction without targeting a specific audience or text genre. Moreover …

Normalizing tweets with edit scripts and recurrent neural embeddings

G Chrupała - Proceedings of the 52nd Annual Meeting of the …, 2014 - aclanthology.org
Tweets often contain a large proportion of abbreviations, alternative spellings, novel words
and other non-canonical language. These features are problematic for standard language …

Digitising Swiss German: how to process and study a polycentric spoken language

Y Scherrer, T Samardžić, E Glaser - Language Resources and Evaluation, 2019 - Springer
Swiss dialects of German are, unlike many dialects of other standardised languages, widely
used in everyday communication. Despite this fact, automatic processing of Swiss German is …

Normalising Slovene data: historical texts vs. user-generated content

N Ljubešić, K Zupan, D Fišer, T Erjavec - Proceedings of the 13th …, 2016 - academia.edu
The paper presents two manually annotated Slovene language text normalisation datasets,
one of historical texts and the other of tweets, and proposes several variants of character …

Social media text normalization for Turkish

G ERYİĞİT… - Natural Language …, 2017 - cambridge.org
Text normalization is an indispensable stage in processing noncanonical language from
natural sources, such as speech, social media or short text messages. Research in this field …

Multi-modular domain-tailored OCR post-correction

S Schulz, J Kuhn - Proceedings of the 2017 Conference on …, 2017 - aclanthology.org
One of the main obstacles for many Digital Humanities projects is the low data availability.
Texts have to be digitized in an expensive and time consuming process whereas Optical …

Automatic normalisation of the Swiss German ArchiMob corpus using character-level machine translation

Y Scherrer, N Ljubešić - Proceedings of the 13th conference on …, 2016 - academia.edu
The Swiss German dialect corpus ArchiMob poses great challenges for NLP and
corpus linguistic research due to the massive amount of variation found in the transcriptions …

Graph-based Turkish text normalization and its impact on noisy text processing

S Demir, B Topcu - Engineering Science and Technology, an International …, 2022 - Elsevier
User generated texts on the web are freely-available and lucrative sources of data for
language technology researchers. Unfortunately, these texts are often dominated by …