OCR4all—An open-source tool providing a (semi-)automatic OCR workflow for historical printings

C Reul, D Christ, A Hartelt, N Balbach, M Wehner… - Applied Sciences, 2019 - mdpi.com
Optical Character Recognition (OCR) on historical printings is a challenging task mainly due
to the complexity of the layout and the highly variant typography. Nevertheless, in the last …

Towards realistic practices in low-resource natural language processing: The development set

K Kann, K Cho, SR Bowman - arXiv preprint arXiv:1909.01522, 2019 - arxiv.org
Development sets are impractical to obtain for real low-resource languages, since using all
available data for training is often more effective. However, development sets are widely …

Ground Truth for training OCR engines on historical documents in German Fraktur and Early Modern Latin

U Springmann, C Reul, S Dipper, J Baiter - arXiv preprint arXiv …, 2018 - arxiv.org
In this paper we describe a dataset of German and Latin ground truth (GT) for
historical OCR in the form of printed text line images paired with their transcription. This …

[BOOK][B] Multilayer corpus studies

A Zeldes - 2018 - taylorfrancis.com
This volume explores the opportunities afforded by the construction and evaluation of
multilayer corpora, an emerging methodology within corpus linguistics that brings about …

[PDF][PDF] Normalization of historical texts with neural network models

M Bollmann - 2018 - hss-opus.ub.ruhr-uni-bochum.de
With the increasing availability of digitized resources of historical documents, interest in
effective natural language processing (NLP) for these documents is on the rise. However …

Corpus annotation

J Newman, C Cox - A practical handbook of corpus linguistics, 2021 - Springer
In this chapter, we provide an overview of the main concepts relating to corpus annotation,
along with some discussion of the practical aspects of creating annotated texts and working …

Summarising historical text in modern languages

X Peng, Y Zheng, C Lin, A Siddharthan - arXiv preprint arXiv:2101.10759, 2021 - arxiv.org
We introduce the task of historical text summarisation, where documents in historical forms
of a language are summarised in the corresponding modern language. This is a …

ANNIS: A graph-based query system for deeply annotated text corpora

T Krause - 2019 - edoc.hu-berlin.de
This dissertation describes the design and implementation of an efficient
search system for linguistic corpora. The existing one, built on a relational …

Multi-task learning for historical text normalization: Size matters

M Bollmann, A Søgaard, J Bingel - Proceedings of the Workshop …, 2018 - aclanthology.org
Historical text normalization suffers from small datasets that exhibit high variance, and
previous work has shown that multi-task learning can be used to leverage data from related …

Abschnittsweise Analyse sprachlicher Flüssigkeit in der Lernersprache: Das Ganze ist weniger informativ als seine Teile

M Belz, C Odebrecht - Zeitschrift für germanistische Linguistik, 2022 - degruyter.com
In this corpus-based study we explore three measurements of L2 fluency (articulation rate,
filler particles, and pauses), both within and between two registers of spontaneous …