Evaluating and mitigating the impact of OCR errors on information retrieval

LL de Oliveira, DS Vargas, AMA Alexandre… - International Journal on …, 2023 - Springer
Optical character recognition (OCR) is typically used to extract the textual contents of
scanned texts. The output of OCR can be noisy, especially when the quality of the scanned …

Advancing post-OCR correction: A comparative study of synthetic data

S Guan, D Greene - arXiv preprint arXiv:2408.02253, 2024 - arxiv.org
This paper explores the application of synthetic data in the post-OCR domain on multiple
fronts by conducting experiments to assess the impact of data volume, augmentation, and …

Leveraging open large language models for historical named entity recognition

CE González-Gallardo, HTH Tran, A Hamdi… - … Conference on Theory …, 2024 - Springer
The efficacy of large-scale language models (LLMs) as few-shot learners has dominated the
field of natural language processing, achieving state-of-the-art performance in most tasks …

Injecting temporal-aware knowledge in historical named entity recognition

CE González-Gallardo, E Boros, E Giamphy… - … on Information Retrieval, 2023 - Springer
In this paper, we address the detection of named entities in multilingual historical collections.
We argue that, besides the multiple challenges that depend on the quality of digitization (e.g. …

Archive timeline summarization (ATLS): conceptual framework for timeline generation over historical document collections

N Gutehrlé, A Doucet, A Jatowt - arXiv preprint arXiv:2301.13479, 2023 - arxiv.org
Archive collections are nowadays mostly available through search engine interfaces, which
allow a user to retrieve documents by issuing queries. The study of these collections may be …

Confidence-Aware Document OCR Error Detection

A Hemmer, M Coustaty, N Bartolo, JM Ogier - International Workshop on …, 2024 - Springer
Optical Character Recognition (OCR) continues to face accuracy challenges that
impact subsequent applications. To address these errors, we explore the utility of OCR …

The digitization of historical astrophysical literature with highly localized figures and figure captions

JP Naiman, PKG Williams, A Goodman - International Journal on Digital …, 2024 - Springer
Scientific articles published prior to the “age of digitization” in the late 1990s contain figures
which are “trapped” within their scanned pages. While progress to extract figures and their …

Exploring the capabilities of GPT4-Vision as OCR engine

A Ghiriti, W Göderle, R Kern - … Conference on Theory and Practice of …, 2024 - Springer
Many museums and libraries conducted efforts to digitize their assets, and many historic
documents are now available as digital images. However, these documents are not directly …

Improving OCR Quality in 19th Century Historical Documents Using a Combined Machine Learning Based Approach

D Fleischhacker, W Goederle, R Kern - arXiv preprint arXiv:2401.07787, 2024 - arxiv.org
This paper addresses a major challenge to historical research on the 19th century. Large
quantities of sources have become digitally available for the first time, while extraction …

Large Synthetic Data from the arXiv for OCR Post Correction of Historic Scientific Articles

JP Naiman, MG Cosillo, PKG Williams… - arXiv preprint arXiv …, 2023 - arxiv.org
Scientific articles published prior to the "age of digitization" (~1997) require Optical
Character Recognition (OCR) to transform scanned documents into machine-readable text …