Evaluating and mitigating the impact of OCR errors on information retrieval
LL de Oliveira, DS Vargas, AMA Alexandre… - International Journal on …, 2023 - Springer
Optical character recognition (OCR) is typically used to extract the textual contents of
scanned texts. The output of OCR can be noisy, especially when the quality of the scanned …
Advancing post-OCR correction: A comparative study of synthetic data
This paper explores the application of synthetic data in the post-OCR domain on multiple
fronts by conducting experiments to assess the impact of data volume, augmentation, and …
Leveraging open large language models for historical named entity recognition
The efficacy of large-scale language models (LLMs) as few-shot learners has dominated the
field of natural language processing, achieving state-of-the-art performance in most tasks …
Injecting temporal-aware knowledge in historical named entity recognition
CE González-Gallardo, E Boros, E Giamphy… - … on Information Retrieval, 2023 - Springer
In this paper, we address the detection of named entities in multilingual historical collections.
We argue that, besides the multiple challenges that depend on the quality of digitization (e.g. …
Archive timeline summarization (ATLS): conceptual framework for timeline generation over historical document collections
Archive collections are nowadays mostly available through search engines interfaces, which
allow a user to retrieve documents by issuing queries. The study of these collections may be …
Confidence-Aware Document OCR Error Detection
Optical Character Recognition (OCR) continues to face accuracy challenges that
impact subsequent applications. To address these errors, we explore the utility of OCR …
The digitization of historical astrophysical literature with highly localized figures and figure captions
Scientific articles published prior to the “age of digitization” in the late 1990s contain figures
which are “trapped” within their scanned pages. While progress to extract figures and their …
Exploring the capabilities of GPT4-Vision as OCR engine
Many museums and libraries conducted efforts to digitize their assets, and many historic
documents are now available as digital images. However, these documents are not directly …
Improving OCR Quality in 19th Century Historical Documents Using a Combined Machine Learning Based Approach
This paper addresses a major challenge to historical research on the 19th century. Large
quantities of sources have become digitally available for the first time, while extraction …
Large Synthetic Data from the arXiv for OCR Post Correction of Historic Scientific Articles
JP Naiman, MG Cosillo, PKG Williams… - arXiv preprint arXiv …, 2023 - arxiv.org
Scientific articles published prior to the "age of digitization" (~1997) require Optical
Character Recognition (OCR) to transform scanned documents into machine-readable text …