Survey of post-OCR processing approaches

TTH Nguyen, A Jatowt, M Coustaty… - ACM Computing Surveys …, 2021 - dl.acm.org
Optical character recognition (OCR) is one of the most popular techniques used for
converting printed documents into machine-readable ones. While OCR engines can do well …

An OCR post-correction approach using deep learning for processing medical reports

S Karthikeyan, AGS de Herrera… - IEEE Transactions on …, 2021 - ieeexplore.ieee.org
According to a recent Deloitte study, the COVID-19 pandemic continues to place a huge
strain on the global health care sector. COVID-19 has also catalysed digital transformation …

Advancing post-OCR correction: A comparative study of synthetic data

S Guan, D Greene - arXiv preprint arXiv:2408.02253, 2024 - arxiv.org
This paper explores the application of synthetic data in the post-OCR domain on multiple
fronts by conducting experiments to assess the impact of data volume, augmentation, and …

A hybrid solution for extracting information from unstructured data using optical character recognition (OCR) with natural language processing (NLP)

B Dash - ResearchGate, 2021 - researchgate.net
With rapid digitalization, organizations are producing a lot of data as part of their day-to-day
operations. These data are stored either on their legacy platforms or in any cloud storage …

Correcting Arabic soft spelling mistakes using BiLSTM-based machine learning

GA Abandah, A Suyyagh, MZ Khedher - arXiv preprint arXiv:2108.01141, 2021 - arxiv.org
Soft spelling errors are a class of spelling mistakes that is widespread among native Arabic
speakers and foreign learners alike. Some of these errors are typographical in nature. They …

Post-OCR Text Correction for Bulgarian Historical Documents

A Beshirov, M Dobreva, D Dimitrov, M Hardalov… - arXiv preprint arXiv …, 2024 - arxiv.org
The digitization of historical documents is crucial for preserving the cultural heritage of the
society. An important step in this process is converting scanned images to text using Optical …

Toward a period-specific optimized neural network for OCR error correction of historical Hebrew texts

O Suissa, M Zhitomirsky-Geffet… - ACM Journal on …, 2022 - dl.acm.org
Over the past few decades, large archives of paper-based historical documents, such as
books and newspapers, have been digitized using the Optical Character Recognition (OCR) …

A concise survey of OCR for low-resource languages

M Agarwal, A Anastasopoulos - … of the 4th Workshop on Natural …, 2024 - aclanthology.org
Modern natural language processing (NLP) techniques increasingly require substantial
amounts of data to train robust algorithms. Building such technologies for low-resource …

Synthetically Augmented Self-Supervised Fine-Tuning for Diverse Text OCR Correction

S Guan, D Greene - ECAI 2024, 2024 - ebooks.iospress.nl
The adoption of Optical Character Recognition (OCR) tools has been central to the
increased digitization of historical documents. However, the errors introduced during OCR …

Leveraging text repetitions and denoising autoencoders in OCR post-correction

K Hakala, A Vesanto, N Miekka, T Salakoski… - arXiv preprint arXiv …, 2019 - arxiv.org
A common approach for improving OCR quality is a post-processing step based on models
correcting misdetected characters and tokens. These models are typically trained on aligned …