Other research product, 2021

A web platform for collaborative semi-automatic OCR post-processing

Mechaca C., Ana L.; Marmanillo, Walter G.; Xamena, Eduardo; Ramirez-Orta, Juan; Maguitman, Ana Gabriela; Milios, Evangelos E.
Open Access
English
Published: 01 Jan 2021
Country: Argentina
Abstract

Digital Humanities researchers often rely on software to find non-trivial relationships among characters in historical texts. The source texts usually come from volumes digitized with OCR, which contain many recognition errors. This work describes the development of a web platform for OCR post-processing and ground-truth generation. The platform uses machine learning to accurately predict the correct text from noisy OCR strings; the underlying method is a character-based denoising language model built on transformers. An active learning workflow is proposed: users feed their corrections back to the platform, generating new annotated data for re-training the underlying correction models.
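
The feedback loop described in the abstract can be illustrated with a minimal sketch. It is not the platform's implementation: the class `CorrectionStore`, its methods, and the `retrain_threshold` parameter are hypothetical names chosen for illustration, and the actual transformer model and training job are left out. The sketch only shows how user corrections could accumulate as (noisy OCR text, corrected text) pairs until there is enough new data to re-train a correction model.

```python
from dataclasses import dataclass, field


@dataclass
class CorrectionStore:
    """Buffers user corrections as (noisy OCR text, corrected text) pairs.

    Hypothetical helper for illustration; not part of the described platform.
    """
    retrain_threshold: int = 100  # assumed number of new pairs that triggers re-training
    pairs: list = field(default_factory=list)

    def add(self, ocr_text: str, user_text: str) -> bool:
        """Record one reviewed line; return True once enough pairs are buffered."""
        # The user's final text is treated as ground truth for the noisy input.
        self.pairs.append((ocr_text, user_text))
        return len(self.pairs) >= self.retrain_threshold

    def drain(self) -> list:
        """Return the buffered pairs for a training job and empty the buffer."""
        batch, self.pairs = self.pairs, []
        return batch


# Usage: two corrections arrive, and the second one reaches the threshold.
store = CorrectionStore(retrain_threshold=2)
store.add("Tbe qnick brown fox", "The quick brown fox")
if store.add("jumps ovcr the 1azy dog", "jumps over the lazy dog"):
    training_batch = store.drain()  # would be passed to a re-training step
```

Keeping the raw OCR string next to the corrected one means the same pairs can serve both as ground truth and as training data for a denoising model.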

Sociedad Argentina de Informática e Investigación Operativa

Subjects

Ciencias Informáticas, OCR Post-processing, Digital Humanities, Language Models

Related to Research communities
Digital Humanities and Cultural Heritage