research product . Other ORP type . 2020

Fusus: a workflow to transform Arabic classical works in printed form to structured text

Roorda, Dirk; van Lit, Cornelis;
Restricted English
  • Published: 07 Dec 2020
  • Publisher: Zenodo
  • Country: Netherlands
Abstract
Fusus is a workflow that transforms scanned pages into readable text. The pages come from several Arabic books printed over the past few centuries. The workflow handles cleaning, OCR, and postprocessing. A user can copy and paste image fragments of specks and symbols that must be removed before OCR is run. The workflow detects column layout and line boundaries, then passes individual lines to the OCR engine, Kraken, which uses a model trained on many printed Arabic books. See [model](https://among.github.io/fusus/about/model.html). The result is stored in tab-separated files, with the transcription computed by the OCR step, plus position and con...
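To illustrate what "detecting line boundaries" can mean in practice, here is a minimal sketch of one common technique, a horizontal projection profile: rows of a binarized page that contain ink are grouped into bands, and each band is a candidate text line. This is an assumption for illustration only and is not taken from the Fusus code base; the function name `find_line_bands`, the `min_ink` parameter, and the toy page are all hypothetical.

```python
def find_line_bands(binary_rows, min_ink=1):
    """Group consecutive inked rows into (top, bottom) bands.

    binary_rows: list of rows, each a list of 0 (background) or 1 (ink).
    min_ink:     hypothetical threshold; a row counts as text when it
                 holds at least this many ink pixels.
    Returns a list of (top, bottom) row indices, bottom exclusive.
    """
    bands = []
    start = None  # top row of the band currently being tracked, if any
    for y, row in enumerate(binary_rows):
        has_ink = sum(row) >= min_ink
        if has_ink and start is None:
            start = y                  # a new band begins here
        elif not has_ink and start is not None:
            bands.append((start, y))   # the band ended on the previous row
            start = None
    if start is not None:
        # The last band runs to the bottom edge of the page.
        bands.append((start, len(binary_rows)))
    return bands


# Toy 6x4 "page" with two text lines separated by a blank row.
page = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 0, 0],
]
print(find_line_bands(page))
```

Each resulting band could then be cropped from the page image and handed to an OCR engine line by line. Real scans usually need a higher `min_ink` threshold and prior noise removal, since stray specks would otherwise bridge or split bands.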
Subjects
ACM Computing Classification System: ComputingMethodologies_DOCUMENTANDTEXTPROCESSING
free text keywords: arabic, ocr, workflow, text-processing, image-processing, python, kraken, opencv, text-fabric, digital humanities, wisdom
Communities
Social Science and Humanities
Digital Humanities and Cultural Heritage
Download from
KNAW Repository
Provider: NARCIS