Evaluating the Impact of OCR Errors on Topic Modeling - Laboratoire Informatique, Image et Interaction
Book chapter, Year: 2018

Evaluating the Impact of OCR Errors on Topic Modeling

Stephen Mutuvi
  • Role: Author
  • PersonId: 1071922
Antoine Doucet
Moses Odeo
Adam Jatowt

Abstract

Historical documents pose a challenge for character recognition for several reasons: font disparities across different materials, a lack of orthographic standards (the same words are spelled differently), poor material quality, and the unavailability of lexicons of known historical spelling variants. As a result, optical character recognition (OCR) of those documents often yields unsatisfactory accuracy, rendering the digital material only partially discoverable and the data it holds difficult to process. In this paper, we explore the impact of OCR errors on the identification of topics from a corpus comprising text from historical OCRed documents. Based on experiments performed on OCR text corpora, we observe that OCR noise negatively impacts the stability and coherence of topics generated by topic modeling algorithms, and we quantify the strength of this impact.
Main file
Mutuvi2018_Chapter_EvaluatingTheImpactOfOCRErrors(1).pdf (344.9 KB)
Origin: Explicit agreement for this deposit

Dates and versions

hal-03025563 , version 1 (15-12-2020)

Cite

Stephen Mutuvi, Antoine Doucet, Moses Odeo, Adam Jatowt. Evaluating the Impact of OCR Errors on Topic Modeling. Maturity and Innovation in Digital Libraries. 20th International Conference on Asia-Pacific Digital Libraries, ICADL 2018, Hamilton, New Zealand, November 19-22, 2018, Proceedings, pp. 3-14, 2018, ⟨10.1007/978-3-030-04257-8_1⟩. ⟨hal-03025563⟩

Collections

L3I UNIV-ROCHELLE
59 Views
383 Downloads
