Book sections

Evaluating the Impact of OCR Errors on Topic Modeling

Abstract : Historical documents pose a challenge for character recognition for several reasons: font disparities across different materials, a lack of orthographic standards (the same word may be spelled in different ways), material quality, and the unavailability of lexicons of known historical spelling variants. As a result, optical character recognition (OCR) of those documents often yields unsatisfactory accuracy, rendering digital material only partially discoverable and the data it holds difficult to process. In this paper, we explore the impact of OCR errors on the identification of topics from a corpus comprising text from historical OCRed documents. Based on experiments performed on OCR text corpora, we observe that OCR noise negatively impacts the stability and coherence of topics generated by topic modeling algorithms, and we quantify the strength of this impact.
Contributor : Antoine Doucet
Submitted on : Tuesday, December 15, 2020 - 2:01:38 PM
Last modification on : Thursday, May 12, 2022 - 3:35:23 PM
Long-term archiving on : Tuesday, March 16, 2021 - 6:03:42 PM






Stephen Mutuvi, Antoine Doucet, Moses Odeo, Adam Jatowt. Evaluating the Impact of OCR Errors on Topic Modeling. Maturity and Innovation in Digital Libraries. 20th International Conference on Asia-Pacific Digital Libraries, ICADL 2018, Hamilton, New Zealand, November 19-22, 2018, Proceedings, pp. 3-14, 2018, ⟨10.1007/978-3-030-04257-8_1⟩. ⟨hal-03025563⟩


