Book sections

Logical Structure Extraction from Digitized Books

Abstract: Mass digitization projects, such as the Million Book Project, the efforts of the Open Content Alliance, and the digitization work of Google, are converting whole libraries by digitizing books on an industrial scale [5]. The process involves efficiently photographing books, page by page, and converting the image of each page into searchable text using optical character recognition (OCR) software. Current digitization and OCR technologies typically produce the full text of digitized books with only minimal structure information. Pages and paragraphs are usually identified and marked up in the OCR output, but more sophisticated structures, such as chapters and sections, are not recognized. To enable systems to provide users with richer browsing experiences, such additional structures must be made available, for example as XML markup embedded in the full text of the digitized books. The Book Structure Extraction competition aims to address this need by promoting research into automatic structure recognition and extraction techniques that could complement or enhance current OCR methods.
Contributor : Antoine Doucet
Submitted on : Tuesday, December 15, 2020 - 2:04:39 PM
Last modification on : Thursday, May 12, 2022 - 3:37:13 PM
Long-term archiving on : Tuesday, March 16, 2021 - 6:04:03 PM


Antoine Doucet. Logical Structure Extraction from Digitized Books. Document Analysis and Text Recognition: Benchmarking State-of-the-Art Systems, pp.3-28, 2018, ⟨10.1142/9789813229273_0001⟩. ⟨hal-03025598⟩


