Description of the problem
The BANATICA collection gathers all printed products considered monographs (brochures, books, yearbooks, calendars in volumes, prints with an individual cover, atlases, book-like printed scores, etc.) that serve as documentation sources for the culture and civilization of the Banat region.
This collection was jointly created by the VI-SEEM partner, Central University Library “Eugen Todoran” Timisoara (BCUT, Romania), and the “Zarko Zrenjanin” public library (Serbia) through the IPA-funded Biblio-Ident project. The collection comprises over 1000 bibliographic descriptions and 200 full-text scanned books. Although the resources had been digitized, the BCUT partner still needed help finding effective channels to expose the collection to a larger scientific audience within the region and transforming it into a machine-processable collection.
To help the BCUT partner address these challenges, the Banatica Virtual Library (BVL) project was set up within VI-SEEM’s DCH community.
Results and future work
The goal of the project is twofold: to make the collection widely available and easily accessible to a larger scientific audience in the region, and to make it machine-processable. Leveraging the DCH platform set up in VI-SEEM, we published the entire collection there, organized into five datasets (Fig. 1): two containing the covers of the publications from the Banatica collection, two each containing 100 full book scans (one for books owned by BCUT and another for books hosted by the Public Library Zarko Zrenjanin), and one additional dataset comprising the books’ tables of contents. During the upload process, metadata was added to the documents to enable machines to understand and process the content.
To make the content machine-readable, we set up an environment that runs optical character recognition (OCR) on the documents of the collection to extract relevant keywords and full-text blocks. This enables further processes to be run against the collection, such as searching, indexing and mining cultural heritage.
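As an illustration of the kind of downstream processing that extracted text makes possible, the following is a minimal sketch of a word-level inverted index with a simple search over it. It is not part of the project's code: the function names `build_index` and `search`, the document identifiers and the sample texts are all hypothetical.

```python
import re
from collections import defaultdict


def build_index(texts):
    """Map each lowercase word to the set of document ids that contain it.

    `texts` is a dict of {document id: OCR text}; both are hypothetical inputs.
    """
    index = defaultdict(set)
    for doc_id, text in texts.items():
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return the ids of the documents that contain every word of the query."""
    words = re.findall(r"\w+", query.lower())
    if not words:
        return set()
    # .get() avoids creating empty entries in the defaultdict for unknown words.
    hits = [index.get(word, set()) for word in words]
    return set.intersection(*hits)


# Made-up OCR output for three documents.
texts = {
    "a": "Timisoara city yearbook",
    "b": "Atlas of the Banat region",
    "c": "Banat yearbook 1910",
}
index = build_index(texts)
matches = search(index, "banat yearbook")
```

A real deployment would more likely rely on an existing search engine; the sketch only shows why plain text is a prerequisite for indexing.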
To improve OCR performance, a pipeline was created to process the scanned publications. In the first stage, noise is removed from the documents and the image is sharpened. The second stage is devoted to the OCR process itself, for which we use an open-source engine. The code and scripts behind this OCR pipeline are openly available in VI-SEEM’s code repository.
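The two-stage structure described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the project's actual scripts: a page is modelled as a grid of grayscale values, noise removal is assumed to be a 3x3 median filter, sharpening is assumed to be a basic convolution kernel, and the OCR engine is passed in as a hypothetical callable because the text does not name the engine used.

```python
from statistics import median
from typing import Callable, List

Image = List[List[int]]  # grayscale pixels in the range 0..255


def _window(img: Image, r: int, c: int) -> List[int]:
    """Return the 3x3 neighbourhood of (r, c), repeating edge pixels at borders."""
    h, w = len(img), len(img[0])
    return [
        img[min(max(r + dr, 0), h - 1)][min(max(c + dc, 0), w - 1)]
        for dr in (-1, 0, 1)
        for dc in (-1, 0, 1)
    ]


def denoise(img: Image) -> Image:
    """Stage 1a (assumed technique): a median filter removes isolated specks."""
    return [
        [int(median(_window(img, r, c))) for c in range(len(img[0]))]
        for r in range(len(img))
    ]


def sharpen(img: Image) -> Image:
    """Stage 1b (assumed technique): kernel with centre 5 and 4-neighbours -1."""
    out = []
    for r in range(len(img)):
        row = []
        for c in range(len(img[0])):
            n = _window(img, r, c)  # indices: 0 1 2 / 3 4 5 / 6 7 8
            value = 5 * n[4] - n[1] - n[3] - n[5] - n[7]
            row.append(min(max(value, 0), 255))  # clamp to the valid range
        out.append(row)
    return out


def run_pipeline(img: Image, ocr_engine: Callable[[Image], str]) -> str:
    """Clean the page first, then hand it to the OCR engine (stage 2)."""
    return ocr_engine(sharpen(denoise(img)))


# A uniform light page with a single dark speck in the middle.
page = [[200] * 5 for _ in range(5)]
page[2][2] = 0
clean = denoise(page)
```

Passing the engine as a parameter keeps the cleaning stages independent of whichever OCR tool is plugged in.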
Running the process is extremely CPU-intensive: the initial benchmark, run against the dataset of 200 digitized prints, took about 4 days on a dedicated VM. To speed up the process (Fig. 2), the pipeline has been replicated on multiple machines: a master distributes the PDF documents to multiple workers in a round-robin fashion, and each worker runs the processing pipeline on its subset.
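The round-robin distribution performed by the master can be sketched as below. The function name `assign_round_robin`, the worker names and the file names are hypothetical; the sketch only shows how documents would be dealt out in turn, not how the project transfers files or starts jobs.

```python
from itertools import cycle
from typing import Dict, List


def assign_round_robin(documents: List[str], workers: List[str]) -> Dict[str, List[str]]:
    """Deal documents out to workers one at a time, in a fixed rotating order."""
    if not workers:
        raise ValueError("at least one worker is required")
    assignment: Dict[str, List[str]] = {worker: [] for worker in workers}
    for document, worker in zip(documents, cycle(workers)):
        assignment[worker].append(document)
    return assignment


# Hypothetical example: five PDFs shared between two workers.
plan = assign_round_robin(
    [f"doc{i}.pdf" for i in range(5)],
    ["worker-1", "worker-2"],
)
```

Under the strong assumption that every document costs roughly the same to process, splitting the 200 prints across four workers would give each 50 documents and bring the ideal running time from about 4 days down to about 1 day. In practice, documents differ in length, so the slowest subset determines the total time.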
The Banatica Virtual Library makes rare and old publications accessible again to a wider scientific audience in the SEE region and enables further machine processing to be applied to the entire collection.