Comparing how well OCR solutions meet business quality criteria when recognizing scanned documents

Client's motivation for launching the project: the customer needs to process large volumes of documents, and this need exposed two problems. First, most of the available open-source solutions are too slow. Second, there is no defined set of scenarios in which a solution stops producing acceptable OCR quality on a document.

What we had initially:

a set of open-source solutions for the OCR task;
a set of documents and presentations in which the text had to be recognized.

Project goal: to create a toolkit for determining the best solution and the limits of its applicability.

MIL Team's solution: we created a set of tools for testing text detection (TD) + OCR solutions and for efficiently collecting datasets consisting of documents in a “natural” environment. Using these tools, a team of 2 people created a dataset of 1000 images in two weeks, with bounding boxes marked around the individual words on each page (this gives a basis for estimating the man-hours needed for n pages). The tools make it possible to flag images on which the solutions show low accuracy and to attribute particular errors of the algorithms to image parameters (sheet rotation, lighting, shadows, coloured text and its background).
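
As an illustration of the kind of check described above, here is a minimal sketch that matches detected word boxes against annotated ones, flags low-accuracy images, and averages the score per image parameter. Everything in it is an assumption made for the example rather than the project's actual tooling: the box format (x_min, y_min, x_max, y_max), the IoU threshold of 0.5, the recall cut-off of 0.8, and the names iou, word_recall and recall_by_attribute.

```python
from collections import defaultdict


def iou(a, b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def word_recall(gt_boxes, pred_boxes, threshold=0.5):
    """Fraction of annotated word boxes matched by a not-yet-used predicted box."""
    if not gt_boxes:
        return 1.0
    used = set()
    matched = 0
    for gt in gt_boxes:
        best_j, best_score = None, threshold
        for j, pred in enumerate(pred_boxes):
            if j in used:
                continue
            score = iou(gt, pred)
            if score >= best_score:
                best_j, best_score = j, score
        if best_j is not None:
            used.add(best_j)
            matched += 1
    return matched / len(gt_boxes)


def recall_by_attribute(images, min_recall=0.8):
    """Flag low-recall images and average recall per image attribute.

    `images` is a list of dicts with the (hypothetical) keys
    'name', 'gt', 'pred' and 'attributes' (a set of strings).
    """
    flagged = []
    per_attribute = defaultdict(list)
    for image in images:
        recall = word_recall(image["gt"], image["pred"])
        if recall < min_recall:
            flagged.append(image["name"])
        for attribute in image["attributes"]:
            per_attribute[attribute].append(recall)
    means = {name: sum(v) / len(v) for name, v in per_attribute.items()}
    return flagged, means
```

The matching here is greedy, which keeps the sketch short; a stricter evaluation might use optimal one-to-one assignment instead. A low mean for an attribute such as "rotation" or "shadow" only suggests that this parameter is associated with the errors, since attributes can co-occur on the same image.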

Tools for building the model:

A dataset of electronic documents in PDF format provided by the customer;
Open-source solutions for the TD + OCR problem (Tesseract, EasyOCR).
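
Since speed was one of the customer's concerns, a comparison of such solutions needs some way to time them on the same inputs. The sketch below shows one possible shape for that. It treats each engine as a plain callable that takes an image and returns text. The function name benchmark and the engine names are hypothetical, and the stand-in engines in the usage example would in practice be replaced by thin wrappers around the real libraries.

```python
import time


def benchmark(engines, images, repeats=1):
    """Return {engine_name: average seconds per image} for each callable in `engines`."""
    if not images:
        raise ValueError("need at least one image")
    results = {}
    for name, recognize in engines.items():
        start = time.perf_counter()
        for _ in range(repeats):
            for image in images:
                recognize(image)
        elapsed = time.perf_counter() - start
        results[name] = elapsed / (repeats * len(images))
    return results


if __name__ == "__main__":
    # Stand-in engines (assumptions for the demo, not real OCR calls).
    def fast_engine(image):
        return "text"

    def slow_engine(image):
        time.sleep(0.01)
        return "text"

    timings = benchmark({"fast": fast_engine, "slow": slow_engine},
                        images=[b"page-1", b"page-2"], repeats=3)
    for name, seconds in timings.items():
        print(f"{name}: {seconds * 1000:.2f} ms per image")
```

Wall-clock timing of this kind includes model warm-up and I/O, so for a fair comparison it is usually worth running each engine once before measuring.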

The model results:

Testing tools for TD + OCR solutions;
Five datasets of different "complexity" from photographs and scans of documents and presentations.
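
To compare recognition quality across datasets of different "complexity", a testing tool needs a text-level metric. The source does not say which metric was used; the sketch below assumes character error rate (CER), computed as Levenshtein distance divided by the reference length and pooled per dataset. The function names and the (dataset, reference, recognized) sample format are illustrative assumptions.

```python
def edit_distance(a, b):
    """Levenshtein distance: insertions, deletions and substitutions each cost 1."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            curr.append(min(prev[j] + 1,          # delete char_a
                            curr[j - 1] + 1,      # insert char_b
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate of `hypothesis` against `reference`."""
    if not reference:
        return 0.0 if not hypothesis else 1.0
    return edit_distance(reference, hypothesis) / len(reference)


def cer_per_dataset(samples):
    """Pool errors per dataset.

    `samples` is an iterable of (dataset_name, reference_text, recognized_text).
    """
    errors, lengths = {}, {}
    for dataset, reference, hypothesis in samples:
        errors[dataset] = errors.get(dataset, 0) + edit_distance(reference, hypothesis)
        lengths[dataset] = lengths.get(dataset, 0) + len(reference)
    return {d: (errors[d] / lengths[d] if lengths[d] else 0.0) for d in errors}
```

Pooling errors over all characters in a dataset, rather than averaging per-image CER, keeps very short texts from dominating the score. Either choice is defensible, and which one fits depends on the business criterion being checked.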

Client: ISP RAS
Technology stack: Python, OpenCV, Labelme