Success Story - en

Comparing how well OCR solutions meet business quality criteria when recognizing scanned documents

Why the client launched the project: the customer needs to process large volumes of documents, and this exposed two problems. Most of the available open source solutions are too slow. In addition, there is no defined set of scenarios in which a solution stops producing acceptable OCR quality on a document.

What we had initially: 

  • a set of open source solutions for the OCR task;
  • a set of documents and presentations in which the text had to be recognized.

Project goals: create a toolkit to determine the best solution and the boundaries of its applicability.

MIL Team's solution: we built a set of tools for testing text detection (TD) + OCR solutions and for efficiently collecting datasets consisting of documents in a “natural” environment. Using these tools, a team of 2 people created, within two weeks, a dataset of 1000 images in which the box of each individual word on the page is marked (from this, the man-hours needed for n pages can be estimated). The tools make it possible to flag images on which a solution shows low accuracy and to attribute specific algorithm errors to image parameters (sheet rotation, lighting, shadows, coloured text and its background).
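As an illustration of what such a testing tool might compute, here is a minimal sketch, not the team's actual implementation. It assumes boxes are `(x1, y1, x2, y2)` tuples, that a predicted word counts as correct when its box overlaps a ground-truth box with IoU of at least 0.5 and the text matches exactly, and that each image carries a hand-assigned set of attribute tags. The names `word_recall` and `report`, the thresholds, and the sample layout are all hypothetical.

```python
from collections import defaultdict


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def word_recall(truth, predicted, iou_threshold=0.5):
    """Fraction of ground-truth words that were detected and read correctly.

    truth, predicted: lists of (box, text) pairs.
    The threshold value is an assumption, not taken from the project.
    """
    if not truth:
        return 1.0
    used = set()  # indices of predictions already matched
    hits = 0
    for t_box, t_text in truth:
        best_j, best_score = None, iou_threshold
        for j, (p_box, _) in enumerate(predicted):
            if j in used:
                continue
            score = iou(t_box, p_box)
            if score >= best_score:
                best_j, best_score = j, score
        if best_j is not None:
            used.add(best_j)
            if predicted[best_j][1] == t_text:
                hits += 1
    return hits / len(truth)


def report(samples, low_accuracy=0.8):
    """Flag low-accuracy images and average the score per image attribute.

    samples: list of dicts with keys 'name', 'attributes', 'truth', 'predicted'.
    """
    flagged = []
    scores_by_attribute = defaultdict(list)
    for sample in samples:
        score = word_recall(sample["truth"], sample["predicted"])
        if score < low_accuracy:
            flagged.append(sample["name"])
        for attribute in sample["attributes"]:
            scores_by_attribute[attribute].append(score)
    mean_by_attribute = {
        attribute: sum(values) / len(values)
        for attribute, values in scores_by_attribute.items()
    }
    return flagged, mean_by_attribute
```

Attributes whose mean score is low (for example a hypothetical "rotation" tag) point to the conditions under which a given solution stops being applicable.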

Tools for building the model:
  • Dataset of electronic documents in PDF format provided by the customer;
  • Open source solutions for the TD + OCR problem (Tesseract, EasyOCR).

The model results: 
  • Testing tools for TD + OCR solutions;
  • Five datasets of different "complexity" from photographs and scans of documents and presentations.
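
Since the stack lists Labelme, the word-level boxes in such datasets could be read roughly as follows. This is a sketch under assumptions: it expects Labelme's JSON layout with a `shapes` list of `rectangle` entries, and it assumes the recognized word is stored in each shape's `label` field, which the source does not confirm. The function name `load_word_boxes` is hypothetical.

```python
import json


def load_word_boxes(labelme_json_text):
    """Return a list of ((x1, y1, x2, y2), text) pairs from a Labelme annotation.

    Assumes each word was annotated as a rectangle whose label holds the word.
    """
    data = json.loads(labelme_json_text)
    boxes = []
    for shape in data.get("shapes", []):
        if shape.get("shape_type") != "rectangle":
            continue  # skip polygons, points and other shape types
        (xa, ya), (xb, yb) = shape["points"]
        # Labelme stores two opposite corners in drawing order; normalize them.
        box = (min(xa, xb), min(ya, yb), max(xa, xb), max(ya, yb))
        boxes.append((box, shape["label"]))
    return boxes
```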

Client: ISP RAS
Technological stack: Python, OpenCV, Labelme
Computer Vision, Research Division, Engineering Division