Client's motivation for launching the project: the customer needed to process large volumes of documents, and this exposed two shortcomings. Most of the available open-source solutions are too slow, and there is no defined set of scenarios in which a solution stops producing acceptable OCR quality on a document.
What we had initially:
- a set of open source solutions for the OCR task;
- a set of documents and presentations in which the text had to be recognized.
Project goals: create a toolkit for determining the best solution and the boundaries of its applicability.
MIL Team's solution: a set of tools was created for testing text detection (TD) + OCR solutions and for efficiently collecting datasets consisting of documents in a “natural” environment. Using these tools, a team of 2 people created, within two weeks, a dataset of 1000 images with bounding boxes marked around the individual words on each page (from this, the man-hours needed for n pages can be estimated). The tools make it possible to flag images on which the solutions show low accuracy and to attribute specific algorithm errors to image parameters (sheet rotation, lighting, shadows, coloured text and its background).
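The kind of check described above can be illustrated with a minimal sketch. It is not the team's actual tooling: the box format `(x1, y1, x2, y2)`, the sample layout (`name`, `words`, `attributes`), the function names, the IoU threshold of 0.5, and the "low accuracy" cutoff of 0.8 are all assumptions made for illustration. The sketch matches predicted word boxes to annotated ones by intersection over union, scores each image by word recall, flags low-scoring images, and averages the score per image attribute.

```python
from collections import defaultdict


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def word_recall(gt_words, pred_words, iou_threshold=0.5):
    """Fraction of annotated words that have a matching predicted box
    (IoU >= threshold) carrying the same text, ignoring case."""
    if not gt_words:
        return 1.0
    used = set()  # indices of predictions already matched
    hits = 0
    for gt_box, gt_text in gt_words:
        best_j, best_iou = None, iou_threshold
        for j, (pred_box, _) in enumerate(pred_words):
            if j in used:
                continue
            score = iou(gt_box, pred_box)
            if score >= best_iou:
                best_j, best_iou = j, score
        if best_j is not None:
            used.add(best_j)
            if pred_words[best_j][1].strip().lower() == gt_text.strip().lower():
                hits += 1
    return hits / len(gt_words)


def recall_by_attribute(samples, predict, low_threshold=0.8):
    """Return (mean recall per attribute, names of low-recall images).

    samples: list of dicts with hypothetical keys 'name', 'words'
             (list of (box, text) pairs) and 'attributes' (list of str).
    predict: callable that takes a sample and returns (box, text) pairs.
    """
    per_attribute = defaultdict(list)
    flagged = []
    for sample in samples:
        recall = word_recall(sample["words"], predict(sample))
        if recall < low_threshold:
            flagged.append(sample["name"])
        for attribute in sample["attributes"]:
            per_attribute[attribute].append(recall)
    means = {a: sum(v) / len(v) for a, v in per_attribute.items()}
    return means, flagged
```

In practice `predict` would wrap a call to one of the solutions under test, and the attribute labels (for example rotation or shadows) would come from the dataset annotations. Comparing the per-attribute means then suggests which image conditions push a given solution below acceptable quality.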
Tools for building the model:
- Dataset of electronic documents in pdf format provided by the customer;
- Open source solutions for the TD + OCR problem (Tesseract, EasyOCR).
The model results:
- Testing tools for TD + OCR solutions;
- Five datasets of different "complexity" from photographs and scans of documents and presentations.
Client: ISP RAS
Technological stack: Python, OpenCV, Labelme