The client's motivation for launching the project: the need to process large volumes of documents exposed two problems. First, most of the available open solutions are too slow. Second, there is no defined set of scenarios in which a solution stops producing acceptable OCR quality on a document.
What we had initially:
Project goals: create a toolkit to determine the best solution and the boundaries of its applicability.
MIL Team's solution: a set of tools was created for testing TD + OCR solutions and for efficiently collecting datasets of documents in a “natural” environment. Using these tools, a team of two people created a dataset of 1000 images in two weeks, with bounding boxes marked around individual words on each page (from this, the man-hours needed for n pages can be estimated). The tools make it possible to flag images on which the solutions show low accuracy and to attribute particular algorithm errors to image parameters (sheet rotation, lighting, shadows, coloured text and its background).
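As an illustration of the last point, here is a minimal sketch of how low-accuracy images might be flagged and how errors might be attributed to image parameters. It is not the team's actual tooling: the `ImageResult` structure, the word-level accuracy measure, the function names, and the 0.8 threshold are all assumptions made for this example.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class ImageResult:
    # Hypothetical record; the real tools' data model is not described here.
    name: str
    expected_words: List[str]      # ground-truth words from the labeled boxes
    recognized_words: List[str]    # words returned by the TD + OCR solution
    conditions: Set[str] = field(default_factory=set)  # e.g. {"rotation", "shadows"}


def word_accuracy(result: ImageResult) -> float:
    """Fraction of expected words found in the OCR output (multiset match)."""
    if not result.expected_words:
        return 1.0
    # Counter intersection keeps, per word, the smaller of the two counts.
    matched = Counter(result.expected_words) & Counter(result.recognized_words)
    return sum(matched.values()) / len(result.expected_words)


def flag_low_accuracy(results: List[ImageResult], threshold: float = 0.8) -> List[str]:
    """Names of images whose word accuracy falls below the (assumed) threshold."""
    return [r.name for r in results if word_accuracy(r) < threshold]


def mean_accuracy_by_condition(results: List[ImageResult]) -> Dict[str, float]:
    """Average word accuracy per image condition, to show which conditions hurt."""
    per_condition = defaultdict(list)
    for r in results:
        acc = word_accuracy(r)
        # Images with no tagged condition are grouped under "none".
        for condition in (r.conditions or {"none"}):
            per_condition[condition].append(acc)
    return {c: sum(v) / len(v) for c, v in per_condition.items()}
```

Comparing the per-condition averages against the "none" group gives a rough indication of which parameters (rotation, lighting, shadows, and so on) mark the boundary of a solution's applicability. A production metric would more likely match words by box position or edit distance than by exact string equality.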