Recognition of complex borderless tables in scanned documents

Client's motivation for the project: a number of documents containing tables had to be converted from images into electronic form.

What we knew at the start:

The main difficulties in table recognition are restoring the table's original structure and handling invisible cell borders.
Classical CV methods perform poorly on tables with invisible borders.

Project goal: create a module that converts a table image into a structured format.

MIL Team's solution: a table cell segmentation model followed by post-processing.
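One possible post-processing step can be sketched as follows. This is a minimal illustration, not the project's actual code. It assumes the bounding boxes of the segmented cells have already been extracted from the model's mask (for example with OpenCV connected components), and it assigns each box a row and column index by clustering the top-left corners. The `CellBox` type, the helper names, and the `tol` pixel tolerance are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CellBox:
    """Axis-aligned bounding box of one segmented cell (hypothetical type)."""
    x0: int
    y0: int
    x1: int
    y1: int
    text: str = ""


def cluster_positions(values, tol):
    """Sort 1-D coordinates and merge neighbours whose gap is at most `tol`.

    Returns the mean coordinate of each cluster, in ascending order.
    """
    clusters = []
    for v in sorted(values):
        if clusters and v - clusters[-1][-1] <= tol:
            clusters[-1].append(v)
        else:
            clusters.append([v])
    return [sum(c) / len(c) for c in clusters]


def nearest_index(centers, value):
    """Index of the cluster center closest to `value`."""
    return min(range(len(centers)), key=lambda i: abs(centers[i] - value))


def assign_grid(boxes, tol=10):
    """Map each box to a (row, col) index using its top-left corner.

    `tol` is an assumed pixel tolerance; a real pipeline would tune it
    to the scan resolution.
    """
    row_centers = cluster_positions([b.y0 for b in boxes], tol)
    col_centers = cluster_positions([b.x0 for b in boxes], tol)
    grid = {}
    for b in boxes:
        r = nearest_index(row_centers, b.y0)
        c = nearest_index(col_centers, b.x0)
        grid[(r, c)] = b
    return grid


# Four slightly misaligned boxes, as a segmentation mask might produce.
boxes = [
    CellBox(12, 8, 90, 30, "Name"),
    CellBox(110, 10, 200, 31, "Score"),
    CellBox(10, 48, 92, 70, "Alice"),
    CellBox(111, 50, 198, 69, "0.93"),
]
grid = assign_grid(boxes)
```

Clustering corner coordinates tolerates the small misalignments typical of scans, but it does not handle cells that span several rows or columns; those would need an extra step that compares each box's extent with the recovered row and column positions.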

Tools for building the model:

PubTabNet dataset
Yandex Toloka crowdsourcing platform

Results: a model based on the U-Net architecture and an accompanying environment that converts a scanned table into HTML.
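To illustrate the final conversion to HTML, here is a small sketch that renders an already reconstructed table as markup. It is not the project's code: the `grid_to_html` function and the row format, a list of `(text, rowspan, colspan)` tuples, are assumptions made for the example.

```python
from html import escape


def grid_to_html(rows):
    """Render rows of (text, rowspan, colspan) tuples as an HTML table.

    The input format is hypothetical; span attributes are emitted only
    when a cell covers more than one row or column.
    """
    lines = ["<table>"]
    for row in rows:
        lines.append("  <tr>")
        for text, rowspan, colspan in row:
            attrs = ""
            if rowspan > 1:
                attrs += f' rowspan="{rowspan}"'
            if colspan > 1:
                attrs += f' colspan="{colspan}"'
            # Escape cell text so characters like "<" stay valid HTML.
            lines.append(f"    <td{attrs}>{escape(text)}</td>")
        lines.append("  </tr>")
    lines.append("</table>")
    return "\n".join(lines)


rows = [
    [("Results", 1, 2)],                      # header spanning two columns
    [("Precision", 1, 1), ("Recall", 1, 1)],
    [("0.91", 1, 1), ("<0.90", 1, 1)],
]
html_table = grid_to_html(rows)
print(html_table)
```

Emitting `rowspan` and `colspan` explicitly is what lets complex tables with merged cells keep their structure in the output.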

Client: ISP RAS

Technology stack: Python, PyTorch, OpenCV