Multilingual topic model for article search

Client's motivation for launching the project: the client wanted to extend its product with the ability to find translations of a scientific article across the most common languages.

What we had initially: Antiplagiat had no functionality for finding translations of scientific articles, so the task was to add it.

Project goals: to build a topic model that solves two problems, namely semantic search for translations of scientific articles and classification of scientific articles by scientific heading.
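
To illustrate the first goal, here is a minimal sketch of how translation search could work once articles in different languages are mapped to topic vectors in a shared space. It is not the project's actual implementation: the topic vectors, document identifiers and function names below are hypothetical, and cosine similarity is assumed as the ranking measure.

```python
import math


def cosine(u, v):
    """Cosine similarity between two topic vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an all-zero vector is treated as similar to nothing
    return dot / (norm_u * norm_v)


def find_translations(query_vec, index, top_k=1):
    """Rank indexed documents by topic similarity to the query document."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical topic distributions over three topics (made-up numbers).
index = {
    "en_physics": [0.80, 0.15, 0.05],
    "en_biology": [0.10, 0.85, 0.05],
    "en_economics": [0.05, 0.10, 0.85],
}

# Hypothetical topic vector of a physics article written in another language.
query_ru_physics = [0.75, 0.20, 0.05]

best = find_translations(query_ru_physics, index, top_k=1)
print(best[0][0])
```

Under the same assumption, the second goal could reuse these vectors: an article would be assigned the heading whose representative topic vector is closest to its own.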

MIL Team's solution: the team's experience in topic modelling and microservice architecture made it possible to build a service that searches for translations of scientific articles and determines their scientific headings, and that can be deployed in a virtual machine.

Tools for building the model:

A parallel corpus of scientific articles from the library website;
A parallel corpus of Wikipedia articles in 100 languages;
Affiliation tags of scientific headings from different rubricators (UDC, OECD).

Project results:

A topic model of scientific headings;
A virtual machine on which the model can run.

Client: Antiplagiat
Technology stack: gRPC, Python, scikit-learn, BigARTM