Success Story

Quantization of ASR Models

The client's motivation for launching the project: "smart" devices are becoming more and more popular: watches, speakers, cameras, refrigerators. Resources on such devices are usually limited, so researchers often face the task of optimizing neural network models for specific hardware, simplifying or modifying the network architecture to reduce the model size and speed up inference. The constraints can be stricter still: for example, the hardware where the model runs may support computations only in low-bit precision, in which case the model has to be quantized. Quantization is an actively developing area, yet relatively few works are devoted to quantizing transformer-based neural network models. The project required proposing an honest method of low-bit quantization for a SOTA ASR transformer architecture without significant loss of quality of the final model on LibriSpeech, the public dataset for training and validating ASR models.


What we had initially: 
  • ASR transformer model architectures and their implementations;  
  • Methods for quantizing neural network models (not necessarily transformers);  
  • Quantizing transformers is a relatively poorly explored area;  
  • Researchers sometimes use dishonest quantization; controlling the honesty of quantization is a separate issue requiring attention (see the sketch after this list);  
  • Naive, "head-on" quantization greatly degrades the quality of the neural network;  
  • Some modules in the model are more sensitive to quantization than others (e.g., the Embedding layer, SoftMax).
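
To make the notion of honest quantization concrete, here is a minimal sketch of symmetric uniform quantization with an honesty check, assuming per-tensor scaling; the helper names are illustrative, not the project's actual code:

    import torch

    def quantize(x: torch.Tensor, bits: int = 8):
        # Symmetric uniform quantization: map values onto the integer grid
        # [-qmax, qmax] with a single per-tensor scale.
        qmax = 2 ** (bits - 1) - 1
        scale = x.abs().max().clamp(min=1e-8) / qmax
        q = torch.clamp(torch.round(x / scale), -qmax, qmax)
        return q, scale  # integer codes (stored as floats) and the scale

    def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
        return q * scale

    def is_honest(q: torch.Tensor, bits: int = 8) -> bool:
        # Honesty check: every code must be an integer that fits the b-bit
        # grid; "dishonest" pipelines keep off-grid values while reporting
        # low-bit precision.
        qmax = 2 ** (bits - 1) - 1
        return bool(torch.all(q == q.round()) and q.abs().max() <= qmax)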

Project goals: choose a quantization strategy for the transformer-based ASR model; if necessary, also modify the network architecture so that the quality of the quantized model, measured by WER (word error rate; see the sketch below), is not much worse than that of the full-precision model (a quality drop of a few percent is allowed).
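
For reference, WER counts the word substitutions, deletions, and insertions needed to turn the recognized transcript into the reference one, divided by the reference length. A minimal sketch (a hypothetical helper, not project code):

    def word_error_rate(ref: str, hyp: str) -> float:
        # WER = (substitutions + deletions + insertions) / reference length,
        # computed as word-level Levenshtein distance.
        r, h = ref.split(), hyp.split()
        d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
        for i in range(len(r) + 1):
            d[i][0] = i
        for j in range(len(h) + 1):
            d[0][j] = j
        for i in range(1, len(r) + 1):
            for j in range(1, len(h) + 1):
                cost = 0 if r[i - 1] == h[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,         # deletion
                              d[i][j - 1] + 1,         # insertion
                              d[i - 1][j - 1] + cost)  # substitution
        return d[len(r)][len(h)] / max(len(r), 1)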


MIL Team's solution: take an available SOTA implementation of a transformer-based ASR model architecture and embed quantization into it. Successful and fair quantization requires, first, implementing quantized versions of all modules used inside the Torch model; second, a convenient tool for replacing the original Torch modules with their quantized counterparts; and third, a series of experiments to select the best quantization strategy.
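
A minimal sketch of the module-replacement step, reusing the quantize/dequantize helpers sketched above and covering only nn.Linear for brevity; the class and function names are hypothetical assumptions, not MIL Team's actual tool:

    import torch.nn as nn

    class QuantizedLinear(nn.Linear):
        # Drop-in replacement that fake-quantizes weights on the forward pass.
        def forward(self, x):
            q, scale = quantize(self.weight, bits=8)
            return nn.functional.linear(x, dequantize(q, scale), self.bias)

    def swap_modules(model: nn.Module) -> nn.Module:
        # Recursively replace every nn.Linear with its quantized counterpart,
        # keeping the trained parameters.
        for name, child in model.named_children():
            if isinstance(child, nn.Linear):
                q_layer = QuantizedLinear(child.in_features, child.out_features,
                                          bias=child.bias is not None)
                q_layer.load_state_dict(child.state_dict())
                setattr(model, name, q_layer)
            else:
                swap_modules(child)
        return model

In practice the same pattern extends to the more sensitive modules mentioned above (embeddings, SoftMax), which is exactly where the choice of quantization strategy matters most.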


Model results: under NDA

Client: under NDA

Technological stack: Python (PyTorch, torchaudio, Fairseq, SentencePiece)

Fundamental AI Research Division