Client's motivation for launching the project: “smart” devices such as watches, speakers, cameras, and refrigerators are becoming increasingly popular. Resources on such devices are usually limited, so researchers often have to optimize neural network models for specific hardware, simplifying or modifying the network architecture to reduce model size and speed up inference. The restrictions can be more severe, however. For example, the device on which the model runs may support only low-bit-precision arithmetic, in which case the model must be quantized. Quantization is an actively developing area, yet few works address the quantization of transformer-based neural network models. The project calls for a fair method of low-bit quantization of a SOTA transformer-based ASR architecture without significant loss of quality of the final model, using the public LibriSpeech dataset for training and validation.
Project goals: choose a quantization strategy for a transformer-based ASR model; if necessary, also change the network architecture so that the quality of the quantized model (measured by WER) is not much worse than that of the full-precision model (a quality drop of a few percent is acceptable).
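For reference, WER (word error rate) is the word-level edit distance between the reference transcript and the model output, divided by the number of reference words. The sketch below is a minimal per-utterance version; the function name `word_error_rate` is illustrative, it is not the project's evaluation code, and corpus-level WER is usually computed by summing edits and reference words over all utterances.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    # Guard against an empty reference to avoid division by zero.
    return prev[-1] / max(len(ref), 1)
```

Comparing this value for the full-precision and the quantized model on the same validation set gives the quality drop referred to above.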
MIL Team's solution: take an available SOTA implementation of a transformer-based ASR architecture and embed quantization into it. Successful and fair quantization requires three things. First, quantized versions must be implemented for all modules used inside the Torch model. Second, a convenient tool is needed to replace the original Torch modules with their quantized counterparts. Third, a series of experiments should be carried out to select the best quantization strategy.
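The first two steps can be illustrated with a minimal PyTorch sketch. It is an assumption-laden example, not the team's implementation: `fake_quantize`, `QuantLinear`, and `replace_linear` are hypothetical names, and the scheme shown (symmetric per-tensor fake quantization with a straight-through estimator) is only one of many possible strategies.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def fake_quantize(x: torch.Tensor, num_bits: int = 8) -> torch.Tensor:
    """Symmetric per-tensor fake quantization (an assumed scheme)."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = x.detach().abs().max().clamp(min=1e-8) / qmax
    x_q = torch.clamp(torch.round(x / scale), -qmax, qmax) * scale
    # Straight-through estimator: forward uses x_q, backward treats rounding as identity.
    return x + (x_q - x).detach()


class QuantLinear(nn.Linear):
    """Hypothetical drop-in for nn.Linear that quantizes weights and inputs."""

    def __init__(self, in_features, out_features, bias=True, num_bits=8):
        super().__init__(in_features, out_features, bias=bias)
        self.num_bits = num_bits

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w_q = fake_quantize(self.weight, self.num_bits)
        x_q = fake_quantize(x, self.num_bits)
        return F.linear(x_q, w_q, self.bias)


def replace_linear(model: nn.Module, num_bits: int = 8) -> nn.Module:
    """Recursively swap every nn.Linear for a QuantLinear, keeping its weights."""
    for name, child in list(model.named_children()):
        if isinstance(child, nn.Linear) and not isinstance(child, QuantLinear):
            quantized = QuantLinear(child.in_features, child.out_features,
                                    bias=child.bias is not None, num_bits=num_bits)
            quantized.load_state_dict(child.state_dict())
            quantized.to(child.weight.device)
            setattr(model, name, quantized)
        else:
            replace_linear(child, num_bits)
    return model
```

Calling `replace_linear(model, num_bits=8)` on a model converts its linear layers in place. This sketch covers only `nn.Linear`; the other module types used in a transformer (attention, normalization, embeddings, and so on) would each need their own quantized counterpart, which is what makes the first step substantial.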
Tools for building the model:
Client: under NDA
Technology stack: Python (PyTorch, torchaudio, Fairseq, SentencePiece)