
Low-Bit Quantization of Transformer for Audio Speech Recognition

Journal: Advances in Neural Computation, Machine Learning, and Cognitive Research VI

Authors: Ilia Zharikov, Gleb Odinokikh, Ivan Krivorotov, Vasily Alexeev, Alexander Alexeev

Abstract: Automatic speech recognition (ASR) is a challenging deep learning problem, and transformer architectures have brought immense improvements in performance on this task. However, transformer-based models are computationally expensive and comparatively large, which makes them hard to deploy on memory-constrained devices. Quantization is one of the most promising approaches to reducing a neural network's size and latency. In this paper, we focus on optimizing an ASR transformer model by applying quantization and knowledge distillation. We apply state-of-the-art quantization methods to the baseline ASR model and examine the sensitive layers that contribute most to the performance drop. We propose improvements that accelerate the convergence of the quantization methods and enhance the quality of the quantized representations. Our modified 2-bit model shows less than a 1% drop in WER compared to the float model on the LibriSpeech dataset.
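As a rough illustration of the kind of low-bit quantization the abstract refers to (not the paper's actual scheme, which also involves knowledge distillation and layer-wise sensitivity analysis), the NumPy sketch below shows uniform symmetric fake quantization: float weights are rounded to a tiny integer grid and mapped back, which is the basic operation a 2-bit model builds on. The function name and parameters here are illustrative assumptions, not the authors' code.

import numpy as np

def fake_quantize(w, num_bits=2):
    # Illustrative sketch: map float weights to a symmetric num_bits
    # integer grid and dequantize back. For 2 bits the integer codes
    # lie in {-2, -1, 0, 1}.
    qmax = 2 ** (num_bits - 1) - 1
    scale = np.abs(w).max() / qmax if qmax > 0 else 1.0
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)  # integer codes
    return q * scale  # dequantized approximation of w

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 4)).astype(np.float32)
print("mean quantization error:", np.abs(w - fake_quantize(w)).mean())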

Link: Low-Bit Quantization of Transformer for Audio Speech Recognition