LiteASR: Efficient Automatic Speech Recognition with Low-Rank Approximation

Published 27 Feb 2025 in cs.LG, cs.AI, cs.SD, and eess.AS | (2502.20583v1)

Abstract: Modern automatic speech recognition (ASR) models, such as OpenAI's Whisper, rely on deep encoder-decoder architectures, and their encoders are a critical bottleneck for efficient deployment due to high computational intensity. We introduce LiteASR, a low-rank compression scheme for ASR encoders that significantly reduces inference costs while maintaining transcription accuracy. Our approach leverages the strong low-rank properties observed in intermediate activations: by applying principal component analysis (PCA) with a small calibration dataset, we approximate linear transformations with a chain of low-rank matrix multiplications, and further optimize self-attention to work in the reduced dimension. Evaluation results show that our method can compress Whisper large-v3's encoder size by over 50%, matching Whisper medium's size with better transcription accuracy, thereby establishing a new Pareto-optimal frontier of efficiency and performance. The code of LiteASR is available at https://github.com/efeslab/LiteASR.


Authors (4)

Summary

The paper proposes LiteASR, a low-rank approximation framework that applies PCA to calibration activations to compress ASR encoders, shrinking Whisper large-v3's encoder by over 50% with minimal impact on word error rate (WER).
The authors optimize self-attention by operating in reduced dimensions and implement a specialized Triton kernel for the encoder, achieving speedups of up to 1.57×.
Extensive experiments show that LiteASR generalizes across different languages and models, demonstrating favorable trade-offs between model size and accuracy.
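
The PCA-based factorization mentioned in the first point can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' code: the function name `pca_low_rank_linear`, the synthetic calibration data, the layer sizes, and the choice to run PCA on the layer's output activations are all made up for the example.

```python
import numpy as np

def pca_low_rank_linear(W, b, X_calib, k):
    """Approximate y = x @ W + b with two rank-k matmuls (hypothetical helper).

    PCA is run on the layer's outputs over a calibration set, and the
    layer is projected onto the top-k principal directions.
    """
    Y = X_calib @ W + b                      # calibration activations, (n, d_out)
    mean = Y.mean(axis=0)                    # activation mean, (d_out,)
    # Rows of Vt are the principal directions of the centered activations.
    _, _, Vt = np.linalg.svd(Y - mean, full_matrices=False)
    V_k = Vt[:k].T                           # top-k directions, (d_out, k)
    W1 = W @ V_k                             # first factor, (d_in, k)
    W2 = V_k.T                               # second factor, (k, d_out)
    b_new = mean + (b - mean) @ V_k @ V_k.T  # bias folded through the projection
    return W1, W2, b_new

# Synthetic demo: inputs lie in an r-dimensional subspace, so the
# activations are exactly low-rank (an assumption of this toy setup).
rng = np.random.default_rng(0)
d_in, d_out, r, n = 64, 64, 8, 256
basis = rng.standard_normal((r, d_in))
X_calib = rng.standard_normal((n, r)) @ basis
W = rng.standard_normal((d_in, d_out))
b = rng.standard_normal(d_out)

W1, W2, b_new = pca_low_rank_linear(W, b, X_calib, k=r)

# Evaluate on fresh inputs drawn from the same subspace.
X_test = rng.standard_normal((32, r)) @ basis
y_full = X_test @ W + b
y_low = (X_test @ W1) @ W2 + b_new
rel_err = np.linalg.norm(y_full - y_low) / np.linalg.norm(y_full)

params_full = W.size
params_low = W1.size + W2.size
print(f"relative error: {rel_err:.2e}")
print(f"weight parameters: {params_full} -> {params_low}")
```

In this toy setup the rank `k` equals the true rank of the activations, so the approximation is essentially exact. On a real encoder the activations are only approximately low-rank, and `k` becomes the knob that trades model size against accuracy.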

This paper proposes a low-rank approximation framework for compressing ASR encoder layers to achieve significant efficiency gains with minimal impact on transcription accuracy.

The authors apply PCA to calibration activations to decompose linear layers into low-rank matrix products, reducing Whisper large-v3's encoder size by over 50% while keeping WER close to the original in the quality-focused and balanced configurations.
They further optimize self-attention by operating in reduced dimensions and implement a specialized Triton kernel derived from FlashAttention, achieving encoder speedups of up to 1.57×.
Extensive experiments demonstrate that the approach generalizes across different languages and models, with detailed sensitivity analyses confirming favorable trade-offs between model size and accuracy.
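
One way self-attention can operate in a reduced dimension, once the query and key projections are factored into low-rank products, is to fold the second factors into a small precomputed matrix. The sketch below checks that identity numerically. It is an illustration under assumptions rather than the paper's formulation: the factor names `A_q`, `B_q`, `A_k`, `B_k`, the sizes, and the omission of biases are all hypothetical, and it covers only the attention scores, not the Triton kernel.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_model, d_head, k = 10, 32, 16, 4    # hypothetical sizes, with k < d_head

X = rng.standard_normal((n, d_model))

# Hypothetical rank-k factors: W_q ~ A_q @ B_q and W_k ~ A_k @ B_k.
A_q = rng.standard_normal((d_model, k))
B_q = rng.standard_normal((k, d_head))
A_k = rng.standard_normal((d_model, k))
B_k = rng.standard_normal((k, d_head))

# Standard path: expand queries and keys back to d_head before the dot product.
Q = X @ A_q @ B_q                         # (n, d_head)
K = X @ A_k @ B_k                         # (n, d_head)
scores_full = Q @ K.T / np.sqrt(d_head)

# Reduced path: Q @ K.T == (X @ A_q) @ (B_q @ B_k.T) @ (X @ A_k).T,
# so the (k, k) matrix M can be computed once, offline.
M = B_q @ B_k.T                           # (k, k)
Q_r = X @ A_q @ M                         # (n, k)
K_r = X @ A_k                             # (n, k)
scores_reduced = Q_r @ K_r.T / np.sqrt(d_head)

max_diff = np.abs(scores_full - scores_reduced).max()
print(f"max abs difference: {max_diff:.2e}")
```

The two paths give the same scores up to floating-point error, but the reduced path takes dot products over `k` dimensions instead of `d_head`, which is where the savings would come from when `k` is small.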