LiteASR: Efficient Automatic Speech Recognition with Low-Rank Approximation

Published 27 Feb 2025 in cs.LG, cs.AI, cs.SD, and eess.AS | (2502.20583v1)

Abstract: Modern automatic speech recognition (ASR) models, such as OpenAI's Whisper, rely on deep encoder-decoder architectures, and their encoders are a critical bottleneck for efficient deployment due to high computational intensity. We introduce LiteASR, a low-rank compression scheme for ASR encoders that significantly reduces inference costs while maintaining transcription accuracy. Our approach leverages the strong low-rank properties observed in intermediate activations: by applying principal component analysis (PCA) with a small calibration dataset, we approximate linear transformations with a chain of low-rank matrix multiplications, and further optimize self-attention to work in the reduced dimension. Evaluation results show that our method can compress Whisper large-v3's encoder size by over 50%, matching Whisper medium's size with better transcription accuracy, thereby establishing a new Pareto-optimal frontier of efficiency and performance. The code of LiteASR is available at https://github.com/efeslab/LiteASR.
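The PCA step described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation (see the linked repository for that); the function name `pca_low_rank_linear`, the synthetic calibration data, and the chosen rank `k` are all assumptions made for illustration. The idea shown: run a linear layer on calibration inputs, take the top-`k` principal directions of its outputs, and fold the projection into the weights so that one `d_in x d_out` matrix multiply becomes a chain of two thinner ones.

```python
import numpy as np

def pca_low_rank_linear(W, b, X_calib, k):
    """Approximate y = x @ W + b by y ~= (x @ A) @ B + bias with rank k.

    Hypothetical helper for illustration; not the LiteASR API.
    """
    Y = X_calib @ W + b                      # calibration activations, (n, d_out)
    mu = Y.mean(axis=0)                      # PCA works on mean-centered data
    # Right singular vectors of the centered outputs are the principal directions.
    _, _, Vt = np.linalg.svd(Y - mu, full_matrices=False)
    V_k = Vt[:k].T                           # top-k directions, (d_out, k)
    A = W @ V_k                              # (d_in, k)
    B = V_k.T                                # (k, d_out)
    # Projecting (y - mu) onto the subspace and adding mu back gives this bias.
    bias = mu + (b - mu) @ V_k @ V_k.T
    return A, B, bias

# Synthetic example: inputs live in an 8-dimensional subspace, so the
# layer's outputs are (after centering) exactly rank 8.
rng = np.random.default_rng(0)
d_in, d_out, n, true_rank, k = 64, 64, 512, 8, 8
X = rng.normal(size=(n, true_rank)) @ rng.normal(size=(true_rank, d_in))
W = rng.normal(size=(d_in, d_out))
b = rng.normal(size=d_out)

A, B, bias = pca_low_rank_linear(W, b, X, k)
Y_full = X @ W + b
Y_low_rank = (X @ A) @ B + bias
rel_err = np.linalg.norm(Y_full - Y_low_rank) / np.linalg.norm(Y_full)
params_full = W.size
params_low_rank = A.size + B.size
```

In this toy setup the activations are exactly low-rank, so the approximation error is at floating-point level while the two factors hold a quarter of the original weights. Real activations are only approximately low-rank, so choosing `k` trades accuracy for size.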

Summary

  • The paper proposes LiteASR, a low-rank approximation framework that applies PCA to calibration activations to compress ASR encoder layers (by over 50% for Whisper large-v3's encoder) with minimal WER impact.
  • The authors optimize self-attention by operating in reduced dimensions and implement a specialized Triton kernel for the encoder, achieving speedups of up to 1.57×.
  • Extensive experiments show that LiteASR generalizes across different languages and models, demonstrating favorable trade-offs between model size and accuracy.

This paper proposes a low-rank approximation framework for compressing ASR encoder layers to achieve significant efficiency gains with minimal impact on transcription accuracy.

  • The authors apply PCA to calibration activations to decompose linear layers into low-rank matrix products, reducing encoder size by up to 50% while maintaining near-original WER in the quality-focused and balanced configurations.
  • They further optimize self-attention by operating in reduced dimensions and implement a specialized Triton kernel derived from FlashAttention, achieving encoder speedups of up to 1.57×.
  • Extensive experiments demonstrate that the approach generalizes across different languages and models, with detailed sensitivity analyses confirming favorable trade-offs between model size and accuracy.
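One way to see why self-attention can operate in the reduced dimension is the identity below, sketched in NumPy. It assumes the query and key projections have already been factored into low-rank pairs (`A_q`, `B_q`, `A_k`, `B_k` are hypothetical names), omits biases for brevity, and is not a reproduction of the paper's Triton kernel. The point is only that the attention scores can be computed from rank-`r` queries and keys plus a small precomputed `r x r` matrix, without materializing the full-size Q and K.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_model, d_head, r = 10, 32, 16, 4        # sequence length, widths, rank (all illustrative)

X = rng.normal(size=(n, d_model))
# Assumed low-rank factors of the query and key projections (biases omitted).
A_q, B_q = rng.normal(size=(d_model, r)), rng.normal(size=(r, d_head))
A_k, B_k = rng.normal(size=(d_model, r)), rng.normal(size=(r, d_head))

# Reference path: expand back to full head dimension before the dot product.
Q = (X @ A_q) @ B_q                          # (n, d_head)
K = (X @ A_k) @ B_k                          # (n, d_head)
scores_full = Q @ K.T / np.sqrt(d_head)

# Reduced path: keep queries and keys at rank r and fold B_q @ B_k.T
# into a small matrix that can be computed once offline.
M = B_q @ B_k.T                              # (r, r)
Q_r = X @ A_q                                # (n, r)
K_r = X @ A_k                                # (n, r)
scores_reduced = (Q_r @ M) @ K_r.T / np.sqrt(d_head)

max_abs_diff = np.abs(scores_full - scores_reduced).max()
```

Both paths give the same scores up to floating-point rounding, but the reduced path computes the `n x n` dot products over `r` features instead of `d_head`.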
