
LiteASR: Efficient Automatic Speech Recognition with Low-Rank Approximation (2502.20583v1)

Published 27 Feb 2025 in cs.LG, cs.AI, cs.SD, and eess.AS

Abstract: Modern automatic speech recognition (ASR) models, such as OpenAI's Whisper, rely on deep encoder-decoder architectures, and their encoders are a critical bottleneck for efficient deployment due to high computational intensity. We introduce LiteASR, a low-rank compression scheme for ASR encoders that significantly reduces inference costs while maintaining transcription accuracy. Our approach leverages the strong low-rank properties observed in intermediate activations: by applying principal component analysis (PCA) with a small calibration dataset, we approximate linear transformations with a chain of low-rank matrix multiplications, and further optimize self-attention to work in the reduced dimension. Evaluation results show that our method can compress Whisper large-v3's encoder size by over 50%, matching Whisper medium's size with better transcription accuracy, thereby establishing a new Pareto-optimal frontier of efficiency and performance. The code of LiteASR is available at https://github.com/efeslab/LiteASR.
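
As a concrete illustration of the compression step described in the abstract, the sketch below applies PCA-based low-rank factorization to a single linear layer using calibration activations. This is a minimal sketch of the idea, not the LiteASR implementation (see the linked repository for that); the function and variable names are illustrative, and the paper's per-layer rank selection is omitted.

```python
# Minimal sketch of PCA-based factorization of one linear layer, following the
# idea in the abstract. Not the LiteASR implementation; names are illustrative.
import torch
import torch.nn as nn

def lowrank_factorize(linear: nn.Linear, calib_inputs: torch.Tensor, rank: int) -> nn.Sequential:
    """Replace `linear` (d_in -> d_out) with a chain of two low-rank matmuls.

    calib_inputs: (num_samples, d_in) activations feeding this layer on a small
                  calibration set.
    rank:         number of principal components kept (r << d_out).
    """
    with torch.no_grad():
        # Layer outputs on the calibration set (bias excluded for the PCA).
        outputs = calib_inputs @ linear.weight.T                 # (N, d_out)
        mean = outputs.mean(dim=0)                               # (d_out,)
        # PCA via SVD of the centered outputs; rows of vh are principal directions.
        _, _, vh = torch.linalg.svd(outputs - mean, full_matrices=False)
        u_k = vh[:rank].T                                        # (d_out, r)

        # W ≈ U_k (U_kᵀ W): an (r x d_in) matmul followed by a (d_out x r) matmul.
        down = nn.Linear(linear.in_features, rank, bias=False)
        up = nn.Linear(rank, linear.out_features, bias=True)
        down.weight.copy_(u_k.T @ linear.weight)                 # (r, d_in)
        up.weight.copy_(u_k)                                     # (d_out, r)
        # Fold the PCA mean (and any original bias) into the output bias.
        bias = mean - u_k @ (u_k.T @ mean)
        if linear.bias is not None:
            bias = bias + linear.bias
        up.bias.copy_(bias)
    return nn.Sequential(down, up)
```

The original d_out × d_in weight is replaced by an r × d_in projection followed by a d_out × r expansion, so the layer shrinks whenever r(d_in + d_out) < d_in · d_out.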

Summary

  • The paper proposes LiteASR, a low-rank approximation framework that uses PCA on calibration activations to compress ASR encoder layers by over 50% with minimal WER impact (a size-reduction sketch follows this list).
  • The authors optimize self-attention by operating in reduced dimensions and implement a specialized Triton kernel for the encoder, achieving speedups of up to 1.57×.
  • Extensive experiments show that LiteASR generalizes across different languages and models, demonstrating favorable trade-offs between model size and accuracy.
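
As a rough illustration of the size reduction mentioned in the first bullet, the hypothetical `lowrank_factorize` sketch above can be applied to a 1280-dimensional layer (the width of Whisper large-v3's encoder). The rank of 256 and the random calibration data are illustrative stand-ins, not values from the paper.

```python
# Illustrative usage of the lowrank_factorize sketch above; rank=256 is an
# arbitrary choice, not a value taken from the paper.
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Linear(1280, 1280)
calib = torch.randn(4096, 1280)          # stand-in for real calibration activations
compressed = lowrank_factorize(layer, calib, rank=256)

dense_params = sum(p.numel() for p in layer.parameters())
lowrank_params = sum(p.numel() for p in compressed.parameters())
print(f"params: {dense_params} -> {lowrank_params} "
      f"({lowrank_params / dense_params:.0%} of original)")

# Approximation error on held-out inputs. With random data the activations have
# no low-rank structure, so the error is large; on real encoder activations with
# strong low-rank structure it should be small.
x = torch.randn(16, 1280)
err = (layer(x) - compressed(x)).norm() / layer(x).norm()
print(f"relative output error: {err:.3f}")
```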

This paper proposes a low-rank approximation framework for compressing ASR encoder layers to achieve significant efficiency gains with minimal impact on transcription accuracy.

  • The authors use PCA on calibration activations to decompose linear layers into low-rank matrix products, reducing encoder size by over 50% while maintaining near-original WER in the quality-focused and balanced configurations.
  • They further optimize self-attention by operating in reduced dimensions and implement a specialized Triton kernel derived from FlashAttention, achieving encoder speedups of up to 1.57× (a reduced-dimension attention sketch follows this list).
  • Extensive experiments demonstrate that the approach generalizes across different languages and models, with detailed sensitivity analyses confirming favorable trade-offs between model size and accuracy.
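
The reduced-dimension attention in the second bullet can be pictured as follows: once the query and key projections are factorized as above (W_q ≈ U_q(U_qᵀW_q), W_k ≈ U_k(U_kᵀW_k)), the score matrix Q Kᵀ collapses to products in the rank-r space, Q_r (U_qᵀ U_k) K_rᵀ. The sketch below shows only this algebra for a single head, without masking, projection biases, or the value path; it is a hedged reading of the summary, not the paper's FlashAttention-derived Triton kernel, and all names are illustrative.

```python
# Sketch of attention scores computed in the reduced dimension, assuming the
# query/key projections were factorized as in the earlier sketch.
import torch

def reduced_dim_attention_scores(x, Wq_down, Uq, Wk_down, Uk, scale):
    """x: (seq, d_model); Wq_down, Wk_down: (r, d_model); Uq, Uk: (d_model, r)."""
    q_r = x @ Wq_down.T                 # (seq, r)   queries in the reduced space
    k_r = x @ Wk_down.T                 # (seq, r)   keys in the reduced space
    mix = Uq.T @ Uk                     # (r, r)     fixed per layer, precomputable
    return scale * (q_r @ mix @ k_r.T)  # (seq, seq) ≈ scale * (Q @ K.T)

# Example shapes only (random weights): d_model=1280, r=256, seq=100.
x = torch.randn(100, 1280)
Wq_down, Wk_down = torch.randn(256, 1280), torch.randn(256, 1280)
Uq, Uk = torch.randn(1280, 256), torch.randn(1280, 256)
scores = reduced_dim_attention_scores(x, Wq_down, Uq, Wk_down, Uk, scale=1280 ** -0.5)
attn = torch.softmax(scores, dim=-1)    # (seq, seq), as in standard attention
```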
