
LoRA-Finetuned Whisper ASR

Updated 13 November 2025
  • LoRA-Finetuned Whisper is an approach that integrates low-rank adapters into frozen Whisper models to achieve efficient and multilingual automatic speech recognition.
  • The method reduces trainable parameters to roughly 5% of the full model while lowering the average WER from 11.68% to 9.52% in multilingual settings.
  • It optimizes performance by freezing base weights and training only adapter modules in attention and feed-forward layers using an AdamW optimizer at a learning rate of 1e-4.

Low-Rank Adaptation (LoRA)-finetuned Whisper denotes a series of methods and empirical studies for parameter-efficient, language-extensible, and robust fine-tuning of OpenAI’s Whisper automatic speech recognition (ASR) models via low-rank adapters. Collectively, these works establish LoRA as a principal mechanism for scalable, interference-free multilingual ASR with strong performance and minimal resource overhead, supported by rigorous experimentation and established mathematical formulations.

1. Whisper Backbone and LoRA Integration

Whisper is an encoder–decoder Transformer ASR architecture, typically consisting of:

  • An encoder (e.g., 12 layers for Whisper-small) with multi-head self-attention and feed-forward sublayers, ingesting 80-dimensional log-Mel spectrogram frames.
  • A decoder (e.g., 12 layers for Whisper-small), employing masked self-attention, cross-attention over encoder states, and subsequent feed-forward layers.

Each Transformer block contains learnable projection matrices: queries, keys, values (Wq, Wk, Wv), output projection (Wo), and feed-forward matrices (W_fc).

LoRA augments the frozen Whisper parameters by inserting trainable low-rank adapters into specific weight matrices. For each targeted matrix $W \in \mathbb{R}^{d_1 \times d_2}$, LoRA replaces the forward mapping

$f(x) = (W + \Delta W)x + b$

with the low-rank update

$\Delta W = BA, \quad B \in \mathbb{R}^{d_1 \times r}, \quad A \in \mathbb{R}^{r \times d_2}, \quad r \ll \min(d_1, d_2),$

yielding

$W'x = Wx + B(Ax),$

where the original $W$ is frozen and only $A$ and $B$ are optimized.

In practice, LoRA adapters are attached to:

  • The three projection matrices in self-attention and cross-attention: Wq, Wk, Wv.
  • The first feed-forward projection, W_fc.

The LoRA rank $r$ determines adapter capacity and is typically set between 16 and 48; $r = 32$ is found to be optimal for Whisper-small. The total adapter parameter count per language is approximately 13M ($\approx$5% of the full model for Whisper-small) at $r = 32$ (Song et al., 7 Jun 2024).
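
As a rough per-matrix illustration of these savings (assuming a square projection at Whisper-small's hidden size of 768, which is an assumption for illustration; exact totals depend on which matrices are wrapped):

d, r = 768, 32                      # assumed Whisper-small hidden size and LoRA rank
dense_update = d * d                # a full delta-W for one square projection: 589,824 params
lora_update = r * d + d * r         # A (r x d) plus B (d x r): 49,152 params, roughly 8% of dense
print(dense_update, lora_update, lora_update / dense_update)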

2. Training Methodology and Optimization

Training a LoRA-finetuned Whisper model involves:

  • A standard autoregressive cross-entropy loss over Whisper’s output token sequence. No CTC loss is used.
  • Optimization via AdamW (β₁ = 0.9, β₂ = 0.999, weight decay = 0.01), with a peak learning rate of $1 \times 10^{-4}$ for the LoRA modules and a zero learning rate for the frozen base parameters.
  • Typical hardware: 2 × RTX3090 GPUs; batch size commonly 16–32 per GPU for Whisper-small.
  • Epoch count: 10 for LoRA-Whisper multilingual experiments (Song et al., 7 Jun 2024).

Multilingual training does not require explicit curriculum or temperature-based sampling. Each batch contains a single language’s data, and only the corresponding LoRA module is activated for forward and backward passes.
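
A minimal sketch of this optimizer setup is given below. It assumes a transformers-style Whisper interface where `model` already has LoRA adapters attached and base weights frozen, and where `train_loader` yields single-language batches; those objects are not from the paper's released code.

import torch

# Assumes `model` has LoRA adapters attached, base weights frozen (requires_grad=False),
# and `train_loader` yields one language's data per batch.
lora_params = [p for p in model.parameters() if p.requires_grad]

optimizer = torch.optim.AdamW(
    lora_params,
    lr=1e-4,                        # peak learning rate for the LoRA modules
    betas=(0.9, 0.999),
    weight_decay=0.01,
)

for batch in train_loader:
    out = model(input_features=batch["input_features"], labels=batch["labels"])
    out.loss.backward()             # autoregressive cross-entropy over output tokens
    optimizer.step()
    optimizer.zero_grad()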

3. Parameter-Efficiency, Extensibility, and Language Expansion

LoRA delivers substantial parameter savings over conventional full-model fine-tuning:

Finetune Style    | # Trainable Params | Avg WER (%)
------------------|--------------------|------------
Full multilingual | 240M               | 11.68
Full monolingual  | 960M               | 9.19
LoRA-Whisper      | 52M (4 × 13M)      | 9.52

Efficient language addition is enabled by LoRA’s modularity:

  • Adding a new language $\ell_\text{new}$ requires instantiating a fresh LoRA module ($\sim$13M parameters) and training only its adapter on $\ell_\text{new}$ data while freezing Whisper and all existing adapters.
  • Base languages’ performance remains stable, with zero catastrophic forgetting.

Initialization of new-language LoRA modules is best performed by copying from an existing base-language LoRA (identified via Whisper’s built-in LID), or by interpolating between two base experts for an additional 5% performance gain (Song et al., 7 Jun 2024).
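
Such a warm start amounts to copying (or averaging) the A/B matrices of existing adapters into the new module. The sketch below is a hypothetical illustration; the function and variable names are not from the paper's code.

def warm_start_lora(new_adapter, base_adapters, weights=None):
    # Copy (one base adapter) or interpolate (two base adapters) the A/B matrices
    # of existing base-language adapters into a freshly created new-language adapter.
    if weights is None:
        weights = [1.0 / len(base_adapters)] * len(base_adapters)
    new_state = {
        key: sum(w * a.state_dict()[key] for a, w in zip(base_adapters, weights))
        for key in new_adapter.state_dict()
    }
    new_adapter.load_state_dict(new_state)

# Copy from the closest base expert picked by Whisper's LID, or blend two experts:
# warm_start_lora(lora_danish, [lora_polish, lora_portuguese], weights=[0.6, 0.4])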

4. Empirical Results: Multilingual ASR and Language Expansion

LoRA-Whisper achieves strong empirical results:

  • Multilingual ASR (PL, PT, IT, ZH): Reduces average WER from 11.68% (full multilingual fine-tuning) to 9.52%, an 18.5% relative gain, and closely matches the monolingual fine-tuning oracle (9.19%) at only about 5% of the trainable-parameter cost.
  • Language Expansion (DA, EL, CY, JA): Training LoRA adapters exclusively on new languages yields 22.97% WER versus 28.33% for full fine-tuning, with base languages staying at 9.52% WER. This constitutes a 23.0% relative improvement over full fine-tuning for new-language adaptation (Song et al., 7 Jun 2024).

Ablation on the LoRA rank shows a clear optimum at $r = 32$ for Polish ASR: WER decreases to 7.94% at $r = 32$ and does not improve further at higher ranks, which only increase the parameter count (Song et al., 7 Jun 2024).

5. Practical Implementation Details and Best Practices

Implementation of LoRA-Whisper is straightforward in any PyTorch-style Transformer framework. The HuggingFace peft library or a custom wrapper for Linear layers is sufficient.
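
One possible route is to attach adapters with the peft library, as sketched below. The target module names (q_proj, k_proj, v_proj, fc1) follow the transformers implementation of Whisper; the alpha and dropout values are illustrative choices, not settings reported in the paper.

from transformers import WhisperForConditionalGeneration
from peft import LoraConfig, get_peft_model

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

lora_config = LoraConfig(
    r=32,                                            # adapter rank
    lora_alpha=64,                                   # scaling factor (illustrative choice)
    target_modules=["q_proj", "k_proj", "v_proj", "fc1"],
    lora_dropout=0.05,                               # illustrative choice
    bias="none",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()                   # only the LoRA adapters remain trainable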

Alternatively, a custom wrapper around nn.Linear can be written directly; illustrative PyTorch code for such a LoRA adapter:

import math
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, W: nn.Linear, r: int = 32):
        super().__init__()
        self.W = W                                    # original projection, kept frozen
        for p in self.W.parameters():
            p.requires_grad_(False)
        # A is Kaiming-initialized, B starts at zero, so delta-W = 0 at initialization.
        self.A = nn.Parameter(torch.empty(r, W.in_features))
        self.B = nn.Parameter(torch.zeros(W.out_features, r))
        nn.init.kaiming_uniform_(self.A, a=math.sqrt(5))

    def forward(self, x):
        # W x + B (A x); x has shape (..., in_features), so multiply by the transposes.
        return self.W(x) + (x @ self.A.t()) @ self.B.t()
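
Applied to a single projection, the wrapper replaces the original layer in place; the module path below follows the transformers Whisper implementation and is illustrative:

block = model.model.encoder.layers[0]
block.self_attn.q_proj = LoRALinear(block.self_attn.q_proj, r=32)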

Recommended practices include:

  • Setting the LoRA rank $r$ in the effective range of 16–48, with $r = 32$ for Whisper-small.
  • Inserting adapters into attention projections (Q,K,V) and the first FFN layer per block.
  • Freezing the base Whisper model and updating only LoRA parameters at high learning rate (1e-4).
  • Warm-starting new-language LoRA modules from the most similar existing base adapter.
  • For language expansion, mixing two base experts with a soft-gating MoE for the best new-language performance; a minimal sketch follows this list.
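
One possible realization of such a soft gate combines the low-rank updates of two frozen base-language adapters through a learned mixing weight. The class below is a hypothetical sketch, not the paper's released implementation, and reuses the LoRALinear A/B attributes from the code above.

import torch
import torch.nn as nn

class GatedLoRAPair(nn.Module):
    # Soft-gated mixture of two frozen base-language LoRA experts over one projection.
    def __init__(self, base_layer: nn.Linear, expert_a: "LoRALinear", expert_b: "LoRALinear"):
        super().__init__()
        self.base_layer = base_layer                  # frozen Whisper projection
        self.expert_a, self.expert_b = expert_a, expert_b
        self.gate = nn.Parameter(torch.zeros(1))      # sigmoid(0) = 0.5: equal mix at init

    def forward(self, x):
        g = torch.sigmoid(self.gate)
        delta_a = (x @ self.expert_a.A.t()) @ self.expert_a.B.t()
        delta_b = (x @ self.expert_b.A.t()) @ self.expert_b.B.t()
        return self.base_layer(x) + g * delta_a + (1 - g) * delta_b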

6. Theoretical and Practical Impact

LoRA-Whisper’s plug-and-play modularity establishes a scalable paradigm for continual language expansion without rehearsal or forgetting, making it well suited to multilingual ASR deployment. It greatly reduces compute and memory footprints by restricting fine-tuning to low-rank adapters, yet achieves performance competitive with full fine-tuning.

LoRA adapters allow for:

  • Near-oracle monolingual performance with only a fraction of parameters.
  • Robust addition of languages, critical for real-world ASR deployments in dynamically evolving linguistic environments.
  • Strong mitigation of language interference, a central challenge in multilingual ASR training.

These contributions establish LoRA-Whisper as a highly extensible and interference-free solution for multilingual recognition in both research and production settings (Song et al., 7 Jun 2024).
