Prompt Expansion in Multilingual ASR
- Prompt Expansion is a method that uses trainable soft prompts to expand a frozen Whisper model to support new languages without full model retraining.
- It integrates encoder-decoder prompt injection with language-aware construction to effectively reduce cross-language interference and avoid catastrophic forgetting.
- Evaluations on FLEURS language-expansion tasks with Whisper-small show prompt expansion reaching about 12% average CER with 0.17M to 0.96M reported trainable parameters, versus 240.58M for full fine-tuning, which still attains the lowest CER (8.87%).
Language-aware prompt tuning for multilingual automatic speech recognition is a parameter-efficient method for expanding a frozen Whisper model to previously unseen languages by adding trainable soft prompts rather than re-training the backbone. In this setting, “prompt expansion” denotes expansion of language coverage through extra prompt parameters, with the base model weights kept frozen or largely frozen. The approach introduced in “Language-Aware Prompt Tuning for Parameter-Efficient Seamless Language Expansion in Multilingual ASR” combines encoder–decoder soft prompt injection, language-aware prompt construction, and a continual-learning toolkit intended to reduce language interference, avoid catastrophic forgetting, and keep computational overhead low during multilingual ASR expansion (Yang et al., 16 Jun 2025).
1. Problem setting and scope
Multilingual ASR systems such as Whisper are organized as all-in-one models shared across many languages. The paper identifies two persistent difficulties in that regime: language interference, where shared parameters can induce cross-language confusion, and language expansion to unseen languages, where adding a new language by updating the full model is both expensive and vulnerable to catastrophic forgetting (Yang et al., 16 Jun 2025).
The work frames language expansion as a continual-learning problem under stringent efficiency constraints. Full Fine-Tuning (FFT) updates hundreds of millions of parameters and may degrade performance on previously supported languages. Continual-learning schemes based on replay or regularization add further complexity and may require retaining past data. By contrast, Parameter-Efficient Fine-Tuning (PEFT) methods modify only a small subset of parameters or add small modules, allowing the base model to remain frozen while allocating separate adaptation parameters to new languages. Within that family, Soft Prompt Tuning (SPT) is treated as especially suitable because language-specific information is encoded as continuous learnable prompt vectors, and each new language can receive its own added prompt head without overwriting existing ones (Yang et al., 16 Jun 2025).
The paper’s objective is therefore narrowly defined: expand a multilingual Whisper model to unsupported languages while keeping additional parameters small, preserving prior multilingual knowledge, and reducing language interference through explicitly language-aware prompt mechanisms. This suggests a modular view of multilingual ASR in which language growth is achieved by enlarging a prompt library rather than by revising the backbone itself.
2. Whisper backbone and prompt insertion scheme
The underlying model is Whisper, an encoder–decoder Transformer. The encoder consumes acoustic features $X \in \mathbb{R}^{T \times d}$, where $T$ is sequence length in frames and $d$ is the embedding dimension, and produces hidden states summarizing the audio. The decoder is a Transformer language model conditioned on encoder outputs and on a sequence of special tokens such as start, language, and task markers (Yang et al., 16 Jun 2025).
The core modification is external to Whisper’s internal layers. On the encoder side, the method prepends a learnable soft prompt matrix $P_e \in \mathbb{R}^{L_e \times d}$ to the acoustic feature sequence, yielding
$$X' = [P_e; X] \in \mathbb{R}^{(L_e + T) \times d}.$$
On the decoder side, it inserts a learnable prompt matrix $P_d \in \mathbb{R}^{L_d \times d}$ where Whisper normally uses the ⟨Prev⟩ special token, producing an input of the form
$$Y' = [P_d; S],$$
where $S \in \mathbb{R}^{4 \times d}$ denotes the embedded special-token sequence; the paper notes four special tokens in this construction (Yang et al., 16 Jun 2025).
Because these prompt vectors live in the same embedding space as model inputs, the backbone Transformer and its attention mechanisms remain unchanged. The prompts simply appear as additional tokens in self-attention and cross-attention. This architectural choice is central to the paper’s notion of seamless expansion: the model is adapted by modifying only the inputs presented to frozen Transformer layers, not the layers themselves.
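The input-level nature of this scheme can be illustrated with a minimal sketch. It uses plain Python lists as stand-ins for embedding matrices and assumes that the ⟨Prev⟩ slot sits ahead of the start, language, and task tokens, as in Whisper's usual prompt layout. The function names, the toy dimension, and the initialization scale are illustrative assumptions, not taken from the paper or from SPT-Whisper.

```python
import random

def make_prompt(length, dim, seed):
    """Create a prompt matrix as a list of `length` vectors of size `dim`.

    In training, these values would be the only learnable parameters.
    The Gaussian initialization scale is an arbitrary choice for this sketch.
    """
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(length)]

def prepend_encoder_prompt(encoder_prompt, acoustic_features):
    """Encoder input: prompt vectors followed by the acoustic frames."""
    return encoder_prompt + acoustic_features

def build_decoder_input(decoder_prompt, special_token_embeddings):
    """Decoder prefix: prompt vectors placed ahead of the special tokens."""
    return decoder_prompt + special_token_embeddings

DIM = 8                                      # toy embedding size (hypothetical)
frames = [[0.0] * DIM for _ in range(10)]    # stand-in for 10 acoustic frames
specials = [[1.0] * DIM for _ in range(4)]   # stand-in for 4 special tokens

enc_prompt = make_prompt(3, DIM, seed=0)
dec_prompt = make_prompt(2, DIM, seed=1)

enc_input = prepend_encoder_prompt(enc_prompt, frames)
dec_input = build_decoder_input(dec_prompt, specials)
print(len(enc_input), len(dec_input))
```

Nothing in the sketch touches a model layer: the frozen backbone would simply receive longer input sequences, which is the property the paragraph above describes.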
3. Entire Soft Prompt Tuning
The paper distinguishes several prompt-placement regimes: encoder-only SPT, decoder-only SPT, and Entire Soft Prompt Tuning, which attaches prompts to both encoder and decoder. A soft prompt is defined here as a sequence of continuous learnable vectors optimized by gradient descent and not drawn from the vocabulary. In the proposed setup, each language receives its own encoder prompt and decoder prompt while Whisper weights remain frozen (Yang et al., 16 Jun 2025).
Entire SPT is motivated by the observation that multilingual ASR adaptation involves both acoustic representation and sequence generation. Decoder-only prompting affects language modeling and decoding; encoder-side prompting can also modulate acoustic feature extraction. The empirical comparison on Asturian with Whisper-small shows that this broader intervention is beneficial, particularly at larger prompt lengths.
| Prompt length | Decoder SPT CER (%) | Entire SPT CER (%) |
|---|---|---|
| 16 | 11.61 | 11.77 |
| 32 | 11.48 | 11.61 |
| 64 | 11.41 | 10.82 |
| 128 | 11.91 | 10.31 |
Encoder-only SPT is consistently worse than decoder-only SPT in the reported ablation, while Entire SPT becomes best at prompt length 64 and especially 128. The paper also reports an encoder-only CER of 12.33 at prompt length 128, reinforcing that prompt placement materially affects expansion quality. Decoder length constraints prevent a direct 256-token decoder comparison; for Entire SPT at that setting, the paper uses 256 encoder tokens and 128 decoder tokens, with CER 10.57 (Yang et al., 16 Jun 2025).
The abstract reports that in language expansion tasks, Entire SPT outperforms Decoder SPT by 5.0%. The reported ablations support the same direction of effect: prompting both halves of the encoder–decoder model yields stronger adaptation than restricting prompting to the decoder alone. A plausible implication is that language expansion in speech models is not reducible to decoder-side lexical adaptation; the encoder must also be nudged toward language-dependent acoustic structure.
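As a worked check on the direction of effect, the following sketch computes the relative CER reduction of Entire SPT over Decoder SPT at each prompt length in the Asturian ablation table. The numbers are copied from that table; the helper name is illustrative. The sketch does not claim to reproduce how the abstract's 5.0% figure was aggregated, which is not stated here.

```python
# CER (%) by prompt length, copied from the Asturian / Whisper-small ablation.
DECODER_SPT = {16: 11.61, 32: 11.48, 64: 11.41, 128: 11.91}
ENTIRE_SPT = {16: 11.77, 32: 11.61, 64: 10.82, 128: 10.31}

def relative_reduction(baseline, candidate):
    """Relative CER reduction of `candidate` over `baseline`, in percent.

    Positive values mean the candidate has the lower (better) CER.
    """
    return 100.0 * (baseline - candidate) / baseline

reductions = {
    length: relative_reduction(DECODER_SPT[length], ENTIRE_SPT[length])
    for length in DECODER_SPT
}
for length, value in reductions.items():
    print(f"prompt length {length:>3}: {value:+.2f}%")
```

The sign flips between lengths 32 and 64: Entire SPT is slightly worse at short prompts and better at long ones, matching the table.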
4. Language-Aware Prompt Tuning
Entire SPT gives each language its own prompts, but by itself it does not explicitly encode cross-lingual structure. Language-Aware Prompt Tuning (LAPT) adds a second layer of organization by exploiting Whisper’s language identification and pretrained language embeddings to derive shared and language-specific prompt components (Yang et al., 16 Jun 2025).
The first stage is similarity estimation for a new language. The method samples $N$ audio segments, runs Whisper language identification on each one, and obtains probability vectors
$$p^{(n)} \in \mathbb{R}^{K}, \quad n = 1, \dots, N,$$
over $K$ base languages. Similarity to base language $k$ is defined as
$$s_k = \frac{1}{N} \sum_{n=1}^{N} p^{(n)}_k,$$
and the most similar base language is selected from the largest $s_k$. This most-similar language is then used to initialize or guide prompts for the new language (Yang et al., 16 Jun 2025).
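The similarity step can be sketched in a few lines, assuming the similarity score is the mean language-identification probability across the sampled segments. The function name, the base-language codes, and the probability values below are made up for illustration; in practice the vectors would come from Whisper's language-identification output.

```python
def most_similar_base_language(lid_probabilities, base_languages):
    """Average per-segment LID probabilities and return the top base language.

    lid_probabilities: list of probability vectors, one per audio segment,
    each with one entry per base language (same order as `base_languages`).
    Returns the best language and a dict of mean scores per base language.
    """
    if not lid_probabilities:
        raise ValueError("need at least one audio segment")
    n_segments = len(lid_probabilities)
    similarity = [
        sum(probs[k] for probs in lid_probabilities) / n_segments
        for k in range(len(base_languages))
    ]
    best = max(range(len(base_languages)), key=lambda k: similarity[k])
    return base_languages[best], dict(zip(base_languages, similarity))

# Made-up LID outputs for three segments of a hypothetical new language.
base = ["es", "pt", "fr"]
lid = [
    [0.70, 0.20, 0.10],
    [0.50, 0.40, 0.10],
    [0.60, 0.30, 0.10],
]
language, scores = most_similar_base_language(lid, base)
print(language, scores)
```

Averaging over several segments rather than trusting one prediction makes the choice less sensitive to a single atypical utterance.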
The second stage introduces language prompts derived from Whisper’s pretrained language token embeddings via a prompt encoder. Although the paper does not provide an explicit equation for the prompt encoder, it states that language embeddings are transformed into prompt vectors compatible with acoustic dimensions and then used together with learnable soft prompts. Two variants follow from this construction. In Shared Language Prompt Tuning, a common soft prompt is reused across languages and combined with language-aware conditioning; parameter growth does not scale linearly with the number of languages. In Separate Language Prompt Tuning, each language has its own dedicated prompt matrices, and only the prompt corresponding to the input language is activated (Yang et al., 16 Jun 2025).
This shared-versus-separate decomposition is the paper’s core language-aware idea. Shared prompts encode cross-lingual commonality; separate prompts or language-conditioned components capture language-specific nuances. The abstract reports that LAPT outperforms Decoder SPT by 16.0% in language expansion tasks. The reported experimental tables further show that adding LAPT on top of Entire SPT yields consistent improvements, suggesting that the gains are not merely a consequence of prompt existence but of how prompt space is organized around language similarity.
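The difference between the two variants can be pictured as a difference in how prompts are stored and selected. The sketch below is a simplified illustration only: it deliberately omits the prompt encoder and language embeddings, since no equation is given for them, and the class names, toy dimension, and prompt values are hypothetical.

```python
class SharedPrompts:
    """One soft prompt reused for every language."""

    def __init__(self, prompt):
        self.prompt = prompt

    def select(self, language):
        # The same matrix is returned regardless of the input language.
        return self.prompt

    def num_parameters(self):
        return sum(len(row) for row in self.prompt)


class SeparatePrompts:
    """One dedicated soft prompt per language; only one is active per input."""

    def __init__(self, prompts_by_language):
        self.prompts_by_language = prompts_by_language

    def select(self, language):
        # Only the prompt belonging to the input language is activated.
        return self.prompts_by_language[language]

    def num_parameters(self):
        return sum(
            len(row)
            for prompt in self.prompts_by_language.values()
            for row in prompt
        )


DIM = 8  # toy embedding size (hypothetical)

def toy_prompt(fill, length=4):
    """Build a constant-valued prompt matrix for demonstration."""
    return [[fill] * DIM for _ in range(length)]

shared = SharedPrompts(toy_prompt(0.1))
separate = SeparatePrompts({
    "ast": toy_prompt(0.1),
    "ckb": toy_prompt(0.2),
    "kea": toy_prompt(0.3),
})
print(shared.num_parameters(), separate.num_parameters())
```

In this toy setup the shared store stays at a fixed size as languages are added, while the separate store grows with each new language, which mirrors the scaling contrast described above.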
5. Toolkit, experiments, and empirical performance
The implementation vehicle is SPT-Whisper, an open-source toolkit integrating prompt-based PEFT methods into Whisper, including P-Tuning v2, Residual Prompt Tuning (ResMLP), LoPT, Entire SPT, and LAPT. It is described as a practical continual-learning framework in which the same Whisper backbone is reused while language-specific prompt modules are trained, stored, and loaded independently (Yang et al., 16 Jun 2025).
The main experiments use the FLEURS benchmark. Although FLEURS covers 102 languages, the paper focuses on three languages from the set unsupported by Whisper: Asturian (ast), Sorani Kurdish (ckb), and Kabuverdianu (kea). Each has roughly 10–12 hours of training data, with utterances shorter than 30 seconds. Evaluation is by Character Error Rate (CER), using Whisper-small and Whisper-medium backbones, 20 epochs, batch size 8, greedy search decoding, and a single NVIDIA RTX 3090 with 24 GB VRAM. FFT uses its own initial learning rate with linear decay, while LoRA and SPT-based methods share a different one. LoRA is applied to attention layers with rank 8 (Yang et al., 16 Jun 2025).
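For readers unfamiliar with the metric, Character Error Rate is the character-level edit distance between hypothesis and reference, divided by the number of reference characters. The sketch below is a generic implementation, not the paper's evaluation script; it assumes corpus-level aggregation and applies no text normalization, neither of which is specified here. The example strings are made up.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_item in enumerate(reference, start=1):
        current = [i]
        for j, hyp_item in enumerate(hypothesis, start=1):
            cost = 0 if ref_item == hyp_item else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]

def character_error_rate(references, hypotheses):
    """Corpus-level CER in percent: total edits / total reference characters."""
    total_edits = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    total_chars = sum(len(r) for r in references)
    if total_chars == 0:
        raise ValueError("references contain no characters")
    return 100.0 * total_edits / total_chars

refs = ["hola mundo", "bon dia"]
hyps = ["hola mundo", "bon dea"]
cer = character_error_rate(refs, hyps)
print(f"{cer:.2f}%")
```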
For Whisper-small, the average CER across the three unseen languages is summarized below.
| Method | Avg. CER | Trainable parameters |
|---|---|---|
| Baseline Whisper-small | 36.10% | — |
| FFT | 8.87% | 240.58M |
| LoRA | 12.93% | 1M |
| Shared Entire SPT | 12.43% | 0.17M |
| Shared Entire SPT + LAPT | 12.34% | 0.96M |
| Separate Entire SPT | 12.54% | 0.17M |
| Separate Entire SPT + LAPT | 12.07% | 0.96M |
FFT remains the strongest in raw CER, but at the cost of full-model updating. The prompt-based methods operate with orders of magnitude fewer trainable parameters. On Whisper-small, Shared Entire SPT requires only 0.17M trainable parameters, and the separate LAPT variant achieves the best prompt-based CER at 12.07%. The paper reports analogous trends for Whisper-medium: full fine-tuning is best but expensive, whereas Entire SPT plus LAPT offers the strongest accuracy–efficiency trade-off among PEFT options (Yang et al., 16 Jun 2025).
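The scale of the parameter gap is easier to see as a ratio. The following sketch uses only the FFT and shared-prompt rows of the Whisper-small table and expresses each trainable-parameter count as a percentage of the FFT count. The helper name is illustrative.

```python
FFT_PARAMS_M = 240.58  # trainable parameters for full fine-tuning, in millions

# (average CER %, trainable parameters in millions) from the Whisper-small table.
METHODS = {
    "FFT": (8.87, 240.58),
    "Shared Entire SPT": (12.43, 0.17),
    "Shared Entire SPT + LAPT": (12.34, 0.96),
}

def trainable_fraction(params_m, full_params_m=FFT_PARAMS_M):
    """Trainable parameters as a percentage of the full fine-tuning count."""
    return 100.0 * params_m / full_params_m

summary = {
    name: (cer, trainable_fraction(params))
    for name, (cer, params) in METHODS.items()
}
for name, (cer, fraction) in summary.items():
    print(f"{name:<26} CER {cer:5.2f}%  trainable {fraction:7.3f}% of FFT")
```

Both shared prompt configurations train well under half a percent of what FFT updates, at a cost of roughly 3.5 points of absolute CER.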
These results support the central practical claim: multilingual language expansion can be realized as prompt-library growth with minimal computational overhead. The paper also argues that because the base Whisper weights and previously learned prompt sets are not modified when adding a new language, catastrophic forgetting on earlier languages is avoided by construction.
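The "by construction" argument can be made concrete with a small append-only store. This is a hypothetical sketch of the idea, not the SPT-Whisper API: the class and method names are invented for illustration, and prompts are represented as plain nested lists.

```python
import copy

class PromptLibrary:
    """Append-only store of per-language prompts around a frozen backbone."""

    def __init__(self):
        self._prompts = {}

    def add_language(self, language, encoder_prompt, decoder_prompt):
        """Register prompts for a new language; existing entries are never edited."""
        if language in self._prompts:
            raise ValueError(f"{language!r} is already registered")
        # Deep copies keep stored prompts isolated from later training loops.
        self._prompts[language] = (
            copy.deepcopy(encoder_prompt),
            copy.deepcopy(decoder_prompt),
        )

    def load(self, language):
        """Return copies of (encoder_prompt, decoder_prompt) for a language."""
        return copy.deepcopy(self._prompts[language])

    def languages(self):
        """List registered languages in sorted order."""
        return sorted(self._prompts)

library = PromptLibrary()
library.add_language("ast", [[0.1, 0.2]], [[0.3, 0.4]])
before = library.load("ast")
library.add_language("ckb", [[0.5, 0.6]], [[0.7, 0.8]])
after = library.load("ast")
print(before == after, library.languages())
```

Because adding a language only inserts a new entry, the prompts for earlier languages are bit-for-bit identical before and after expansion, and the backbone is never written to.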
6. Comparisons, interpretation, and significance
The comparison space in the paper includes Full Fine-Tuning, LoRA, decoder-only prompt tuning, encoder-only prompt tuning, Shared Entire SPT, Separate Entire SPT, and their language-aware extensions. Several patterns recur. First, FFT delivers the lowest CER but with maximal parameter and compute cost. Second, LoRA is a competitive PEFT baseline but still uses more trainable parameters than Shared Entire SPT and is weaker than the strongest prompt variants on the reported language-expansion tasks. Third, prompt placement matters: decoder-only prompting is stronger than encoder-only prompting, yet Entire SPT is stronger than decoder-only prompting once prompt length is sufficiently large. Fourth, language awareness matters: LAPT improves both shared and separate prompt configurations (Yang et al., 16 Jun 2025).
The paper does not provide a dedicated ablation that completely isolates language awareness from parameter count. It nevertheless notes that the performance gain from LAPT is larger than the modest increase in parameter count in the shared setting, for example from 0.17M to 0.96M parameters. This suggests that the improvement is driven substantially by cross-lingual structure and language-aware initialization rather than by parameter inflation alone. The same evidence also indicates that prompt expansion in multilingual ASR is not merely a storage trick; it is a representational strategy for organizing adaptation modules around language similarity.
More broadly, the work presents a design pattern for continual multilingual adaptation of encoder–decoder speech models: prompt both encoder and decoder, decompose prompt capacity into shared and language-specific components, and use the model’s own language-identification behavior to anchor new languages to nearby existing ones. Within the paper’s experimental scope, that pattern yields a compact and modular alternative to retraining Whisper whenever language coverage must grow.