LID-Posterior Driven LoRA Routing in mASR
- The paper introduces an end-to-end LID-posterior-driven routing mechanism that dynamically integrates language-specific LoRA experts with mHuBERT-CTC for single-pass, language-agnostic decoding.
- It details a hierarchical transformer structure where lower shared layers compute LID posteriors and upper layers apply soft-gated expert adaptation, optimizing both ASR and LID jointly.
- Empirical evaluations demonstrate that the method slightly improves word error rates and significantly boosts LID accuracy while halving the inference passes compared to two-stage systems.
LID-posterior-driven LoRA routing is an inference mechanism developed for hierarchical Low-Rank Adaptation Mixture-of-Experts (LoRA-MoE) architectures in multilingual automatic speech recognition (ASR). Operating within a Connectionist Temporal Classification (CTC) framework, this method enables language-agnostic, single-pass decoding by dynamically routing information through language-specific LoRA expert modules, gated by the model’s internally computed language identification (LID) posterior probabilities. This routing approach is central to the HLoRA framework, which is integrated into the mHuBERT-CTC backbone to achieve efficient and scalable domain adaptation for low-resource mASR while obviating the need for external language labels or multi-stage inference procedures (Zheng et al., 2 Jan 2026).
1. Computation of the LID Posterior in mHuBERT-CTC
The LID-posterior-driven routing begins by embedding a lightweight LID classifier at layer $k$ of the frozen mHuBERT-CTC encoder. Let $\mathbf{H}_k \in \mathbb{R}^{T \times d}$ denote the frame-level features after propagation through the shared LoRA-adapted lower layers. These features are mean-pooled across time,
$$\bar{\mathbf{H}}_k = \frac{1}{T}\sum_{t=1}^{T} \mathbf{H}_{k,t},$$
and the pooled vector is projected via a linear transformation followed by a softmax to yield the utterance-level LID posterior:
$$\mathbf{p} = \mathrm{softmax}\!\left(\bar{\mathbf{H}}_k \mathbf{W}_{\mathrm{LID}}^{\top} + \mathbf{b}_{\mathrm{LID}}\right).$$
This posterior serves as the soft routing distribution, governing expert utilization in later transformer layers (Zheng et al., 2 Jan 2026).
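The pooled-projection-softmax computation above can be sketched as follows; this is an illustrative toy, not the paper's code, and all dimensions except the 11 LID classes are assumed:

```python
import numpy as np

# Illustrative sketch of the LID posterior computation. T (frames) and
# d (hidden size) are hypothetical; the 11 LID classes follow the paper.
rng = np.random.default_rng(0)
T, d, L_langs = 50, 8, 11

H_k = rng.standard_normal((T, d))                # features after shared lower layers
W_lid = rng.standard_normal((L_langs, d)) * 0.1  # lightweight LID head weights
b_lid = np.zeros(L_langs)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

H_bar = H_k.mean(axis=0)               # mean-pool across time: (d,)
p = softmax(H_bar @ W_lid.T + b_lid)   # (L_langs,) soft routing distribution

assert np.isclose(p.sum(), 1.0)        # valid probability distribution
```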
2. Hierarchical LoRA-MoE Model Structure
The model partitions its transformer stack of $N$ total layers into $k$ shared lower layers and $N-k$ upper expert layers.
- The shared multilingual LoRA ($\Delta W_s$) is injected into the projections of the initial $k$ layers, with the standard low-rank parameterization:
  $\Delta W_s = B_s A_s$, where $B_s \in \mathbb{R}^{d \times r}$, $A_s \in \mathbb{R}^{r \times d}$, and rank $r \ll d$.
- Language-specific LoRA expert modules $\Delta W_L^i = B_i A_i$ ($i = 1, \dots, L$) are similarly structured and inserted in the upper layers $\ell = k+1, \dots, N$, providing adaptation capacity tailored to each supported language.
- For each layer $\ell > k$, the effective transformer weights are formed as:
  $$W_\ell = W_{\mathrm{base}} + \Delta W_s + \sum_{i=1}^{L} p_i \, \Delta W_L^i,$$
  where $p_i$ is the LID posterior for language $i$, serving as the expert gate.
This design enables the model to learn both language-invariant acoustic mappings and language-dependent feature refinements (Zheng et al., 2 Jan 2026).
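The posterior-gated weight composition for an upper layer can be sketched as below; all names and sizes are illustrative, not the paper's:

```python
import numpy as np

# Sketch: compose an upper-layer projection from the frozen base weight,
# the shared LoRA, and posterior-gated language experts. Toy dimensions.
rng = np.random.default_rng(1)
d, r, L_langs = 16, 2, 5

W_base = rng.standard_normal((d, d))              # frozen pretrained projection
B_s, A_s = rng.standard_normal((d, r)), rng.standard_normal((r, d))  # shared LoRA
experts = [(rng.standard_normal((d, r)), rng.standard_normal((r, d)))
           for _ in range(L_langs)]               # (B_i, A_i) per language
p = np.full(L_langs, 1.0 / L_langs)               # LID posterior (toy uniform value)

# ΔW_total = ΔW_s + Σ_i p_i · ΔW_L^i, then W_eff = W_base + ΔW_total
delta_total = B_s @ A_s + sum(p[i] * B @ A for i, (B, A) in enumerate(experts))
W_eff = W_base + delta_total
```

With a one-hot posterior this reduces to picking a single language expert; a soft posterior blends experts in proportion to the model's own language belief.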
3. LID-Posterior-Driven Routing: Soft-Gating and Inference Procedure
The routing mechanism employs a soft-gating mixture at each expert layer, weighted by the internal LID posterior. Specifically, at each layer $\ell > k$ the aggregate low-rank update is
$$\Delta W_{\mathrm{total}} = \Delta W_s + \sum_{i=1}^{L} p_i \, \Delta W_L^i.$$
The transformer block then operates with parameters $W_{\mathrm{base}} + \Delta W_{\mathrm{total}}$ for all remaining forward computation.
Single-pass inference can be summarized as:
```
Input: speech X
1. Extract features through the frozen CNN.
2. For ℓ = 1…k:  H_ℓ = TransformerLayer(H_{ℓ−1}; W_base, ΔW_s)
3. Aggregate H_k over time → H̄_k
4. Compute LID logits:  Y_LID = H̄_k·W_LIDᵀ + b_LID
5. Compute posterior:   p = softmax(Y_LID)
6. Initialize H = H_k
7. For ℓ = k+1…N:
     ΔW_total = ΔW_s + Σ_{i=1}^{L} p[i]·ΔW_L^i
     H ← TransformerLayer(H; W_base, ΔW_total)
8. Compute CTC score:   CTCHead(H; W_CTC_base + Σ_i p[i]·ΔW_CTC^i)
9. Decode best path with beam/greedy CTC
Output: transcription
```
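The single-pass procedure can be traced end to end with a toy numpy model, where each transformer layer is replaced by a stand-in nonlinear map and all sizes and weights are illustrative assumptions:

```python
import numpy as np

# Toy walk-through of single-pass LID-posterior-driven routing.
# Layers are stand-in nonlinear maps; all dimensions are illustrative.
rng = np.random.default_rng(2)
T, d, L_langs = 20, 8, 3
N, k = 4, 2                                    # toy depth and shared/expert split

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def layer(H, W):                               # stand-in for a transformer layer
    return np.tanh(H @ W)

W_base = [rng.standard_normal((d, d)) * 0.3 for _ in range(N)]
dW_s = rng.standard_normal((d, d)) * 0.01      # shared LoRA update (B_s A_s)
dW_exp = [rng.standard_normal((d, d)) * 0.01 for _ in range(L_langs)]
W_lid, b_lid = rng.standard_normal((L_langs, d)), np.zeros(L_langs)

H = rng.standard_normal((T, d))                # step 1: frozen-CNN features
for l in range(k):                             # step 2: shared lower layers
    H = layer(H, W_base[l] + dW_s)
p = softmax(H.mean(axis=0) @ W_lid.T + b_lid)  # steps 3-5: pooled LID posterior
for l in range(k, N):                          # steps 6-7: gated expert layers
    dW_total = dW_s + sum(p[i] * dW_exp[i] for i in range(L_langs))
    H = layer(H, W_base[l] + dW_total)
# Steps 8-9 (CTC head and decoding) are omitted from this sketch.
```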
4. Model Specification and Computational Properties
The standard configuration employs:
- $N$ transformer layers (mHuBERT-147), the lower $k$ of which are shared.
- $L = 5$ primary languages (EN-IN, FR, DE, JA, KO), with 6 additional “other” languages for LID (11 LID classes in total).
- LoRA rank $r$ for the projections and the CTC layer.
- Hidden dimension $d$.
- Parameter overhead: the shared LoRA adds 1.5M parameters and each language expert 0.2M, for roughly 5M extra parameters in total (102M overall vs. the 97M baseline).
Inference overhead is minimal: a single additional linear transform and softmax (for 11 languages), negligible compared to transformer computation. Critically, elimination of a second forward pass (required in two-stage systems) approximately halves encoder-side computational cost (Zheng et al., 2 Jan 2026).
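A back-of-the-envelope FLOP comparison supports the "negligible overhead" claim; the hidden size and frame count below are assumed for illustration, not taken from the paper:

```python
# Rough FLOP comparison: the extra LID head is one d x L projection of a
# single pooled vector, versus a transformer layer processing T frames.
# d and T are assumed toy values, not the paper's configuration.
d, T, L_langs = 768, 500, 11
lid_head_flops = 2 * d * L_langs          # one pooled-vector projection
per_layer_flops = 2 * T * d * d * 12      # ~12 d x d matmul-equivalents (rough)
ratio = lid_head_flops / per_layer_flops

assert ratio < 1e-4                       # routing head cost is negligible
```

By contrast, a two-stage system repeats the entire encoder stack for the second pass, which is why eliminating it roughly halves encoder-side cost.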
5. Decoding Efficiency and Empirical Accuracy
Evaluation on the MLC-SLM test set demonstrates that LID-posterior-driven routing achieves word error rates (WER) and LID accuracies on par with or exceeding prior two-stage LoRA-based adaptation, while substantially improving decoding efficiency. The empirical results are summarized in the following table:
| ID | System | LID Rep. | Passes | WER (%) |
|---|---|---|---|---|
| S4 | mHuBERT-CTC + two-stage LoRA | ✗ | 2 | 24.8 |
| S6 | mHuBERT-CTC-HLoRA (ours) | ✗ | 1 | 24.7 |
- HLoRA attains WER 24.7%, slightly better than two-stage (24.8%), with half the inference passes.
- LID accuracy increases from 90.1% (two-stage) to 97.9% (HLoRA) as a consequence of end-to-end joint optimization.
- Ablation over the shared/expert split point $k$ supports the chosen configuration as optimal (26.0% WER on dev/test) (Zheng et al., 2 Jan 2026).
6. Language-Agnostic Single-Pass Decoding
LID-posterior-driven LoRA routing leverages only the internally computed LID posterior, obviating the need for external language identity annotations or separate LID modules at inference. This approach eliminates error propagation between sequential LID/ASR phases, precludes manual language switching or metadata requirements, and preserves on-device, low-latency operation by requiring only a single forward pass. The resulting system embodies true language-agnostic multilingual ASR, jointly learning ASR and LID without auxiliary supervision or post-hoc rerouting (Zheng et al., 2 Jan 2026).