Weight Factorization and Centralization for Continual Learning in Speech Recognition (2506.16574v1)

Published 19 Jun 2025 in cs.CL, cs.SD, and eess.AS

Abstract: Modern neural network based speech recognition models are required to continually absorb new data without re-training the whole system, especially in downstream applications using foundation models, having no access to the original training data. Continually training the models in a rehearsal-free, multilingual, and language agnostic condition, likely leads to catastrophic forgetting, when a seemingly insignificant disruption to the weights can destructively harm the quality of the models. Inspired by the ability of human brains to learn and consolidate knowledge through the waking-sleeping cycle, we propose a continual learning approach with two distinct phases: factorization and centralization, learning and merging knowledge accordingly. Our experiments on a sequence of varied code-switching datasets showed that the centralization stage can effectively prevent catastrophic forgetting by accumulating the knowledge in multiple scattering low-rank adapters.

Summary

  • The paper introduces a dual-stage strategy that uses low-rank adapters and periodic centralization to mitigate catastrophic forgetting in speech models.
  • It demonstrates significant improvements with an 11.7% gain in backward transfer and a 24.6% relative reduction in WER on code-switching benchmarks.
  • The approach is rehearsal-free and scalable, making it suitable for adapting large pretrained models in multilingual and privacy-sensitive settings.

Weight Factorization and Centralization for Continual Learning in Speech Recognition

The paper introduces a continual learning framework for speech recognition designed to address the challenges of model adaptation in a rehearsal-free, multilingual, and language-agnostic setting. It proposes a dual-stage strategy: a factorization phase that allocates new data streams to dedicated low-rank adapters, and a centralization phase that merges the accumulated knowledge from these adapters into the shared base model. The approach targets issues inherent to continual learning, particularly the catastrophic forgetting that arises when adapting large-scale pretrained models to new, potentially code-switched speech data without access to the original training data.

Problem Context and Motivation

Modern foundation models for speech recognition, such as Whisper, demonstrate robust multilingual performance but still require efficient adaptation to domain- or language-specific datasets, often involving code-switched speech or low-resource languages. Traditional continual learning techniques rely on rehearsal (reusing previous data), explicit regularization (e.g., EWC), or architectural expansion (e.g., expert modules or prompts). For foundation models whose original training data is unavailable, only parameter-efficient and data-free approaches are tenable.

The paper draws inspiration from the consolidation processes in the human brain, dividing learning into a "waking" (factorization) period—where new knowledge is collected—and a "sleeping" (centralization) phase—where knowledge is systematically integrated.

Methodological Approach

Factorization Phase:

  • Incoming data streams, corresponding to different languages, domains, or temporal segments, are assigned individual low-rank adapters based on LoRA.
  • Each adapter augments the base model with efficient, learnable updates restricted to rank-constrained matrices in selected layers (typically the key and query projections in Transformers); a minimal sketch follows this list.
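
A minimal sketch of such an adapter on a single projection is given below. The rank, scaling factor, and initialization are illustrative assumptions, not values taken from the paper.

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a frozen projection with a trainable rank-r update: y = W x + (alpha/r) * B A x."""

    def __init__(self, base_linear: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base_linear
        for p in self.base.parameters():                      # base projection stays frozen
            p.requires_grad_(False)
        d_out, d_in = base_linear.weight.shape
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)    # rank-r factor
        self.B = nn.Parameter(torch.zeros(d_out, r))          # zero init: adapter starts as a no-op
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * ((x @ self.A.T) @ self.B.T)

    def delta_weight(self):
        # Low-rank weight update that the factorization phase accumulates per dataset
        return self.scale * (self.B @ self.A)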

Centralization Phase:

  • Periodically, after every K incoming datasets, the last K adapters are averaged and merged back into the shared base model parameters; a sketch of this averaging and merging follows this list.
  • This process resembles stochastic weight averaging and draws on Bayesian intuition: averaging Gaussian-distributed weights decreases variance, promoting weight centrality and sparsity, and reducing the risk of drifting far from the base model.
  • Weight decay/regularization encourages further sparsity and reduced deviation from initial weights.
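
The averaging and merging steps (the average_adapters and merge_adapter calls in the pseudocode below) could be realized along the following lines, assuming each adapter is stored as a dictionary mapping layer names to its low-rank weight update; the optional decay term is our simplification of the weight-decay regularization mentioned above.

import torch

@torch.no_grad()
def average_adapters(adapters):
    """Average the low-rank updates of the last K adapters, layer by layer."""
    averaged = {}
    for name in adapters[0]:                                  # adapter: {layer_name: delta_W}
        averaged[name] = torch.stack([a[name] for a in adapters]).mean(dim=0)
    return averaged

@torch.no_grad()
def merge_adapter(base_state, avg_delta, decay=0.0):
    """Fold the averaged update into the shared base weights, optionally shrunk toward zero."""
    for name, delta in avg_delta.items():
        base_state[name] += (1.0 - decay) * delta
    return base_state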

The following pseudocode summarizes the core procedure:

adapters = []                            # LoRA adapters accumulated during factorization
for t, D_t in enumerate(datasets):       # datasets: the N incoming datasets D_1, ..., D_N
    # Factorization: fit a new LoRA adapter on the incoming dataset D_t
    adapter_t = train_adapter(base_model, D_t)
    adapters.append(adapter_t)

    # Centralization: every K datasets, average the last K adapters
    # and merge the result back into the shared base model
    if (t + 1) % K == 0:
        avg_adapter = average_adapters(adapters[-K:])
        base_model = merge_adapter(base_model, avg_adapter)
        # Optionally discard or archive old adapters to save memory

The method generalizes to non-discrete dataset boundaries by partitioning the incoming data by time window or by sample count.
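
One simple realization of such sample-based partitioning is to chunk a continuous utterance stream into fixed-size pseudo-datasets, each playing the role of one D_t above; the chunk size here is an arbitrary illustration.

from itertools import islice

def sample_based_partitions(stream, samples_per_adapter=50_000):
    """Group a continuous stream of utterances into fixed-size pseudo-datasets."""
    while True:
        chunk = list(islice(stream, samples_per_adapter))
        if not chunk:
            break
        yield chunk                      # each chunk is trained into its own adapter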

Experimental Results

Experiments are conducted on six code-switching datasets for forward evaluation (ArZen, Fisher, SEAME, TunSwitch, ASCEND, and TalCS), with backward evaluation on monolingual benchmarks in English, German, Arabic, Turkish, Mandarin, and Spanish. The adaptation target is the Whisper foundation model.
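
A sketch of how forward and backward evaluation could be scored is shown below; jiwer is one common WER implementation, and model.transcribe is a placeholder for whatever decoding interface the adapted model exposes, so both are assumptions rather than the paper's actual pipeline.

import jiwer

def score_dataset(model, dataset):
    """Word error rate of `model` on a list of (audio, reference_text) pairs."""
    references, hypotheses = [], []
    for audio, reference in dataset:
        references.append(reference)
        hypotheses.append(model.transcribe(audio))   # placeholder decoding call
    return jiwer.wer(references, hypotheses)

# Forward evaluation: WER on the code-switching sets adapted to so far.
# Backward evaluation: WER on held-out monolingual benchmarks, to quantify forgetting.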

Key findings:

  • Catastrophic forgetting is observed when adapters are added for new tasks: e.g., the addition of adapters for Arabic-English or Mandarin-English code-switching dramatically increases WER/CER in unrelated languages (over 100% error in some cases).
  • Centralization substantially recovers or even improves backward transfer: After merging adapters in the centralization step, the base model’s error rates on previously seen and unseen monolingual test sets approach or surpass the original performance. In the "2nd Centralization" model, average backward evaluation improves by 11.7% over the base Whisper model.
  • Forward performance: On code-switching benchmarks, the centralized model achieves a 24.6% relative reduction in WER compared to the base model, outperforming both standard fine-tuning and rehearsal-free continual learning baselines leveraging stochastic weight averaging and distillation (SWADT).

Comparison with competitive baselines demonstrates the efficiency of the factorization-centralization pipeline, yielding comparable or superior trade-offs between stability (knowledge retention) and plasticity (learning new conditions). Notably, the method matches or exceeds average performance of rehearsal-free and regularization-based continual learning approaches, without ever accessing old training data.

Practical Implications and Theoretical Considerations

Practical implications:

  • The proposed framework is data-agnostic and compatible with black-box foundation models, making it directly deployable in commercial and privacy-sensitive environments.
  • Low-rank adapters keep the approach scalable: each adaptation event adds only marginal storage and computation overhead (a rough estimate follows this list).
  • The only meaningful hyperparameter is the number of adapters per centralization cycle (K), simplifying operational tuning and deployment.
  • Memory cost is controlled: in practice, only the most recently trained K adapters and the merged base model must be stored.
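
As a back-of-the-envelope illustration of that overhead, assume rank-8 LoRA on the key and query projections of a Whisper-large-sized model (64 Transformer blocks of width 1280, counting one attention module per block); all of these numbers are illustrative assumptions rather than the paper's configuration.

def lora_param_count(n_blocks=64, d_model=1280, rank=8, projections_per_block=2):
    """Parameters added by one adapter: an A (rank x d_model) and a B (d_model x rank) per projection."""
    return n_blocks * projections_per_block * 2 * rank * d_model

adapter_params = lora_param_count()              # ~2.6M parameters per adapter
base_params = 1_550_000_000                      # Whisper large is roughly 1.55B parameters
print(f"per-adapter overhead: {adapter_params / base_params:.3%}")   # well under 1% of the base model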

Theoretical implications:

  • The empirical demonstration of positive backward transfer via centralization challenges the standard expectation that rehearsal-free methods can only mitigate, but not reverse, forgetting.
  • The Gaussian averaging perspective provides a theoretical explanation for reduced drift and increased sparsity after centralization, promoting robust parameter reuse and minimal interference (made concrete in the sketch after this list).
  • Training new adapters on the centralized model implicitly induces a curriculum in parameter space, potentially enabling continual transfer in domains with non-discrete or overlapping task boundaries.
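
To make the Gaussian-averaging point concrete (our reading of the intuition, not a derivation quoted from the paper): if each adapter update is modeled as an i.i.d. perturbation \Delta W_k \sim \mathcal{N}(\mu, \sigma^2), then the merged update satisfies

\overline{\Delta W} = \frac{1}{K} \sum_{k=1}^{K} \Delta W_k \sim \mathcal{N}\!\left(\mu, \frac{\sigma^2}{K}\right),

so averaging preserves the expected shift while shrinking its variance by a factor of K, which is the sense in which centralization limits drift away from the base model.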

Limitations and Future Directions

  • The framework, while effective, achieves about 58% of the maximum improvement possible under joint full-data fine-tuning, indicating room for optimization in merging, regularization, or adapter design.
  • Current experimentation focuses on batch-wise continual learning; truly online adaptation (instant deployment per sample or micro-batch) requires further investigation.
  • The impact of different adapter allocation strategies (e.g., content-based versus strictly temporal) on linguistic or domain boundaries merits further analysis.
  • Future work could generalize the centralization concept to task-free settings, more granular data flows, or meta-learning regimes for flexible cross-domain adaptation.

Conclusion

This work provides a practical and theoretically motivated strategy for continual learning in large neural speech recognition models. By combining parameter-efficient low-rank adaptation with periodic, variance-minimizing centralization, the method robustly balances knowledge retention and plasticity across heterogeneous, rehearsal-free data streams. The approach is well-aligned with the realities of foundation model adaptation and offers a scalable path for incremental language, domain, or user personalization in speech systems. Robustness to catastrophic forgetting and empirical evidence of positive backward transfer highlight the potential for broad adoption in continual learning scenarios within speech and beyond.
