
Mamba-based HuBERT Models

Updated 30 June 2025
  • Mamba-based HuBERT models are self-supervised speech systems that replace Transformer blocks with linear-time Mamba SSM layers to enhance long-sequence processing.
  • They improve real-time and streaming ASR by reducing memory usage while achieving superior phone purity and speaker feature separation.
  • Integrating efficient fusion and hybrid architectures, these models offer scalable speech representations for applications like live transcription and speech synthesis.

Mamba-based HuBERT models refer to self-supervised speech representation systems in which the core Transformer blocks of the original HuBERT architecture are replaced, in whole or in part, with Mamba blocks—specialized Selective State Space Model (SSM) layers that process sequences in linear time. This architectural shift leverages the computational and representational advantages of Mamba SSMs to enhance long-sequence modeling, improve efficiency for real-time and streaming automatic speech recognition (ASR), and generate higher-quality quantized speech units for subsequent downstream processing. Recent works systematically compare these models to Transformer-based HuBERT variants and analyze their performance, efficiency, and representational properties using both empirical and information-theoretic methods.

1. Model Architecture and Core Principles

Mamba-based HuBERT replaces the standard Transformer encoder layers of the HuBERT (Hidden-unit BERT) self-supervised learning framework with Mamba SSM blocks. The typical Mamba SSM formulation for sequential modeling is:

$$h_t = \overline{A}\, h_{t-1} + \overline{B}\, x_t, \qquad y_t = C h_t$$

where $h_t$ is the state at time $t$, $x_t$ is the input, and $\overline{A}, \overline{B}, C$ are parameter matrices. The key innovation over classical SSMs is that these matrices, as well as the time-step parameter $\Delta$, are made input-dependent:

$$B = f_B(x), \qquad C = f_C(x), \qquad \Delta = \mathrm{Broadcast}_D\!\left(f_\Delta(x)\right)$$

This enables the SSM to adaptively control memory and information flow, implementing selective forgetting and propagation mechanisms akin to gated RNNs but with state updates parallelizable across sequences.

Compared to the self-attention mechanism in Transformers, Mamba SSMs do not require pairwise computation over all input tokens. Instead, state updates are recurrent and depend only on the previous state and the current input, yielding linear scaling of both compute and memory with respect to input sequence length. In practice, Mamba-based HuBERT models can be constructed in both causal (uni-directional) and bidirectional forms; the former is especially suited to real-time streaming use cases.
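
To make the recurrence concrete, below is a minimal NumPy sketch of a causal selective-SSM loop, assuming a diagonal per-channel state matrix, a scalar broadcast $\Delta$, and illustrative layer sizes; it is an unoptimized reference implementation, not the fused parallel-scan kernel used by actual Mamba libraries.

```python
import numpy as np

def selective_ssm(x, A, W_B, W_C, W_delta):
    """Causal selective-SSM recurrence (illustrative reference loop).

    x:       (T, D) input sequence
    A:       (D, N) learned state matrix (diagonal per channel, negative real parts)
    W_B, W_C:(D, N) projections that make B and C input-dependent
    W_delta: (D,)   projection producing the scalar time step, broadcast over channels
    """
    T, D = x.shape
    N = A.shape[1]
    h = np.zeros((D, N))                            # hidden state per channel
    y = np.zeros((T, D))
    for t in range(T):
        xt = x[t]                                   # (D,)
        # Input-dependent parameters: B = f_B(x), C = f_C(x), Delta = f_Delta(x)
        B = xt @ W_B                                # (N,)
        C = xt @ W_C                                # (N,)
        dt = np.log1p(np.exp(xt @ W_delta))         # softplus keeps the step positive
        # Zero-order-hold discretization: A_bar = exp(Delta*A), B_bar ~ Delta*B
        A_bar = np.exp(dt * A)                      # (D, N)
        B_bar = dt * B                              # (N,)
        # h_t = A_bar h_{t-1} + B_bar x_t,  y_t = C h_t
        h = A_bar * h + B_bar[None, :] * xt[:, None]
        y[t] = h @ C
    return y

# Toy usage with hypothetical sizes
rng = np.random.default_rng(0)
T, D, N = 50, 8, 16
x = rng.standard_normal((T, D))
A = -np.exp(rng.standard_normal((D, N)))            # negative for stability
W_B = 0.1 * rng.standard_normal((D, N))
W_C = 0.1 * rng.standard_normal((D, N))
W_delta = 0.1 * rng.standard_normal(D)
print(selective_ssm(x, A, W_B, W_C, W_delta).shape)  # (50, 8)
```

Because the state update at each step depends only on the previous state and the current input, the same computation can be reorganized as a parallel scan during training while remaining strictly causal at inference.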

2. Empirical Performance and Computational Efficiency

Automatic Speech Recognition:

  • On long-context ASR (e.g., TEDLIUM3, document-level transcripts), Mamba-based HuBERT models outperform Transformer baselines in both memory efficiency and scalability. For example, in document-level ASR, the ExtBiMamba model achieves a WER of 11.08%, whereas a Transformer of comparable size fails with out-of-memory errors.
  • In streaming/causal settings, a Mamba base model with 78.2M parameters attains a WER of 15.77%, compared to 16.66% for a larger 94.7M parameter causal Transformer.

Probing Tasks (SUPERB):

  • On phoneme recognition (PR), speaker identification (SID), and other probing tasks, causal Mamba-based HuBERT models achieve lower error and higher speaker-aware performance than Transformers, especially in the small-model regime.
  • For example, in the causal setting: PR error 11.68% (Mamba) vs. 13.87% (Transformer); SID accuracy 73.07% vs. 60.04% (base-size models).

Computational Scaling:

  • Mamba-based HuBERT maintains flat (near-constant) MACs per second and real-time factor as input duration increases, whereas the Transformer incurs quadratic resource scaling and fails at input durations above roughly 80 s due to memory constraints (see the sketch after this list).
  • This linearity enables practical inference on multi-minute audio, document-level speech, or real-time streaming audio—domains where classic Transformer HuBERTs are infeasible.
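
As a back-of-the-envelope illustration of this scaling difference, the sketch below uses rough per-layer MAC formulas and hypothetical dimensions (hidden size, SSM state size, frame rate), not the exact HuBERT configuration:

```python
def attention_macs(T, d):
    """Rough self-attention MACs per layer: QK^T and AV each cost ~T^2*d; Q/K/V/O projections ~4*T*d^2."""
    return 2 * T * T * d + 4 * T * d * d

def ssm_macs(T, d, n):
    """Rough selective-SSM MACs per layer: each frame updates a (d, n) state (~T*d*n), plus ~4*T*d^2 projections."""
    return T * d * n + 4 * T * d * d

d, n = 768, 16                        # hypothetical hidden size and SSM state size
for seconds in (10, 80, 600):
    T = seconds * 50                  # ~50 feature frames per second in HuBERT-style front-ends
    ratio = attention_macs(T, d) / ssm_macs(T, d, n)
    print(f"{seconds:4d} s ({T} frames): attention/SSM MAC ratio ~ {ratio:.1f}x")
```

The attention cost grows with the square of the number of frames while the SSM cost grows linearly, which is why the gap widens from negligible at short utterances to large at document-level durations.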

3. Advances in Speech Representation and Downstream Flexibility

Mamba-based HuBERT models produce distinct, high-quality quantized speech representations:

  • Phone Purity: The alignment between quantized units and ground-truth phonemes (“phone purity”) is significantly higher for Mamba-based models, particularly in causal configurations, providing improved units for downstream generative or contrastive speech tasks.
  • Speaker Feature Separation: Through Canonical Correlation Analysis (CCA), Mamba models exhibit higher correlation with speaker embeddings than their Transformer counterparts, indicating stronger separation and representation of speaker identity at appropriate network layers.
  • Layer-wise Representational Trends: Lower Mamba layers encode speaker information, while higher layers focus on phonetic content—mirroring but often outperforming analogous Transformer HuBERT trends.

These properties make Mamba-based HuBERT models attractive for both speech unit extraction (serving as a front-end for unsupervised speech generation and recognition) and tasks requiring robust speaker-aware representations.
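
For reference, the phone-purity metric discussed above can be computed from frame-level unit IDs and phoneme labels roughly as follows. This is a minimal sketch using the common majority-phone definition; the variable names and toy alignment are illustrative and not taken from the cited papers.

```python
from collections import Counter, defaultdict

def phone_purity(unit_ids, phone_labels):
    """Fraction of frames whose unit's majority phoneme matches the frame's phoneme.

    unit_ids:     discrete unit IDs, one per frame (e.g., k-means cluster indices)
    phone_labels: ground-truth phoneme labels, aligned frame-by-frame
    """
    per_unit = defaultdict(Counter)
    for u, p in zip(unit_ids, phone_labels):
        per_unit[u][p] += 1
    # For each unit, count the frames covered by its most common phoneme
    matched = sum(c.most_common(1)[0][1] for c in per_unit.values())
    return matched / len(unit_ids)

# Toy usage with a hypothetical frame-level alignment
units = [3, 3, 3, 7, 7, 1, 1, 1, 1]
phones = ["AH", "AH", "T", "S", "S", "IY", "IY", "IY", "K"]
print(round(phone_purity(units, phones), 3))   # 0.778
```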

4. Information-Theoretic Characterization and Task Alignment

Analysis under the information bottleneck framework demonstrates that the learning dynamics and suitability of Mamba-based HuBERT models are closely tied to the nature of the downstream speech task:

  • Reconstruction vs. Classification: Mamba-based models innately excel in reconstruction tasks (e.g., spectrum recovery, enhancement), displaying a U-shaped mutual information (MI) curve across network depth (MI first decreases due to compression, then increases as input is reconstructed). For pure classification tasks such as ASR, the MI trend typically decreases monotonically unless an explicit decoder (e.g., a Conformer) is appended.
  • Hybrid Architectures: Augmenting Mamba-HuBERT with a downstream decoder module converts the MI curve to U-shaped and closes the performance gap with Transformer-based HuBERT, as shown by parallel reductions in WER and improved representation recovery.

Key equation: $I_i(X; T_i) = H(X) - H(X \mid T_i) = D_{\mathrm{KL}}\big(P(X, T_i) \,\Vert\, P(X)P(T_i)\big)$, the mutual information at layer $i$ between the input $X$ and the intermediate representation $T_i$.
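
As a rough illustration of how such layer-wise MI curves can be traced empirically, the sketch below uses a simple histogram (binning) estimator on paired scalar summaries of $X$ and $T_i$; the cited analyses may rely on different, more sophisticated estimators.

```python
import numpy as np

def binned_mutual_information(x, t, bins=16):
    """Histogram estimate of I(X; T) in nats from paired scalar summaries.

    x, t: 1-D arrays of equal length (e.g., per-frame projections of the input
          and of a given layer's activations)
    """
    joint, _, _ = np.histogram2d(x, t, bins=bins)
    p_xt = joint / joint.sum()                      # joint distribution P(X, T)
    p_x = p_xt.sum(axis=1, keepdims=True)           # marginal P(X)
    p_t = p_xt.sum(axis=0, keepdims=True)           # marginal P(T)
    mask = p_xt > 0
    # I(X;T) = D_KL( P(X,T) || P(X)P(T) )
    return float(np.sum(p_xt[mask] * np.log(p_xt[mask] / (p_x @ p_t)[mask])))

# Toy usage: track MI(X; T_i) across layers to look for the U-shaped profile
rng = np.random.default_rng(0)
x = rng.standard_normal(10_000)
layers = [x + s * rng.standard_normal(10_000) for s in (0.5, 2.0, 0.5)]  # compress, then recover
print([round(binned_mutual_information(x, t), 3) for t in layers])
```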

This information-theoretic approach provides insight into the task–architecture alignment, advocating pure Mamba for direct reconstruction, and reconstructor/classifier hybrids for tasks like ASR.

5. Integration Strategies, Multi-Resolution Fusion, and Model Variants

Several architectural and methodological advances can be directly applied or adapted to Mamba-based HuBERT systems:

  • Multiple Resolution (MR) Fusion: The MR-HuBERT protocol—combining independently pretrained HuBERT (now, Mamba-based) models at different temporal resolutions—remains valid. Parallel and hierarchical fusions are both feasible, leveraging Mamba’s efficiency to process high-resolution and long-duration audio representations without quadratic bottlenecks.

Example parallel fusion (for three Mamba-based HuBERTs at resolutions $R_1, R_2, R_3$):

$$X_{\mathrm{MR\text{-}P}} = \sum_{i=0}^{N} \left( w_{1,i}\, \mathrm{UP}_1(X_1^i) + w_{2,i}\, \mathrm{UP}_2(X_2^i) + w_{3,i}\, \mathrm{UP}_3(X_3^i) \right)$$

where $\mathrm{UP}_k$ upsamples each stream to a common output length and $w_{k,i}$ are learnable fusion weights (a code sketch follows after this list).

  • Cross-Architecture Adaptation and Hybridization: TransMamba and hybrid Mamba-Transformer models (such as Nemotron-H) introduce weight subcloning, adaptive bidirectional distillation, and cross-modal fusion modules. These allow knowledge transfer from pre-trained Transformer-based HuBERT to Mamba-based variants, enable multimodal representations (e.g., speech-text), and facilitate efficient scaling and compression (e.g., MiniPuzzle).
  • Compression and Quantization: Methods such as MiniPuzzle enable pruning and distillation of large hybrid models (e.g., Nemotron-H), allowing Mamba-heavy models to be reduced in size and memory footprint without sacrificing accuracy—enabling deployments on resource-constrained hardware.
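
Returning to the parallel-fusion equation above, the following is a minimal sketch of how three feature streams at different frame rates could be upsampled and combined with learnable weights; the frame rates, nearest-neighbor upsampling, and array shapes are illustrative assumptions rather than the MR-HuBERT reference implementation.

```python
import numpy as np

def upsample_to(x, target_len):
    """Nearest-neighbor upsampling of a (T, D) feature sequence to target_len frames."""
    idx = np.minimum((np.arange(target_len) * x.shape[0]) // target_len, x.shape[0] - 1)
    return x[idx]

def parallel_mr_fusion(streams, weights, target_len):
    """X_MR-P = sum_i sum_k w[k, i] * UP_k(X_k^i).

    streams: list over models k of lists over layers i of (T_k, D) arrays
    weights: (K, N) array of learnable fusion weights w_{k,i}
    """
    D = streams[0][0].shape[1]
    fused = np.zeros((target_len, D))
    for k, layers in enumerate(streams):
        for i, feats in enumerate(layers):
            fused += weights[k, i] * upsample_to(feats, target_len)
    return fused

# Toy usage: three hypothetical Mamba-HuBERT streams, 4 layers each, at different frame rates
rng = np.random.default_rng(0)
K, N_layers, D = 3, 4, 32
frame_counts = [500, 250, 125]                     # e.g., 50 Hz, 25 Hz, 12.5 Hz over 10 s
streams = [[rng.standard_normal((T, D)) for _ in range(N_layers)] for T in frame_counts]
weights = np.full((K, N_layers), 1.0 / (K * N_layers))   # would be learned in practice
print(parallel_mr_fusion(streams, weights, target_len=500).shape)   # (500, 32)
```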

6. Practical Applications, Performance, and Open Challenges

Applications:

  • Real-time and streaming ASR, supporting live transcription with low latency and high accuracy.
  • Long-document ASR, e.g., processing entire TED talks or multi-minute audio, which is infeasible with original Transformer-based HuBERT.
  • Speech unit extraction for speech synthesis, spoken language modeling, and low-resource speech recognition.
  • Speaker identification and diarization, based on superior disentanglement of speaker features.

Performance Benchmarks:

  • In causal settings, Mamba-based HuBERTs deliver lower WER and higher speaker/phoneme probe scores than Transformers at smaller parameter and compute budgets.
  • In document-level and high-context ASR, only Mamba-based (or Mamba-hybrid) architectures remain computationally tractable.

Research Directions and Limitations:

  • Scalability in Bidirectional Mamba: Base-scale bidirectional Mamba models currently lag behind their Transformer counterparts, prompting research into more scalable and stable bidirectional SSM designs.
  • Fusion and Alignment: Multi-resolution and hybrid fusion require careful handling of feature alignment across temporal scales and architectures, with future work needed on better upsampling/interpolation and parameter-sharing schemes.
  • Generalization: While high phone purity and task accuracy often co-occur, exceptions prompt development of improved diagnostic metrics for representation quality.

7. Summary Table: Comparison of Transformer and Mamba-based HuBERT Variants

| Aspect | Transformer HuBERT | Mamba-based HuBERT |
| --- | --- | --- |
| Encoder backbone | Transformer | State Space Model (Mamba) |
| Scaling w/ input length | $O(T^2)$ (quadratic) | $O(T)$ (linear) |
| Long-sequence input | Memory/compute bottleneck (OOM) | Efficient; negligible growth |
| Causal/streaming | Higher WER, larger models | Lower WER, smaller models |
| Phone purity / speaker info | Good | Superior for causal models |
| Hybrid/MR fusion | Supported | Supported, more efficient |
| Knowledge transfer | Native | Via adaptation, distillation |

References

Key supporting works include "An Exploration of Mamba for Speech Self-Supervised Models" (arXiv:2506.12606), "Rethinking Mamba in Speech Processing by Self-Supervised Models" (arXiv:2409.07273), architecture adaptation via "TransMamba" (arXiv:2502.15130), hybrid/efficient variants in "Nemotron-H" (arXiv:2504.03624), general SSM and Mamba analyses (arXiv:2406.16722), and foundational studies on multi-resolution HuBERT (arXiv:2306.01084).

In summary: Mamba-based HuBERT models establish a class of speech SSL systems characterized by linear-time efficiency, strong real-time and long-context modeling, and enhancements in speech representation quality, with considerable implications for scalable, robust, and adaptable speech AI.