MamBo-3-Hydra-N3: Hybrid Audio Deepfake Detection
- MamBo-3-Hydra-N3 is a modular hybrid deep learning architecture that integrates pre-trained XLSR, bidirectional Hydra SSM blocks, and non-causal Transformer attention for audio deepfake detection.
- The configuration interleaves deep bidirectional state-space modeling with attention mechanisms to effectively capture both local features and global temporal dependencies across varied benchmarks.
- Empirical studies show that deeper stacking and increased backbone layers significantly reduce error rates and variance, enhancing robustness against diverse spoofing attacks.
The MamBo-3-Hydra-N3 configuration is a modular hybrid deep learning architecture designed for audio deepfake detection (ADD), integrating a pre-trained XLSR front-end with a scalable backbone that interleaves structured state-space modeling (Hydra) and non-causal Transformer attention. The configuration is characterized by deep bidirectional SSM blocks (Hydra-N3) and classic interleaving topology, enabling efficient modeling of both local and global temporal dependencies, and exhibiting strong performance and stability across in-domain and out-of-domain audio anti-spoofing benchmarks (Ng et al., 6 Jan 2026).
1. Architecture and Data Flow
MamBo-3-Hydra-N3 features a sequence of transformations from raw waveform input through hierarchical feature extraction and classification stages:
- Input Encoding: The raw 16 kHz audio waveform is processed by a pre-trained XLSR encoder, producing frame-wise features with dimensionality 1024.
- Feature Projection: RMSNorm and a linear projection layer transform the XLSR features to a hidden sequence $H \in \mathbb{R}^{T \times d}$ of model dimension $d$.
- MamBo-3 Backbone: The core encoder comprises $L$ MamBo-3 layers (default $L = 5$), each sequentially applying:
- a Hydra block (with stacking depth $N = 3$), providing deep bidirectional state-space processing,
- Pre-Norm followed by a non-causal Multi-Head Self-Attention (MHA) sublayer,
- a SwiGLU feed-forward network,
- residual connections encapsulating each submodule.
- Pooling and Classification: Utterance-level representations are obtained via gated attention pooling, followed by a final linear projection to binary logits for spoof/bona-fide discrimination.
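The projection, pooling, and classification stages above can be sketched in PyTorch. This is a minimal illustration under stated assumptions: the gated-pooling form (tanh content branch times sigmoid gate) is one common formulation, and the hidden width $d = 256$ and all module names are hypothetical, not taken from the paper.

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Root-mean-square normalization with a learned per-channel gain."""
    def __init__(self, d, eps=1e-6):
        super().__init__()
        self.g, self.eps = nn.Parameter(torch.ones(d)), eps
    def forward(self, x):
        return self.g * x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)

class GatedAttentionPooling(nn.Module):
    """Gated attention pooling in one common form: a tanh content branch
    gated by a sigmoid branch yields per-frame scores, softmax-normalized
    into weights over time (the paper's exact gating may differ)."""
    def __init__(self, d, d_attn=128):
        super().__init__()
        self.v, self.u = nn.Linear(d, d_attn), nn.Linear(d, d_attn)
        self.w = nn.Linear(d_attn, 1)
    def forward(self, h):                                   # h: (B, T, d)
        s = self.w(torch.tanh(self.v(h)) * torch.sigmoid(self.u(h)))
        return (torch.softmax(s, dim=1) * h).sum(dim=1)     # (B, d)

class SpoofHead(nn.Module):
    """RMSNorm + linear projection of XLSR features, gated attention
    pooling, and binary spoof/bona-fide logits (backbone omitted)."""
    def __init__(self, d_in=1024, d=256):
        super().__init__()
        self.norm, self.proj = RMSNorm(d_in), nn.Linear(d_in, d)
        self.pool, self.cls = GatedAttentionPooling(d), nn.Linear(d, 2)
    def forward(self, feats):                # feats: (B, T, 1024) from XLSR
        return self.cls(self.pool(self.proj(self.norm(feats))))

logits = SpoofHead()(torch.randn(2, 50, 1024))
print(logits.shape)  # torch.Size([2, 2])
```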
A single MamBo-3 layer executes the following update sequence, with Pre-Norm residual branches around each submodule:

$$u = x + \mathrm{Hydra}_{N}\!\big(\mathrm{Norm}(x)\big), \qquad v = u + \mathrm{MHA}\!\big(\mathrm{Norm}(u)\big), \qquad y = v + \mathrm{FFN}_{\mathrm{SwiGLU}}\!\big(\mathrm{Norm}(v)\big).$$
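The layer's update sequence can be sketched as follows. This is an illustrative composition only: the Hydra block is stood in for by $N$ stacked stages of a simplified bidirectional diagonal linear recurrence (not the paper's quasiseparable implementation), LayerNorm is used as the Pre-Norm, and all module names are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BiSSMStage(nn.Module):
    """One bidirectional state-space stage: the diagonal recurrence
    h_t = a*h_{t-1} + b*x_t is run in both directions and the two states
    concatenated -- a heavily simplified stand-in for Hydra's mixing."""
    def __init__(self, d):
        super().__init__()
        self.log_a = nn.Parameter(torch.zeros(d))
        self.b = nn.Parameter(torch.ones(d))
        self.out = nn.Linear(2 * d, d)

    def _scan(self, x):                      # causal left-to-right pass
        a = torch.sigmoid(self.log_a)        # keep the decay in (0, 1)
        h, hs = torch.zeros_like(x[:, 0]), []
        for t in range(x.size(1)):
            h = a * h + self.b * x[:, t]
            hs.append(h)
        return torch.stack(hs, dim=1)

    def forward(self, x):                    # x: (B, T, d)
        fwd = self._scan(x)
        bwd = self._scan(x.flip(1)).flip(1)  # anti-causal pass
        return self.out(torch.cat([fwd, bwd], dim=-1))

class SwiGLU(nn.Module):
    """SwiGLU feed-forward: Swish-gated branch times a linear branch."""
    def __init__(self, d, mult=2):
        super().__init__()
        self.w1, self.w2 = nn.Linear(d, mult * d), nn.Linear(d, mult * d)
        self.w3 = nn.Linear(mult * d, d)
    def forward(self, x):
        return self.w3(F.silu(self.w1(x)) * self.w2(x))

class MamBo3Layer(nn.Module):
    """Hydra (N stacked bidirectional stages) -> non-causal MHA -> SwiGLU
    FFN, each wrapped in a Pre-Norm residual branch."""
    def __init__(self, d, n_stages=3, n_heads=4):
        super().__init__()
        self.hydra = nn.Sequential(*[BiSSMStage(d) for _ in range(n_stages)])
        self.n1, self.n2, self.n3 = nn.LayerNorm(d), nn.LayerNorm(d), nn.LayerNorm(d)
        self.mha = nn.MultiheadAttention(d, n_heads, batch_first=True)
        self.ffn = SwiGLU(d)

    def forward(self, x):
        x = x + self.hydra(self.n1(x))
        q = self.n2(x)
        x = x + self.mha(q, q, q)[0]         # no causal mask: full-context attention
        return x + self.ffn(self.n3(x))

y = MamBo3Layer(d=64)(torch.randn(2, 20, 64))
print(y.shape)  # torch.Size([2, 20, 64])
```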
2. Topological Design and Parameterization
The designation "MamBo-3" refers to the layer topology: one SSM block (Hydra variant) paired with one non-causal Transformer block per composite backbone layer, forming a classic interleaving scheme. Each backbone encoder comprises $L$ repeated instances of this composite unit.
The "Hydra-N3" label specifies the SSM block’s intra-layer stacking; each Hydra comprises $N = 3$ sequentially composed bidirectional state-space stages. This deep stacking is intended to increase the receptive field and capacity for capturing long-range recurrences before information is processed by the attention mechanism.
3. Mathematical Formulation
3.1 Structured State-Space (Hydra) Block
Hydra is formulated as a quasiseparable linear recurrence, bidirectionally parameterized:
- For a sequence input $X = (x_1, \ldots, x_T)$, the $n$-th state-space stage combines a causal and an anti-causal pass:

$$h_t^{\rightarrow} = \bar{A}\,h_{t-1}^{\rightarrow} + \bar{B}\,x_t, \qquad h_t^{\leftarrow} = \bar{A}\,h_{t+1}^{\leftarrow} + \bar{B}\,x_t, \qquad y_t = C\big(h_t^{\rightarrow} \oplus h_t^{\leftarrow}\big),$$

where $\oplus$ indicates concatenation. Stacking $N = 3$ such SSM stages yields Hydra-N3.
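The bidirectional stage update can be traced numerically on a toy single-channel sequence; the scalar recurrence and the values of $a$ and $b$ below are illustrative only, not the model's learned parameters.

```python
import numpy as np

# One simplified bidirectional state-space stage on a single channel:
# the linear recurrence h_t = a*h_{t-1} + b*x_t is run left-to-right and
# right-to-left, and the two hidden states are concatenated per frame.
def scan(x, a=0.9, b=1.0):
    h, out = 0.0, []
    for xt in x:
        h = a * h + b * xt           # diagonal linear state update
        out.append(h)
    return np.array(out)

x = np.array([1.0, 0.0, 0.0, 2.0])   # toy frame sequence
fwd = scan(x)                        # causal pass: [1.0, 0.9, 0.81, 2.729]
bwd = scan(x[::-1])[::-1]            # anti-causal pass, read back in order
y = np.stack([fwd, bwd], axis=-1)    # (T, 2): forward and backward states
print(y.shape)  # (4, 2)
```

Note how the forward state at the final frame still carries a decayed trace of the first frame ($0.9^3 \approx 0.729$), which is the long-range mixing the stacked stages are meant to deepen.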
3.2 Attention and Feed-Forward Integration
The Transformer sublayer applies standard multi-head self-attention,

$$\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V,$$

computed non-causally over the full sequence. Residual mixing is applied stepwise, using Pre-Norm at each submodule boundary.
The feed-forward sublayer uses the SwiGLU activation,

$$\mathrm{FFN}(x) = W_3\big(\mathrm{Swish}(W_1 x) \odot W_2 x\big), \qquad \mathrm{Swish}(z) = z\,\sigma(z),$$

with $\sigma$ denoting the sigmoid function.
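The SwiGLU formula can be checked numerically on a toy vector; the weights here are random and purely illustrative.

```python
import torch
import torch.nn.functional as F

# SwiGLU on a toy vector: a Swish-gated branch multiplied elementwise by a
# linear branch, then projected back down. Random weights, illustration only.
torch.manual_seed(0)
d, d_ff = 4, 8
x = torch.randn(d)
W1, W2, W3 = torch.randn(d, d_ff), torch.randn(d, d_ff), torch.randn(d_ff, d)

swish = lambda z: z * torch.sigmoid(z)     # Swish(z) = z * sigmoid(z)
y = (swish(x @ W1) * (x @ W2)) @ W3        # SwiGLU feed-forward output

# torch.nn.functional.silu computes exactly Swish, so both forms agree:
y2 = (F.silu(x @ W1) * (x @ W2)) @ W3
print(torch.allclose(y, y2))  # True
```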
4. Depth Scaling and Stability Effects
Depth scaling in MamBo-3-Hydra-N3 is controlled by the number of MamBo-3 backbone layers ($L$, default 5, ablated up to 7) and the stacking depth of state-space stages per Hydra ($N = 3$). Empirical analysis demonstrates that increasing both $L$ and $N$:
- Leads to tighter clustering of top-5 checkpoint performances, reducing variance across independently trained runs,
- Results in lower sensitivity to previously unseen generative attack types, thus enhancing generalization,
- Mitigates high inter-checkpoint variance and instability observed in shallow configurations, supporting more stable optimization trajectories and inference (Ng et al., 6 Jan 2026).
A plausible implication is that the enhanced recurrence depth provided by deeper Hydra stacking more robustly captures global sequence artifacts pertinent to the audio deepfake detection task.
5. Empirical Assessment and Comparative Results
Performance of MamBo-3-Hydra-N3 (with $L = 5$, $N = 3$) was benchmarked on major ADD evaluation suites including ASVspoof 2021 LA, DF, In-the-Wild, and DFADD datasets, with models cross-tested without auxiliary data. Table 1 synthesizes the reported key results:
| Dataset | Metric | MamBo-3-Hydra-N3 | Best Prior | Relative Gain |
|---|---|---|---|---|
| ASV21LA | min t-DCF | 0.2072 | 0.2178¹ | +4.9% |
| ASV21LA | EER (%) | 0.81 | 0.93² | +12.9% |
| ASV21DF | EER (%) | 1.70 | 1.88³ | +9.6% |
| In-the-Wild | EER (%) | 4.97 | 6.71³ | +25.9% |
| DFADD D1–D3 | EER (%) | 1.84/1.33/0.00 | – | – |
| DFADD F1–F2 | EER (%) | 11.36/16.01 | – | – |
¹ RawMamba (N=1) on ASV21LA. ² XLSR-Mamba on ASV21LA. ³ Fake-Mamba (L). On the DFADD subsets, Hydra-N3 demonstrates near-zero EER on the easiest "D" test sets and higher, though still competitive, double-digit EER on the most challenging "F" subsets.
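The headline EER metric in Table 1 can be computed from raw detector scores with a simple threshold sweep; the scores and labels below are synthetic, and production toolkits typically interpolate the ROC rather than sweep discrete thresholds as done here.

```python
import numpy as np

# Equal error rate (EER): the operating point where the false-acceptance
# rate (spoof scored as bona fide) equals the false-rejection rate.
def compute_eer(scores, labels):
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)        # 1 = bona fide, 0 = spoof
    eer, best_gap = 1.0, np.inf
    for t in np.sort(np.unique(scores)):
        far = np.mean(scores[labels == 0] >= t)   # spoof accepted
        frr = np.mean(scores[labels == 1] < t)    # bona fide rejected
        if abs(far - frr) < best_gap:             # closest FAR/FRR crossing
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return float(eer)

labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]           # higher = more bona fide
print(compute_eer(scores, labels))  # 0.0 (perfectly separable toy scores)
```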
6. Analysis of Hydra Bidirectionality and Ablation Outcomes
Ablation studies sweeping the Hydra stacking parameter $N$ reveal performance gains from enhanced bidirectionality:
- $N = 1$ (causal-only): Underperforms substantially on all evaluated testbeds, particularly on out-of-domain DFADD–F2.
- $N = 2$: Improves EER by 10–15% relative to $N = 1$, indicating clear gains from limited bidirectionality.
- $N = 3$: Achieves the best and most stable performance metrics (EER/min t-DCF) across both in-domain (ASV21LA, DF) and out-of-domain (ITW, DFADD) evaluations.
These outcomes confirm that Hydra’s native, quasiseparable parameterization for bidirectional mixing is more effective at extracting global temporal artifacts than previous heuristic dual-branch or shallow-stacking alternatives (Ng et al., 6 Jan 2026). The structure alleviates typical challenges faced by purely causal SSM architectures in content-based artifact retrieval for speech anti-spoofing.
7. Significance and Application
MamBo-3-Hydra-N3 provides an effective hybrid backbone that combines the linear computational efficiency of SSMs with the expressive power of non-causal attention, specifically tailored for detecting subtle generative artifacts in spoofed speech. Its scalable, modular design supports robust generalization across unseen generative methods (diffusion, flow-matching), marking it as a competitive choice for current and future audio deepfake detection benchmarking (Ng et al., 6 Jan 2026).