MamBo-3-Hydra-N3: Hybrid Audio Deepfake Detection
- MamBo-3-Hydra-N3 is a modular hybrid deep learning architecture that integrates pre-trained XLSR, bidirectional Hydra SSM blocks, and non-causal Transformer attention for audio deepfake detection.
- The configuration interleaves deep bidirectional state-space modeling with attention mechanisms to effectively capture both local features and global temporal dependencies across varied benchmarks.
- Empirical studies show that deeper stacking and increased backbone layers significantly reduce error rates and variance, enhancing robustness against diverse spoofing attacks.
The MamBo-3-Hydra-N3 configuration is a modular hybrid deep learning architecture designed for audio deepfake detection (ADD), integrating a pre-trained XLSR front-end with a scalable backbone that interleaves structured state-space modeling (Hydra) and non-causal Transformer attention. The configuration is characterized by deep bidirectional SSM blocks (Hydra-N3) and classic interleaving topology, enabling efficient modeling of both local and global temporal dependencies, and exhibiting strong performance and stability across in-domain and out-of-domain audio anti-spoofing benchmarks (Ng et al., 6 Jan 2026).
1. Architecture and Data Flow
MamBo-3-Hydra-N3 features a sequence of transformations from raw waveform input through hierarchical feature extraction and classification stages:
- Input Encoding: The raw 16 kHz audio waveform is processed by a pre-trained XLSR encoder, producing frame-wise features with dimensionality 1024.
- Feature Projection: RMSNorm and a linear projection layer transform the XLSR features to a hidden sequence $H \in \mathbb{R}^{T \times d}$ of model dimension $d$.
- MamBo-3 Backbone: The core encoder comprises $L$ MamBo-3 layers (default $L = 5$), each sequentially applying:
- a Hydra block (with stacking depth $N = 3$), providing deep bidirectional state-space processing,
- Pre-Norm followed by a non-causal Multi-Head Self-Attention (MHA) sublayer,
- a SwiGLU feed-forward network,
- residual connections encapsulating each submodule.
- Pooling and Classification: Utterance-level representations are obtained via gated attention pooling, followed by a final linear projection to binary logits for spoof/bona-fide discrimination.
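The projection, pooling, and classification stages above can be sketched in PyTorch. This is a minimal illustration under stated assumptions: the gated-pooling form (tanh content branch times sigmoid gate) is one common formulation, and the hidden width $d = 256$ and all module names are hypothetical, not taken from the paper.

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Root-mean-square normalization with a learned per-channel gain."""
    def __init__(self, d, eps=1e-6):
        super().__init__()
        self.g, self.eps = nn.Parameter(torch.ones(d)), eps
    def forward(self, x):
        return self.g * x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)

class GatedAttentionPooling(nn.Module):
    """Gated attention pooling in one common form: a tanh content branch
    gated by a sigmoid branch yields per-frame scores, softmax-normalized
    into weights over time (the paper's exact gating may differ)."""
    def __init__(self, d, d_attn=128):
        super().__init__()
        self.v, self.u = nn.Linear(d, d_attn), nn.Linear(d, d_attn)
        self.w = nn.Linear(d_attn, 1)
    def forward(self, h):                                   # h: (B, T, d)
        s = self.w(torch.tanh(self.v(h)) * torch.sigmoid(self.u(h)))
        return (torch.softmax(s, dim=1) * h).sum(dim=1)     # (B, d)

class SpoofHead(nn.Module):
    """RMSNorm + linear projection of XLSR features, gated attention
    pooling, and binary spoof/bona-fide logits (backbone omitted)."""
    def __init__(self, d_in=1024, d=256):
        super().__init__()
        self.norm, self.proj = RMSNorm(d_in), nn.Linear(d_in, d)
        self.pool, self.cls = GatedAttentionPooling(d), nn.Linear(d, 2)
    def forward(self, feats):                # feats: (B, T, 1024) from XLSR
        return self.cls(self.pool(self.proj(self.norm(feats))))

logits = SpoofHead()(torch.randn(2, 50, 1024))
print(logits.shape)  # torch.Size([2, 2])
```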
A single MamBo-3 layer executes the following update sequence, with Pre-Norm residual branches around each submodule:

$$u = x + \mathrm{Hydra}_{N}\!\big(\mathrm{Norm}(x)\big), \qquad v = u + \mathrm{MHA}\!\big(\mathrm{Norm}(u)\big), \qquad y = v + \mathrm{FFN}_{\mathrm{SwiGLU}}\!\big(\mathrm{Norm}(v)\big).$$
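The layer's update sequence can be sketched as follows. This is an illustrative composition only: the Hydra block is stood in for by $N$ stacked stages of a simplified bidirectional diagonal linear recurrence (not the paper's quasiseparable implementation), LayerNorm is used as the Pre-Norm, and all module names are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BiSSMStage(nn.Module):
    """One bidirectional state-space stage: the diagonal recurrence
    h_t = a*h_{t-1} + b*x_t is run in both directions and the two states
    concatenated -- a heavily simplified stand-in for Hydra's mixing."""
    def __init__(self, d):
        super().__init__()
        self.log_a = nn.Parameter(torch.zeros(d))
        self.b = nn.Parameter(torch.ones(d))
        self.out = nn.Linear(2 * d, d)

    def _scan(self, x):                      # causal left-to-right pass
        a = torch.sigmoid(self.log_a)        # keep the decay in (0, 1)
        h, hs = torch.zeros_like(x[:, 0]), []
        for t in range(x.size(1)):
            h = a * h + self.b * x[:, t]
            hs.append(h)
        return torch.stack(hs, dim=1)

    def forward(self, x):                    # x: (B, T, d)
        fwd = self._scan(x)
        bwd = self._scan(x.flip(1)).flip(1)  # anti-causal pass
        return self.out(torch.cat([fwd, bwd], dim=-1))

class SwiGLU(nn.Module):
    """SwiGLU feed-forward: Swish-gated branch times a linear branch."""
    def __init__(self, d, mult=2):
        super().__init__()
        self.w1, self.w2 = nn.Linear(d, mult * d), nn.Linear(d, mult * d)
        self.w3 = nn.Linear(mult * d, d)
    def forward(self, x):
        return self.w3(F.silu(self.w1(x)) * self.w2(x))

class MamBo3Layer(nn.Module):
    """Hydra (N stacked bidirectional stages) -> non-causal MHA -> SwiGLU
    FFN, each wrapped in a Pre-Norm residual branch."""
    def __init__(self, d, n_stages=3, n_heads=4):
        super().__init__()
        self.hydra = nn.Sequential(*[BiSSMStage(d) for _ in range(n_stages)])
        self.n1, self.n2, self.n3 = nn.LayerNorm(d), nn.LayerNorm(d), nn.LayerNorm(d)
        self.mha = nn.MultiheadAttention(d, n_heads, batch_first=True)
        self.ffn = SwiGLU(d)

    def forward(self, x):
        x = x + self.hydra(self.n1(x))
        q = self.n2(x)
        x = x + self.mha(q, q, q)[0]         # no causal mask: full-context attention
        return x + self.ffn(self.n3(x))

y = MamBo3Layer(d=64)(torch.randn(2, 20, 64))
print(y.shape)  # torch.Size([2, 20, 64])
```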
2. Topological Design and Parameterization
The designation "MamBo-3" refers to the layer topology: one SSM block (Hydra variant) paired with one non-causal Transformer block per composite backbone layer, forming a classic interleaving scheme. Each backbone encoder comprises $L$ repeated instances of this composite unit.
The "Hydra-N3" label specifies the SSM block’s intra-layer stacking; each Hydra comprises $N = 3$ sequentially composed bidirectional state-space stages. This deep stacking is intended to increase the receptive field and capacity for capturing long-range recurrences before information is processed by the attention mechanism.
3. Mathematical Formulation
3.1 Structured State-Space (Hydra) Block
Hydra is formulated as a quasiseparable linear recurrence, bidirectionally parameterized:
- For a sequence input $X = (x_1, \ldots, x_T)$, the $n$-th state-space stage combines a causal and an anti-causal pass:

$$h_t^{\rightarrow} = \bar{A}\,h_{t-1}^{\rightarrow} + \bar{B}\,x_t, \qquad h_t^{\leftarrow} = \bar{A}\,h_{t+1}^{\leftarrow} + \bar{B}\,x_t, \qquad y_t = C\big(h_t^{\rightarrow} \oplus h_t^{\leftarrow}\big),$$

where $\oplus$ indicates concatenation. Stacking $N = 3$ such SSM stages yields Hydra-N3.
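The bidirectional stage update can be traced numerically on a toy single-channel sequence; the scalar recurrence and the values of $a$ and $b$ below are illustrative only, not the model's learned parameters.

```python
import numpy as np

# One simplified bidirectional state-space stage on a single channel:
# the linear recurrence h_t = a*h_{t-1} + b*x_t is run left-to-right and
# right-to-left, and the two hidden states are concatenated per frame.
def scan(x, a=0.9, b=1.0):
    h, out = 0.0, []
    for xt in x:
        h = a * h + b * xt           # diagonal linear state update
        out.append(h)
    return np.array(out)

x = np.array([1.0, 0.0, 0.0, 2.0])   # toy frame sequence
fwd = scan(x)                        # causal pass: [1.0, 0.9, 0.81, 2.729]
bwd = scan(x[::-1])[::-1]            # anti-causal pass, read back in order
y = np.stack([fwd, bwd], axis=-1)    # (T, 2): forward and backward states
print(y.shape)  # (4, 2)
```

Note how the forward state at the final frame still carries a decayed trace of the first frame ($0.9^3 \approx 0.729$), which is the long-range mixing the stacked stages are meant to deepen.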
3.2 Attention and Feed-Forward Integration
The Transformer sublayer applies standard multi-head self-attention,

$$\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V,$$

computed non-causally over the full sequence. Residual mixing is applied stepwise, using Pre-Norm at each submodule boundary.
The feed-forward sublayer uses the SwiGLU activation,

$$\mathrm{FFN}(x) = W_3\big(\mathrm{Swish}(W_1 x) \odot W_2 x\big), \qquad \mathrm{Swish}(z) = z\,\sigma(z),$$

with $\sigma$ denoting the sigmoid function.
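The SwiGLU formula can be checked numerically on a toy vector; the weights here are random and purely illustrative.

```python
import torch
import torch.nn.functional as F

# SwiGLU on a toy vector: a Swish-gated branch multiplied elementwise by a
# linear branch, then projected back down. Random weights, illustration only.
torch.manual_seed(0)
d, d_ff = 4, 8
x = torch.randn(d)
W1, W2, W3 = torch.randn(d, d_ff), torch.randn(d, d_ff), torch.randn(d_ff, d)

swish = lambda z: z * torch.sigmoid(z)     # Swish(z) = z * sigmoid(z)
y = (swish(x @ W1) * (x @ W2)) @ W3        # SwiGLU feed-forward output

# torch.nn.functional.silu computes exactly Swish, so both forms agree:
y2 = (F.silu(x @ W1) * (x @ W2)) @ W3
print(torch.allclose(y, y2))  # True
```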
4. Depth Scaling and Stability Effects
Depth scaling in MamBo-3-Hydra-N3 is controlled by the number of MamBo-3 backbone layers ($L$, default 5, ablated up to 7) and the stacking depth of state-space stages per Hydra ($N = 3$). Empirical analysis demonstrates that increasing both $L$ and $N$:
- Leads to tighter clustering of top-5 checkpoint performances, reducing variance across independently trained runs,
- Results in lower sensitivity to previously unseen generative attack types, thus enhancing generalization,
- Mitigates high inter-checkpoint variance and instability observed in shallow configurations, supporting more stable optimization trajectories and inference (Ng et al., 6 Jan 2026).
A plausible implication is that the enhanced recurrence depth provided by deeper Hydra stacking more robustly captures global sequence artifacts pertinent to the audio deepfake detection task.
5. Empirical Assessment and Comparative Results
Performance of MamBo-3-Hydra-N3 (with $L = 5$, $N = 3$) was benchmarked on major ADD evaluation suites including ASVspoof 2021 LA, DF, In-the-Wild, and DFADD datasets, with models cross-tested without auxiliary data. Table 1 synthesizes the reported key results:
| Dataset | Metric | MamBo-3-Hydra-N3 | Best Prior | Relative Gain |
|---|---|---|---|---|
| ASV21LA | min t-DCF | 0.2072 | 0.2178¹ | +4.9% |
| ASV21LA | EER (%) | 0.81 | 0.93² | +12.9% |
| ASV21DF | EER (%) | 1.70 | 1.88³ | +9.6% |
| In-the-Wild | EER (%) | 4.97 | 6.71³ | +25.9% |
| DFADD D1–D3 | EER (%) | 1.84/1.33/0.00 | – | – |
| DFADD F1–F2 | EER (%) | 11.36/16.01 | – | – |
¹ RawMamba (N=1) on ASV21LA. ² XLSR-Mamba on ASV21LA. ³ Fake-Mamba (L). On the DFADD subsets, Hydra-N3 demonstrates near-zero EER on the easiest "D" test sets and higher, though still competitive, double-digit EER on the most challenging "F" subsets.
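The headline EER metric in Table 1 can be computed from raw detector scores with a simple threshold sweep; the scores and labels below are synthetic, and production toolkits typically interpolate the ROC rather than sweep discrete thresholds as done here.

```python
import numpy as np

# Equal error rate (EER): the operating point where the false-acceptance
# rate (spoof scored as bona fide) equals the false-rejection rate.
def compute_eer(scores, labels):
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)        # 1 = bona fide, 0 = spoof
    eer, best_gap = 1.0, np.inf
    for t in np.sort(np.unique(scores)):
        far = np.mean(scores[labels == 0] >= t)   # spoof accepted
        frr = np.mean(scores[labels == 1] < t)    # bona fide rejected
        if abs(far - frr) < best_gap:             # closest FAR/FRR crossing
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return float(eer)

labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]           # higher = more bona fide
print(compute_eer(scores, labels))  # 0.0 (perfectly separable toy scores)
```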
6. Analysis of Hydra Bidirectionality and Ablation Outcomes
Ablation studies sweeping the Hydra stacking parameter $N$ reveal performance gains from enhanced bidirectionality:
- $N = 1$ (causal-only): Underperforms substantially on all evaluated testbeds, particularly on out-of-domain DFADD–F2.
- $N = 2$: Improves EER by 10–15% relative to $N = 1$, indicating clear gains from limited bidirectionality.
- $N = 3$: Achieves the best and most stable performance metrics (EER/min t-DCF) across both in-domain (ASV21LA, DF) and out-of-domain (ITW, DFADD) evaluations.
These outcomes confirm that Hydra’s native, quasiseparable parameterization for bidirectional mixing is more effective at extracting global temporal artifacts than previous heuristic dual-branch or shallow-stacking alternatives (Ng et al., 6 Jan 2026). The structure alleviates typical challenges faced by purely causal SSM architectures in content-based artifact retrieval for speech anti-spoofing.
7. Significance and Application
MamBo-3-Hydra-N3 provides an effective hybrid backbone that combines the linear computational efficiency of SSMs with the expressive power of non-causal attention, specifically tailored for detecting subtle generative artifacts in spoofed speech. Its scalable, modular design supports robust generalization across unseen generative methods (diffusion, flow-matching), marking it as a competitive choice for current and future audio deepfake detection benchmarking (Ng et al., 6 Jan 2026).