Dilated FAVOR Conformer Architecture
- The paper introduces an architecture that fuses Conv-TasNet’s filterbank with Conformer-inspired masking using FAVOR+ linear attention to achieve end-to-end denoising with linear complexity.
- Dilated FAVOR Conformer employs 1D depthwise-separable dilated convolutions to capture local context while leveraging FAVOR+ attention for global sequence modeling.
- A subsequent refinement replaces the random-feature approximation with Hydra bidirectional state-space models, improving accuracy on long-range speech enhancement tasks.
The Dilated FAVOR Conformer (DF-Conformer) is an efficient neural architecture for speech enhancement that unifies Conv-TasNet’s filterbank and the Conformer’s sophisticated masking module. DF-Conformer attains linear computational complexity via two central mechanisms: (i) FAVOR+ linear-attention in place of standard multi-head self-attention, and (ii) 1D depthwise-separable dilated convolution. This enables joint local and global context modeling in end-to-end denoising pipelines and allows practical scaling to long input sequences. Subsequent advancements employ bidirectional structured state-space models, notably the Hydra mixer, in place of FAVOR+ to eliminate random-feature approximations while maintaining linear complexity and further improving global sequence modeling (Koizumi et al., 2021, Seki et al., 4 Nov 2025).
1. Architectural Overview
DF-Conformer integrates the Conv-TasNet filterbank encoder–decoder with a Conformer-inspired, block-stacked mask prediction network. The dataflow is as follows:
- Encoder: A 1D convolutional analysis filterbank extracts non-negative features $X$ from the input waveform $x$. The window length is 2.5 ms (40 samples), the stride is 1.25 ms (20 samples), and the filterbank comprises $N$ learned filters.
- Mask Prediction Network: The feature map is projected to $D$ channels and passed through a stack of $B$ DF-Conformer blocks (e.g., $B = 8$ or $B = 12$, as indicated by the model suffix). Each block uses exponentially increasing dilation for local context and applies FAVOR+ self-attention for global context. The output mask is obtained via a final dense layer and sigmoid activation.
- Decoder: The mask is applied elementwise to the encoded features, and a transposed convolutional synthesis filterbank reconstructs the enhanced waveform $\hat{x}$.
Each DF-Conformer block interleaves macaron-style feedforward layers, FAVOR+ multi-head self-attention, and 1D depthwise dilated convolution. The block-wise design allows the receptive field to grow exponentially, periodically resetting, and covers long-range dependencies (Koizumi et al., 2021).
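To make this dataflow concrete, the following is a minimal PyTorch sketch of the filterbank encoder, masking, and synthesis decoder. The layer sizes (`n_filters`, `d_model`) and the placeholder mask network are illustrative assumptions; in the actual DF-Conformer the mask network is the stack of blocks described in the following sections.

```python
import torch
import torch.nn as nn

class FilterbankMaskingSE(nn.Module):
    """Illustrative Conv-TasNet-style encoder/decoder wrapped around a mask network."""

    def __init__(self, n_filters=256, win=40, hop=20, d_model=256, mask_net=None):
        super().__init__()
        # Analysis filterbank: 2.5 ms window (40 samples) with 1.25 ms hop (20 samples).
        self.encoder = nn.Conv1d(1, n_filters, kernel_size=win, stride=hop, bias=False)
        # Mask prediction network (a stack of DF-Conformer blocks in the paper);
        # any module mapping (B, n_filters, T) -> (B, n_filters, T) fits here.
        self.mask_net = mask_net or nn.Sequential(
            nn.Conv1d(n_filters, d_model, 1), nn.ReLU(),
            nn.Conv1d(d_model, n_filters, 1),
        )
        # Synthesis filterbank (transposed convolution).
        self.decoder = nn.ConvTranspose1d(n_filters, 1, kernel_size=win, stride=hop, bias=False)

    def forward(self, wav):                         # wav: (B, 1, samples)
        feats = torch.relu(self.encoder(wav))       # non-negative features (B, N, T)
        mask = torch.sigmoid(self.mask_net(feats))  # bounded mask in (0, 1)
        return self.decoder(feats * mask)           # enhanced waveform estimate

# Usage: enhance a batch of 1-second, 16 kHz waveforms.
model = FilterbankMaskingSE()
enhanced = model(torch.randn(2, 1, 16000))
```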
2. FAVOR+ Linear-Complexity Attention
Traditional multi-head self-attention in the Conformer and Transformer incurs $\mathcal{O}(L^2)$ time and memory for a length-$L$ sequence. FAVOR+ addresses this bottleneck by replacing the softmax kernel in scaled dot-product attention with a positive random-feature map $\phi$. This map satisfies

$$\exp(q^{\top}k) = \mathbb{E}_{\omega \sim \mathcal{N}(0, I_d)}\left[\phi(q)^{\top}\phi(k)\right], \qquad \phi(x) = \frac{\exp\left(-\lVert x\rVert^{2}/2\right)}{\sqrt{M}}\left[\exp(\omega_1^{\top}x), \ldots, \exp(\omega_M^{\top}x)\right]^{\top},$$

where $\omega_1, \ldots, \omega_M$ are orthogonal Gaussian vectors. The FAVOR+ approximation to attention is

$$\widehat{\mathrm{Att}}(Q, K, V) = \hat{D}^{-1}\left(\phi(Q)\left(\phi(K)^{\top} V\right)\right),$$

with a diagonal normalization $\hat{D} = \mathrm{diag}\left(\phi(Q)\left(\phi(K)^{\top} \mathbf{1}_L\right)\right)$. The computational cost is reduced to $\mathcal{O}(hLMd)$ for sequence length $L$, $M$ random features, and $h$ heads of dimension $d$, ensuring linear time and memory in $L$. The DF-Conformer uses six attention heads, with the random features fixed at initialization (Seki et al., 4 Nov 2025, Koizumi et al., 2021).
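The sketch below illustrates the non-causal FAVOR+ computation under the positive random-feature map above; for brevity the random projections are drawn i.i.d. rather than in orthogonal blocks, and all dimensions (`n_features`, head size) are illustrative.

```python
import math
import torch

def positive_feature_map(X, W):
    # phi(x) = exp(W x - ||x||^2 / 2) / sqrt(M), with rows of W drawn from N(0, I_d).
    M = W.shape[0]
    return torch.exp(X @ W.T - (X ** 2).sum(-1, keepdim=True) / 2) / math.sqrt(M)

def favor_plus_attention(Q, K, V, n_features=128):
    """Non-causal FAVOR+ attention for a single head in O(L * M * d) time."""
    L, d = Q.shape
    scale = d ** -0.25                       # split the usual 1/sqrt(d) between Q and K
    W = torch.randn(n_features, d)           # random projections, kept fixed after init
    phi_q = positive_feature_map(Q * scale, W)   # (L, M)
    phi_k = positive_feature_map(K * scale, W)   # (L, M)
    kv = phi_k.T @ V                         # (M, d): aggregate phi(k_j) v_j once
    z = phi_k.sum(dim=0)                     # (M,):  aggregate phi(k_j)
    num = phi_q @ kv                         # (L, d) unnormalized outputs
    den = (phi_q @ z).unsqueeze(-1)          # (L, 1) diagonal normalizer D_hat
    return num / den

# Sanity check against exact softmax attention on a short sequence.
Q, K, V = torch.randn(3, 64, 16).unbind(0)
approx = favor_plus_attention(Q, K, V, n_features=1024)
exact = torch.softmax(Q @ K.T / 16 ** 0.5, dim=-1) @ V   # close to approx for large M
```

Because $\phi(K)^{\top}V$ and $\phi(K)^{\top}\mathbf{1}_L$ are computed once and reused for every query position, the per-head cost scales as $\mathcal{O}(LMd)$ rather than $\mathcal{O}(L^{2}d)$.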
3. Dilated Depthwise Separable Convolution
Local context aggregation is performed by depthwise-separable 1D convolutions with exponentially increasing dilation. Within a block, the hidden state is batch-normalized, projected through a Gated Linear Unit (GLU), and passed through a dilated depthwise convolution. The kernel size is 3 (default) or 5 (as in TDCN++). Dilation doubles every four blocks, so the receptive field grows exponentially with network depth. This design achieves joint fine-grained and long-range sequence modeling with low overhead. Residual connections and interleaving with FAVOR+ attention complete each convolutional module (Koizumi et al., 2021, Seki et al., 4 Nov 2025).
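A minimal sketch of one such convolutional sub-module is given below, assuming a Conformer-style arrangement of batch normalization, GLU gating, dilated depthwise convolution, and pointwise projection with a residual connection; the exact ordering, activation, and channel widths of the published model may differ.

```python
import torch
import torch.nn as nn

class DilatedConvModule(nn.Module):
    """Depthwise-separable 1D conv sub-module with GLU gating and dilation."""

    def __init__(self, channels=256, kernel_size=3, dilation=1):
        super().__init__()
        self.norm = nn.BatchNorm1d(channels)
        # Pointwise expansion to 2*channels, halved back by the GLU gate.
        self.pointwise_in = nn.Conv1d(channels, 2 * channels, kernel_size=1)
        self.glu = nn.GLU(dim=1)
        # Depthwise (groups=channels) dilated convolution for local context;
        # padding keeps the sequence length unchanged for odd kernels.
        pad = dilation * (kernel_size - 1) // 2
        self.depthwise = nn.Conv1d(channels, channels, kernel_size,
                                   padding=pad, dilation=dilation, groups=channels)
        self.pointwise_out = nn.Conv1d(channels, channels, kernel_size=1)

    def forward(self, x):                       # x: (B, C, T)
        y = self.glu(self.pointwise_in(self.norm(x)))
        y = self.pointwise_out(torch.relu(self.depthwise(y)))
        return x + y                            # residual connection

# Dilation schedule: doubling every four blocks, e.g. 1,1,1,1,2,2,2,2,...
blocks = nn.Sequential(*[DilatedConvModule(dilation=2 ** (b // 4)) for b in range(8)])
out = blocks(torch.randn(2, 256, 100))
```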
4. Hydra State-Space Model Integration
A recent refinement replaces FAVOR+ attention in the DF-Conformer with Hydra, a bidirectional, selective state-space sequence model (SSM) structured as a matrix mixer. While FAVOR+ yields a low-rank attention matrix via random features, Hydra constructs a bidirectional quasi-separable mixing matrix

$$M_{ij} = \begin{cases} \vec{c}^{\,f\top}_{i}\, A^{f}_{i:j}\, \vec{b}^{\,f}_{j}, & i > j \quad \text{(forward SSM)},\\ \delta_i, & i = j \quad \text{(learned diagonal)},\\ \vec{c}^{\,b\top}_{i}\, A^{b}_{j:i}\, \vec{b}^{\,b}_{j}, & i < j \quad \text{(backward SSM)}, \end{cases}$$

with a learned diagonal and separate forward/backward parameters. This framework achieves linear complexity while providing exact, expressive global mixing, eliminating the FAVOR+ approximation bottleneck. The Hydra-enhanced DF-Conformer (“DC-Hydra”) achieves equal or better metrics than both FAVOR+ and softmax attention (DNSMOS, UTMOS, SpkSim, CAcc) in generative SE tasks, with robust scaling up to 96-second sequences and no major compromise in real-time factor or memory usage (Seki et al., 4 Nov 2025).
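As a deliberately naive illustration of this structure, the snippet below materializes a small quasi-separable mixing matrix from scalar-state forward/backward SSM parameters and a learned diagonal. The tensor names are hypothetical, and the actual Hydra layer evaluates the equivalent mixing with matrix-valued states and linear-time scans rather than this $\mathcal{O}(L^2)$ construction.

```python
import torch

def quasiseparable_mixer(L, a_f, b_f, c_f, a_b, b_b, c_b, delta):
    """Naively materialize an L x L bidirectional quasi-separable mixing matrix.

    Scalar-state illustration: the lower triangle is generated by a forward SSM
    recurrence, the upper triangle by a backward SSM with separate parameters,
    and the diagonal is a free learned vector delta.
    """
    M = torch.diag(delta)
    for i in range(L):
        for j in range(L):
            if i > j:    # forward branch: decay accumulated over steps j+1..i
                M[i, j] = c_f[i] * torch.prod(a_f[j + 1:i + 1]) * b_f[j]
            elif i < j:  # backward branch: decay accumulated over steps i+1..j
                M[i, j] = c_b[i] * torch.prod(a_b[i + 1:j + 1]) * b_b[j]
    return M

# Mix a toy sequence globally: y = M @ x, with per-position SSM parameters.
L, d = 6, 4
x = torch.randn(L, d)
params = [torch.rand(L) for _ in range(6)]      # a_f, b_f, c_f, a_b, b_b, c_b
M = quasiseparable_mixer(L, *params, delta=torch.randn(L))
y = M @ x                                        # (L, d) globally mixed output
```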
5. Empirical Performance and Complexity
Extensive evaluation on single-channel speech enhancement, using 3,396 hours of noisy speech (LibriVox + freesound, SNR from −40 to +45 dB), demonstrates the benefits of the DF-Conformer architecture:
| Model | Params (M) | SI-SNRi (dB) | ESTOI (%) | RTF |
|---|---|---|---|---|
| TDCN++ | 8.75 | 14.10 | 85.7 | 0.10 |
| Conv-Tasformer | 8.71 | 14.36 | 85.6 | 0.25 |
| DF-Conformer-8 | 8.83 | 14.43 | 85.4 | 0.13 |
| iDF-Conformer-8 | 17.8 | 15.28 | 87.1 | 0.26 |
| iDF-Conformer-12 | 37.0 | 15.93 | 88.4 | 0.46 |
DF-Conformer-8 surpasses TDCN++ by +0.33 dB SI-SNRi at comparable model size and real-time factor (Koizumi et al., 2021). Hydra-based DC-Hydra matches the FAVOR+ model on DNSMOS (3.44 vs. 3.44) and surpasses it on UTMOS (3.48 vs. 3.33), SpkSim (0.83 vs. 0.79), and CAcc (88.95% vs. 88.24%) at similar parameter counts (106M vs. 98M) (Seki et al., 4 Nov 2025).
6. Design Considerations, Limitations, and Prospects
DF-Conformer’s integration of FAVOR+ with dilated convolution balances local and global sequence modeling, enabling efficient, real-time filterbank-based SE for long sequences (800 frames/s at the 1.25 ms stride). Early blocks exhibit broad, utterance-wide attention patterns (consistent with noise estimation), while deeper blocks transition to more local, near-diagonal patterns (consistent with error correction). Key limitations are the reliance on random-feature approximations (with error governed by the number of random features $M$) and the use of fixed dilation schedules, whose optimality can be task-dependent. The adoption of Hydra SSMs addresses the first limitation, replacing random projections with exact, bidirectional mixing without increasing asymptotic complexity.
Future development directions include learned feature maps for attention, hybrid attention–convolution patterns, joint SE–ASR training with all-Conformer backbones, and systematic comparison to alternative sparse/structured attention methods such as SepFormer and DP-Transformer. A plausible implication is that state-space models, exemplified by Hydra, may supplant existing linear attention mechanisms for modeling long-range dependencies in speech and audio enhancement pipelines (Koizumi et al., 2021, Seki et al., 4 Nov 2025).