DF-Conformer: Dilated FAVOR for Speech Enhancement
- DF-Conformer is a mask-based sequential model for speech enhancement that replaces traditional MHSA with linear FAVOR+ attention and employs exponentially dilated depthwise convolutions.
- The architecture integrates a Conv-TasNet-inspired encoder with a recursively specified stack of DF-Conformer blocks, achieving improved SI-SNRi at competitive real-time factors relative to established baselines.
- Recent extensions using Hydra state-space models address FAVOR+ limitations by enhancing content accuracy while maintaining efficient, linear complexity for long utterances.
The Dilated FAVOR Conformer (DF-Conformer) is a mask-based sequential model for single-channel speech enhancement (SE), integrating the Conformer block’s architectural motifs with both linear-complexity global attention and exponentially dilated depthwise convolution. Originating as an augmentation of Conv-TasNet-style time-domain convolutional network (TDCN) architectures, the DF-Conformer is characterized by its replacement of traditional multi-head self-attention (MHSA) with FAVOR+ (a positive orthogonal random feature method), yielding linear time and memory complexity. Simultaneously, the block replaces standard convolution with exponentially dilated depthwise convolution to scale the local receptive field. This approach allows the architecture to efficiently model long-range dependencies, expanding the effective receptive field while maintaining tractable resource requirements. Empirical benchmarks show that the DF-Conformer achieves higher scale-invariant signal-to-noise ratio improvement (SI-SNRi) at competitive real-time factors relative to established baselines. More recent investigations have critically analyzed the limitations of FAVOR+ and demonstrated that structured state-space sequence models (Hydra, a bidirectional extension of Mamba) can further improve performance while keeping linear complexity (Koizumi et al., 2021, Seki et al., 4 Nov 2025).
1. Architectural Design and Pipeline
The DF-Conformer’s processing pipeline accepts a raw audio sequence and proceeds as follows:
- Encoding: The audio is analyzed by a trainable encoder filterbank as in Conv-TasNet; typical parameters are a window size of 2.5 ms and a hop size of 1.25 ms, yielding an encoded matrix $\mathrm{Enc}(x) \in \mathbb{R}^{T \times D}$ with $T$ frames and $D$ feature channels.
- Mask Prediction: The encoded representation is input to a mask prediction network $M(\cdot)$, composed of stacked DF-Conformer blocks, which produces a time-frequency mask $M(\mathrm{Enc}(x)) \in \mathbb{R}^{T \times D}$.
- Masking and Decoding: The mask is applied element-wise in the embedding space, and the masked representation is decoded using a learnable decoder with an overlap-add mechanism:
$y = \mathrm{Dec}(\mathrm{Enc}(x) \odot M(\mathrm{Enc}(x)))$
- The sole change introduced by the DF-Conformer, relative to TDCN++ or Conv-TasNet, is the replacement of the TDCN blocks in the mask-prediction network with DF-Conformer blocks (Koizumi et al., 2021).
The mask-prediction network is specified recursively:
- $Z^{0} = \mathrm{Dense}_1(\mathrm{Enc}(x))$
- For $b = 1, \dots, B$: set the dilation $d_b$ (increasing exponentially with the block index) and compute $Z^{b} = \mathrm{DFConformerBlock}_b(Z^{b-1}; d_b)$.
- Final mask: $M = \mathrm{Dense}_2(Z^{B})$.

The stack uses $B$ blocks, arranged as repeats over a set of distinct dilation values; a minimal sketch of the full pipeline follows.
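A minimal PyTorch sketch of this encode–mask–decode pipeline is given below. It assumes a 16 kHz sample rate (so the 2.5 ms window and 1.25 ms hop become a kernel of 40 samples and a stride of 20), a sigmoid mask output, an illustrative dilation pattern, and a trivial placeholder standing in for the DF-Conformer block of Section 2; names and dimensions are illustrative rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class PlaceholderBlock(nn.Module):
    """Stand-in for the DF-Conformer block detailed in Section 2."""
    def __init__(self, dim, dilation):
        super().__init__()
        self.dilation = dilation                         # unused here; consumed by the real block
        self.ff = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, z):
        return z + self.ff(z)

class MaskNet(nn.Module):
    """Mask network M: Dense -> B blocks with growing dilation -> Dense -> sigmoid."""
    def __init__(self, dim, num_blocks, block_cls=PlaceholderBlock):
        super().__init__()
        self.dense_in = nn.Linear(dim, dim)
        self.blocks = nn.ModuleList(
            [block_cls(dim, dilation=2 ** b) for b in range(num_blocks)]
        )
        self.dense_out = nn.Linear(dim, dim)

    def forward(self, z):                                # z: (batch, frames, dim)
        z = self.dense_in(z)
        for block in self.blocks:
            z = block(z)
        return torch.sigmoid(self.dense_out(z))          # mask in [0, 1] (sigmoid assumed here)

class MaskBasedSE(nn.Module):
    """y = Dec(Enc(x) ⊙ M(Enc(x))), with a learnable Conv-TasNet-style filterbank."""
    def __init__(self, dim=256, num_blocks=8, win=40, hop=20):
        super().__init__()
        self.enc = nn.Conv1d(1, dim, kernel_size=win, stride=hop, bias=False)
        self.mask_net = MaskNet(dim, num_blocks)
        self.dec = nn.ConvTranspose1d(dim, 1, kernel_size=win, stride=hop, bias=False)

    def forward(self, x):                                # x: (batch, samples)
        e = self.enc(x.unsqueeze(1))                     # (batch, dim, frames)
        m = self.mask_net(e.transpose(1, 2)).transpose(1, 2)
        return self.dec(e * m).squeeze(1)                # masked decode with overlap-add

model = MaskBasedSE()
y = model(torch.randn(2, 16000))                         # 1 s of 16 kHz audio -> enhanced audio
print(y.shape)                                           # torch.Size([2, 16000])
```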
2. DF-ConformerBlock: Internal Mechanisms
Each DF-ConformerBlock fuses local and global context through:
- Macaron-style Feed-Forward Integration: Two feed-forward modules with $1/2$-scaled residual connections, one applied before and one after the attention and convolution modules.
- Linear-Attention Module: Replaces full softmax MHSA with FAVOR+, providing global context at linear complexity.
- Dilated Depthwise Convolution: Depthwise 1-D convolution with exponentially increasing dilation $d_b$, expanding the temporal receptive field as the block stack deepens.
The explicit structure of the $b$-th block, mapping $Z^{b-1}$ to $Z^{b}$, follows the standard macaron Conformer layout:
1. $\tilde{Z} = Z^{b-1} + \tfrac{1}{2}\,\mathrm{FFN}_1(Z^{b-1})$
2. $Z' = \tilde{Z} + \mathrm{FAVOR{+}}(\tilde{Z})$
3. $Z'' = Z' + \mathrm{DConv}(Z'; d_b)$
4. $Z''' = Z'' + \tfrac{1}{2}\,\mathrm{FFN}_2(Z'')$
5. $Z^{b} = \mathrm{LayerNorm}(Z''')$
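The residual wiring above can be sketched directly, with the feed-forward, attention, and convolution submodules passed in as interchangeable callables; the per-submodule layer norms of the full Conformer block are assumed to live inside those callables, so only the half-step residual scaling and the final normalization are shown.

```python
import torch.nn as nn

class DFConformerBlockSketch(nn.Module):
    """Macaron wiring: ½·FFN -> global attention -> dilated DW-conv -> ½·FFN -> LayerNorm."""
    def __init__(self, dim, ff1, attn, conv, ff2):
        super().__init__()
        self.ff1, self.attn, self.conv, self.ff2 = ff1, attn, conv, ff2
        self.norm = nn.LayerNorm(dim)

    def forward(self, z):                    # z: (batch, frames, dim)
        z = z + 0.5 * self.ff1(z)            # first feed-forward, half-step residual
        z = z + self.attn(z)                 # FAVOR+ (or Hydra) global mixing
        z = z + self.conv(z)                 # exponentially dilated depthwise convolution
        z = z + 0.5 * self.ff2(z)            # second feed-forward, half-step residual
        return self.norm(z)                  # final layer normalization
```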
Dilated depthwise convolution follows
$$[\mathrm{DConv}(Z; d_b)]_{t,c} = \sum_{k=0}^{K-1} W_{k,c}\, Z_{t + d_b (k - \lfloor K/2 \rfloor),\, c},$$
with kernel length $K = 3$–$5$ (as in TDCN++) and exponentially increasing dilation $d_b$.
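A minimal sketch of the dilated depthwise convolution, using a `groups`-separated `nn.Conv1d` with "same"-style padding; the kernel size of 3 and the specific dilation value are illustrative.

```python
import torch
import torch.nn as nn

class DilatedDepthwiseConv(nn.Module):
    """Depthwise 1-D convolution with dilation d; padding keeps the frame count unchanged."""
    def __init__(self, dim, kernel_size=3, dilation=1):
        super().__init__()
        pad = dilation * (kernel_size - 1) // 2
        self.dw = nn.Conv1d(dim, dim, kernel_size, padding=pad,
                            dilation=dilation, groups=dim)   # one filter per channel

    def forward(self, z):                     # z: (batch, frames, dim)
        return self.dw(z.transpose(1, 2)).transpose(1, 2)

# Receptive field grows exponentially with the block index b:
# kernel 3 with dilation 2**b spans (3 - 1) * 2**b + 1 frames per block.
conv = DilatedDepthwiseConv(dim=256, kernel_size=3, dilation=8)
out = conv(torch.randn(2, 100, 256))          # -> (2, 100, 256)
```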
3. FAVOR+ Attention: Formulation and Trade-offs
FAVOR+ is a linear-complexity approximation to softmax attention in MHSA. The canonical softmax attention for queries $Q$, keys $K$, and values $V \in \mathbb{R}^{T \times d}$ is
$$\mathrm{Att}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d}}\right)V.$$
FAVOR+ replaces the softmax with a positive random-feature map $\phi(\cdot)$:
$$\widehat{\mathrm{Att}}(Q, K, V) = D^{-1}\,\phi(Q)\!\left(\phi(K)^{\top}V\right), \qquad D = \mathrm{diag}\!\left(\phi(Q)\!\left(\phi(K)^{\top}\mathbf{1}_{T}\right)\right),$$
with $\phi : \mathbb{R}^{d} \to \mathbb{R}_{+}^{r}$ constructed from positive orthogonal random features (a minimal sketch appears at the end of this section).
- Complexity: FAVOR+ attention in the DF-Conformer scales as $\mathcal{O}(T)$ in both time and memory with respect to the sequence length $T$.
- Limitations: Empirical results indicate two primary limitations:
- Blurring and Low-Rank Patterns: The random feature approximation can lead to flattened, less selective attention distributions, hindering fine-grained alignment.
- Semantic Confusion: Non-injectivity of the feature map $\phi$ allows semantically distinct queries to exhibit nearly identical attention rows, reducing feature-space expressivity.
A plausible implication is that these approximations, while beneficial for efficiency, trade-off some representational focus compared to full softmax attention (Seki et al., 4 Nov 2025).
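The two formulations can be contrasted in a short NumPy sketch. The positive feature map below follows the exp-based FAVOR+ construction, but the orthogonalization of the random projections is omitted for brevity, so the approximation quality is slightly worse than in full FAVOR+; all shapes and the random-feature count `r` are illustrative.

```python
import numpy as np

def softmax_attention(Q, K, V):
    """Exact attention: O(T^2 d) time, O(T^2) memory."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    A = np.exp(scores - scores.max(axis=-1, keepdims=True))
    A /= A.sum(axis=-1, keepdims=True)
    return A @ V

def favor_features(X, W):
    """Positive random features phi(x) = exp(W x - |x|^2 / 2) / sqrt(r)."""
    r = W.shape[0]
    return np.exp(X @ W.T - 0.5 * np.sum(X**2, axis=-1, keepdims=True)) / np.sqrt(r)

def favor_attention(Q, K, V, W):
    """Linear attention: O(T r d) time, no T-by-T matrix is ever formed."""
    d = Q.shape[-1]
    Qf = favor_features(Q / d**0.25, W)      # rescaled so phi(q)^T phi(k) ~ exp(q^T k / sqrt(d))
    Kf = favor_features(K / d**0.25, W)
    num = Qf @ (Kf.T @ V)                    # phi(Q) (phi(K)^T V)
    den = Qf @ Kf.sum(axis=0)                # phi(Q) (phi(K)^T 1_T)
    return num / den[:, None]

T, d, r = 1000, 64, 128
rng = np.random.default_rng(0)
Q, K, V = rng.standard_normal((3, T, d))
W = rng.standard_normal((r, d))              # random projections (orthogonalized in full FAVOR+)
exact = softmax_attention(Q, K, V)
approx = favor_attention(Q, K, V, W)
print(np.abs(exact - approx).mean())         # mean error of the random-feature estimate
```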
4. Computational Complexity and Empirical Performance
The DF-Conformer’s resource profile is characterized as follows:
| Model | Params (M) | SI-SNRi (dB) | RTF (CPU) |
|---|---|---|---|
| TDCN++ | 8.75 | 14.10 | 0.10 |
| Conv-Tasformer | 8.71 | 14.36 | 0.25 |
| DF-Conformer-8 | 8.83 | 14.43 | 0.13 |
| iTDCN++ | 17.6 | 14.84 | 0.22 |
| iDF-Conformer-8 | 17.8 | 15.28 | 0.26 |
| iDF-Conformer-12 | 37.0 | 15.93 | 0.46 |
- Baseline Comparisons: DF-Conformer-8 outperforms TDCN++ (+0.33 dB SI-SNRi) at a similar real-time factor (RTF) and parameter count.
- Scalability: Linear time scaling with sequence length is preserved for both the dilated convolutional and FAVOR+ attention modules, so the architecture supports processing of long utterances that are intractable with quadratic-complexity attention (see the cost comparison after the ablations below).
- Ablations:
- Removing dilated convolution (“F-Conformer-8”) drops SI-SNRi by $0.62$ dB, highlighting the importance of both local and global modules.
- Substituting FAVOR+ with standard MHSA substantially increases computational cost, with only marginal SI-SNRi improvement (depending on model capacity).
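To make the scaling argument concrete, the back-of-the-envelope comparison below counts only the dominant matrix-multiply terms of a single attention layer, assuming the 1.25 ms encoder hop from Section 1 together with an illustrative head dimension of 64 and 128 random features; heads and constant factors are ignored.

```python
# Rough per-layer cost of the attention term only (multiply-accumulates, single head).
hop_ms = 1.25
d, r = 64, 128                              # head dim / number of random features (assumed)

for seconds in (1, 10, 60):
    T = int(seconds * 1000 / hop_ms)        # encoder frames at a 1.25 ms hop
    softmax_cost = 2 * T * T * d            # QK^T and AV: O(T^2 d)
    favor_cost = 2 * T * r * d              # phi(K)^T V and phi(Q)(.): O(T r d)
    print(f"{seconds:>3}s  T={T:>6}  softmax/FAVOR+ cost ratio ~ {softmax_cost / favor_cost:,.0f}x")
```

The ratio is simply $T/r$, so the advantage grows linearly with utterance length (roughly 375x for a one-minute input under these assumptions).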
5. Training Regimen and Hyperparameters
DF-Conformer models have been trained on large-scale noisy speech corpora (over 4 million examples; 3,396.8 hours) in a single-channel setup. Notable settings:
- Encoder/Decoder: Conv-TasNet filterbanks (2.5 ms window / 1.25 ms hop).
- Loss: Negative log-thresholded SNR,
$$\mathcal{L}_{\mathrm{SNR}}(y, \hat{y}) = 10\log_{10}\!\left(\lVert y - \hat{y}\rVert^{2} + \tau\lVert y\rVert^{2}\right) - 10\log_{10}\lVert y\rVert^{2}, \qquad \tau = 10^{-\mathrm{SNR}_{\max}/10},$$
combined as a joint speech/noise masking loss (0.8/0.2 weighting) with a mixture-consistency projection (a minimal sketch follows this list).
- Optimizer: Adam, with gradient clipping at $5.0$ and a scheduled learning rate.
- Batching: 500k steps on 128 TPUv3 cores, batch size $512$; EMA decay $0.9999$.
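A minimal sketch of this objective, assuming the standard thresholded form with a 30 dB soft threshold (an assumed value, not stated above) and the 0.8/0.2 speech/noise weighting; the mixture-consistency projection is shown in its simplest unweighted form, and all function names are illustrative.

```python
import torch

def neg_thresholded_snr(ref, est, snr_max_db=30.0, eps=1e-8):
    """Negative log-thresholded SNR; snr_max_db = 30 is an assumed soft-threshold value."""
    tau = 10.0 ** (-snr_max_db / 10.0)
    power = torch.sum(ref ** 2, dim=-1)
    err = torch.sum((ref - est) ** 2, dim=-1)
    return 10.0 * torch.log10(err + tau * power + eps) - 10.0 * torch.log10(power + eps)

def mixture_consistency(est_speech, est_noise, mixture):
    """Project the estimates so they sum to the input mixture (unweighted projection)."""
    residual = mixture - (est_speech + est_noise)
    return est_speech + 0.5 * residual, est_noise + 0.5 * residual

def se_loss(mixture, est_speech, est_noise, ref_speech, ref_noise):
    """Joint speech/noise loss with 0.8/0.2 weighting, after mixture-consistency projection."""
    est_speech, est_noise = mixture_consistency(est_speech, est_noise, mixture)
    return (0.8 * neg_thresholded_snr(ref_speech, est_speech)
            + 0.2 * neg_thresholded_snr(ref_noise, est_noise)).mean()

# Example with random tensors shaped (batch, samples).
s, n = torch.randn(4, 16000), torch.randn(4, 16000)
est = torch.randn(4, 16000)
print(se_loss(mixture=s + n, est_speech=est, est_noise=s + n - est,
              ref_speech=s, ref_noise=n))
```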
Model instantiations (non-iterative):
- DF-Conformer-8: $8$ blocks, $6$ attention heads.
- DF-Conformer-12: $12$ blocks, $8$ attention heads.
6. Extensions: Hydra and State-Space Mixing
Recent research identifies and addresses the random feature-induced limitations of FAVOR+, proposing a replacement in the form of bidirectional state-space sequence models (Seki et al., 4 Nov 2025):
- Hydra Module: Extends the (unidirectional) Mamba mixer to a bidirectional, selective, structured state-space model within the Conformer block. For input $Z = (z_1, \dots, z_T) \in \mathbb{R}^{T \times D}$, the channel-wise state-space recurrence is
$$h_t = \bar{A}_t h_{t-1} + \bar{B}_t z_t, \qquad y_t = C_t h_t,$$
with $\bar{A}_t$, $\bar{B}_t$, and $C_t$ dynamically generated per input via a light gating network (a minimal recurrence sketch follows the results table below).
- Matrix mixer view: Produces a semiseparable or quasiseparable linear transformation with explicit forward, backward, and diagonal (self-interaction) structure.
- Integration: Drop-in substitution for FAVOR+ in the DF-Conformer block: the same macaron architecture, with Hydra replacing the call to the MhsaFavorModule.
- Complexity: Retains $\mathcal{O}(TN)$ per channel, where the state size $N$ is typically much smaller than the sequence length $T$.
- Empirical Advantages: Hydra eliminates the blurring and semantic confusion of FAVOR+, preserving exactness and bidirectional mixing capacity.
- Speech Enhancement Results: On DAPS (Seki et al., 4 Nov 2025),
| Model | DNSMOS | UTMOS | Speaker Similarity | Content Acc. (%) |
|---|---|---|---|---|
| Softmax | 3.46 | 3.53 | 0.83 | 87.88 |
| FAVOR+ | 3.44 | 3.33 | 0.79 | 88.24 |
| Bi-Mamba | 3.44 | 3.27 | 0.81 | 88.04 |
| Hydra | 3.44 | 3.48 | 0.83 | 88.95 |
Hydra matches or exceeds softmax on most metrics and substantially outperforms FAVOR+ on content accuracy, suggesting that the state-space approach delivers stronger sequential modeling while sustaining linear scaling. Hydra also remains robust as sequence length grows, whereas softmax attention degrades significantly and FAVOR+ stays constant but mildly suboptimal.
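A minimal NumPy sketch of the channel-wise recurrence above, using a simplified Mamba-style discretization: the matrices `Wb`, `Wc`, and `Wd` stand in for the light gating network that makes the parameters input-dependent, and a forward scan, a backward scan, and a diagonal skip term are summed to mimic the forward/backward/diagonal mixer structure. The actual Hydra parameterization, gating, and normalization differ; everything named here is illustrative.

```python
import numpy as np

def selective_ssm_scan(z, A, Wb, Wc, Wd, reverse=False):
    """Channel-wise diagonal selective SSM: h_t = A_bar_t * h_{t-1} + B_bar_t * z_t, y_t = C_t · h_t."""
    T, D = z.shape
    N = A.shape[0]                                   # state size per channel (N << T)
    h = np.zeros((D, N))                             # one length-N state per channel
    y = np.zeros_like(z)
    order = range(T - 1, -1, -1) if reverse else range(T)
    for t in order:
        delta = np.log1p(np.exp(z[t] @ Wd))          # softplus step size per channel: (D,)
        B_t, C_t = z[t] @ Wb, z[t] @ Wc              # input-dependent projections: (N,)
        a_bar = np.exp(delta[:, None] * A[None, :])  # discretized decay in (0, 1): (D, N)
        h = a_bar * h + (delta[:, None] * B_t[None, :]) * z[t][:, None]
        y[t] = h @ C_t                               # per-channel readout: (D,)
    return y

def hydra_mixer(z, params_fwd, params_bwd, d_skip=1.0):
    """Bidirectional mixing: forward scan + backward scan + diagonal (self-interaction) term."""
    return (selective_ssm_scan(z, *params_fwd)
            + selective_ssm_scan(z, *params_bwd, reverse=True)
            + d_skip * z)

T, D, N = 200, 64, 16                                # N << T keeps the cost at O(T N) per channel
rng = np.random.default_rng(0)
z = rng.standard_normal((T, D))
def make_params():
    return (-np.arange(1, N + 1, dtype=float),       # fixed negative decay rates (stable)
            rng.standard_normal((D, N)) / np.sqrt(D),
            rng.standard_normal((D, N)) / np.sqrt(D),
            rng.standard_normal((D, D)) / np.sqrt(D))
out = hydra_mixer(z, make_params(), make_params())
print(out.shape)                                     # (200, 64)
```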
7. Significance, Impact, and Outlook
The DF-Conformer demonstrates that integrating linear-complexity self-attention (FAVOR+) and exponentially dilated convolutional modules yields efficient, accurate speech enhancement on long and challenging utterances. The architecture enables practical, scalable, and trainable masking in both non-causal and iterative configurations while maintaining competitive or superior SI-SNRi relative to state-of-the-art convolutional baselines.
Limitations of random feature approximation in FAVOR+ (blurring, non-injectivity) have motivated the exploration of structured state-space models such as Hydra, which, while slightly more parameter-intensive, resolve these issues and advance the empirical performance frontier for generative SE architectures. The modularity of the design allows rapid substitution and scaling, setting the stage for further hybridization of efficient global operators in both speech and broader sequence modeling tasks.