MFA-Conformer: Multi-scale Feature Aggregation
- The paper demonstrates that systematic multi-scale feature aggregation significantly improves ASV and anti-spoofing performance with reduced EER compared to prior models.
- The methodology integrates a Conformer encoder using convolutional subsampling and concatenates intermediate block outputs to capture both local and global cues.
- Empirical results reveal up to a 21% relative EER reduction and faster inference, validating the efficiency and robustness of the MFA-Conformer design.
The Multi-scale Feature Aggregation Conformer (MFA-Conformer) is an architectural framework for sequential representation learning that employs Convolution-augmented Transformer (Conformer) blocks with explicit multi-scale feature aggregation. Distinguished by its systematic concatenation of intermediate outputs from all Conformer blocks prior to pooling and downstream classification, the MFA-Conformer delivers empirically validated gains in tasks that require robust extraction of local and global cues, including automatic speaker verification (ASV) and anti-spoofing countermeasures. The model's design facilitates efficient joint modeling of varying temporal context ranges, leading to superior recognition performance and efficiency compared to popular architectures such as ECAPA-TDNN and baseline Conformer systems (Zhang et al., 2022, Wang et al., 2023, Dixit et al., 2024).
1. Core Architectural Components
MFA-Conformer is built on the processing pipeline: input features → convolutional subsampling → a stack of Conformer blocks → multi-scale feature concatenation → pooling → linear projection for embedding.
- Input Preprocessing: Spoken utterances are transformed into sequences of 80-dimensional log-Mel filterbank (FBANK) features, typically with a 25 ms window and 10 ms hop (Zhang et al., 2022, Dixit et al., 2024, Wang et al., 2023).
- Convolutional Subsampling Layer: Two-dimensional (or sometimes one-dimensional) strided convolutions reduce temporal and/or frequency resolution to mitigate computational cost. For instance, a (2,2) strided 2D convolution halves the time and frequency dimensions (Zhang et al., 2022).
- Conformer Encoder Stack: A sequence of $L$ Conformer blocks, each with model dimension $d$, jointly models global (self-attention) and local (convolution module) dependencies. The number of blocks differs between the ASV and countermeasure configurations (Zhang et al., 2022, Wang et al., 2023).
- Multi-scale Feature Aggregation: Instead of outputting only from the last block, MFA-Conformer concatenates all block outputs (post-layernorm) along the feature/channel dimension to form a dense multi-scale representation. This captures feature hierarchies for downstream pooling and projection (Zhang et al., 2022, Wang et al., 2023, Dixit et al., 2024).
- Pooling and Projection Head: The aggregated feature map undergoes attentive statistics pooling (ASP), yielding the weighted mean and standard deviation across time frames, followed by a fully connected layer (with batch normalization) to realize a fixed-dimensional speaker or artifact embedding (Zhang et al., 2022, Dixit et al., 2024, Wang et al., 2023).
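To make the tensor shapes in this pipeline concrete, the following sketch traces one utterance through framing, subsampling, aggregation, and pooling. The 16 kHz sample rate, the 3-second duration, the convolution kernel size and padding, and the values `num_blocks = 6` and `model_dim = 256` are illustrative assumptions, not figures taken from the cited papers.

```python
def num_frames(num_samples, sample_rate=16000, win_ms=25, hop_ms=10):
    """Number of full analysis windows that fit, assuming no padding."""
    win = sample_rate * win_ms // 1000   # 400 samples at 16 kHz
    hop = sample_rate * hop_ms // 1000   # 160 samples at 16 kHz
    return 1 + (num_samples - win) // hop

def conv_out(size, kernel=3, stride=2, padding=1):
    """Standard output-length formula for a strided convolution."""
    return (size + 2 * padding - kernel) // stride + 1

frames = num_frames(3 * 16000)           # 3 s of audio (assumed)
mel_bins = 80                            # FBANK dimension from the text
t_sub, f_sub = conv_out(frames), conv_out(mel_bins)  # (2,2)-strided subsampling

num_blocks, model_dim = 6, 256           # hypothetical encoder configuration
concat_dim = num_blocks * model_dim      # width after multi-scale concatenation
pooled_dim = 2 * concat_dim              # ASP stacks mean and standard deviation

print(frames, (t_sub, f_sub), concat_dim, pooled_dim)
# prints: 298 (149, 40) 1536 3072
```

Under these assumptions the concatenated representation is `num_blocks` times wider than a single block output, which is why the projection layer after pooling is needed to return to a compact embedding.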
2. Detailed Module Specifications
Conformer Block Construction
Each Conformer block consists of:
- Pre-norm Macaron-style feed-forward modules split before and after the main sequence operations, each with half-step residuals.
- Multi-Head Self-Attention (MHSA) with relative positional encodings, using 4 heads and scaled dot-product attention for model dimension $d$, typically 44–256.
- Convolution module (pointwise convolution + GLU, followed by depthwise convolution and batch normalization, then a Swish nonlinearity).
- All sub-modules are wrapped with residual connections and layer normalization (Zhang et al., 2022, Dixit et al., 2024, Wang et al., 2023).
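The residual wiring of one block can be sketched as follows. This is a minimal NumPy illustration of the Macaron ordering only: the four sub-modules are passed in as plain callables, the layer normalization has no learned scale or shift, and the names `conformer_block` and `layer_norm` are hypothetical helpers rather than an API from the cited work.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each frame over the feature axis (no learned parameters)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def conformer_block(x, ffn1, mhsa, conv, ffn2):
    """Pre-norm Macaron wiring: FFN (half step), MHSA, convolution, FFN (half step)."""
    x = x + 0.5 * ffn1(layer_norm(x))   # first feed-forward, half-step residual
    x = x + mhsa(layer_norm(x))         # global context via self-attention
    x = x + conv(layer_norm(x))         # local context via the convolution module
    x = x + 0.5 * ffn2(layer_norm(x))   # second feed-forward, half-step residual
    return layer_norm(x)                # final normalization of the block output

# Toy stand-ins for the real sub-modules, used only to exercise the wiring.
toy = lambda h: 0.1 * h
x = np.random.default_rng(0).standard_normal((20, 8))   # (frames, features)
y = conformer_block(x, toy, toy, toy, toy)
print(y.shape)
```

Because every sub-module sits on a residual branch, the block preserves the input shape, which is what allows the outputs of all blocks to be concatenated frame by frame later.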
Multi-scale Aggregation Formulation
Let $h_i \in \mathbb{R}^{T \times d}$ denote the output from the $i$-th block, for $i = 1, \dots, L$. The multi-scale aggregation step is:
$$H = \mathrm{Concat}(h_1, h_2, \dots, h_L) \in \mathbb{R}^{T \times Ld},$$
followed by layer normalization and ASP pooling. Segment-level embedding vectors are produced by concatenating the mean and standard deviation across the time axis:
$$e = [\mu; \sigma],$$
where $\mu$ and $\sigma$ are computed using learned attention weights over frames (Zhang et al., 2022, Dixit et al., 2024).
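A minimal NumPy sketch of this step is given below, assuming a single scalar attention score per frame computed by a small tanh layer. The attention parameterization (`W`, `b`, `v`), the hidden size of 16, and the toy sizes `T`, `d`, `L` are assumptions made for illustration; the cited papers may use a different attention form.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def attentive_stats_pooling(h, W, b, v):
    """Attention-weighted mean and standard deviation over the time axis."""
    scores = np.tanh(h @ W + b) @ v                    # one score per frame, shape (T,)
    scores = scores - scores.max()                     # stabilize the softmax
    alpha = np.exp(scores) / np.exp(scores).sum()      # attention weights, sum to 1
    mu = (alpha[:, None] * h).sum(axis=0)
    var = (alpha[:, None] * h ** 2).sum(axis=0) - mu ** 2
    sigma = np.sqrt(np.clip(var, 1e-8, None))          # guard against tiny negatives
    return np.concatenate([mu, sigma])

rng = np.random.default_rng(0)
T, d, L = 50, 8, 3                                     # toy sizes (assumed)
block_outputs = [rng.standard_normal((T, d)) for _ in range(L)]

H = layer_norm(np.concatenate(block_outputs, axis=-1)) # (T, L*d) multi-scale features
W = rng.standard_normal((L * d, 16)) * 0.1
b = np.zeros(16)
v = rng.standard_normal(16) * 0.1
embedding = attentive_stats_pooling(H, W, b, v)        # shape (2*L*d,)
print(H.shape, embedding.shape)
```

When the attention scores are all equal, the weights become uniform and the pooled mean reduces to the ordinary temporal average, so plain statistics pooling is a special case of this layer.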
Embedding Projection and Loss Functions
The action of the pooling head and projection can be summarized as:
- Each pooled vector is batch-normalized and linearly projected to a fixed embedding dimensionality $D$ for speaker embeddings (Zhang et al., 2022, Dixit et al., 2024).
- A variant aggregates the blockwise pooled vectors $e_1, \dots, e_L$ into a single vector for additional projection (Dixit et al., 2024).
- Losses include Additive Margin Softmax (AM-Softmax) (Zhang et al., 2022, Dixit et al., 2024), with options for blockwise supervision using contrastive loss variants (see Section 5).
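For reference, AM-Softmax subtracts a fixed margin from the target-class cosine similarity before a scaled softmax. The sketch below is a generic NumPy version of that loss; the default `scale=30.0` and `margin=0.2` are common choices assumed here for illustration, not hyperparameters confirmed by the cited papers.

```python
import numpy as np

def am_softmax_loss(embeddings, class_weights, labels, scale=30.0, margin=0.2):
    """Additive-margin softmax: cross-entropy on scale * (cos(theta) - margin * onehot)."""
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    w = class_weights / np.linalg.norm(class_weights, axis=1, keepdims=True)
    cos = emb @ w.T                                    # cosine similarities, (N, C)
    rows = np.arange(len(labels))
    logits = scale * cos
    logits[rows, labels] = scale * (cos[rows, labels] - margin)  # penalize target class
    logits = logits - logits.max(axis=1, keepdims=True)          # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[rows, labels].mean()

rng = np.random.default_rng(0)
emb = rng.standard_normal((8, 16))        # toy batch of embeddings
weights = rng.standard_normal((4, 16))    # one weight vector per speaker class
labels = rng.integers(0, 4, size=8)

loss_with_margin = am_softmax_loss(emb, weights, labels)
loss_no_margin = am_softmax_loss(emb, weights, labels, margin=0.0)
print(loss_with_margin > loss_no_margin)
```

The margin makes the training objective strictly harder than plain normalized softmax, which pushes same-speaker embeddings closer to their class direction.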
3. Empirical Performance and Model Comparison
Across ASV and anti-spoofing tasks, MFA-Conformer achieves state-of-the-art or near state-of-the-art error rates with a parameter count comparable to ECAPA-TDNN and a lower inference real-time factor than ECAPA-TDNN, though not lower than ResNet34.
| Model/System | #Params | EER (VoxCeleb1-O, %) | EER (FAD-Clean SEEN/UNSEEN, %) | Inference real-time factor (lower is faster) |
|---|---|---|---|---|
| ResNet34 | 23.2M | 1.03 | — | 0.0088 |
| ECAPA-TDNN | 20.8M | 0.82 | — | 0.0180 |
| MFA-Conformer (1/2) | 20.5M | 0.64 | — | 0.0121 |
| MFA-Conformer (ASR-pre) | 13M | — | 0.05 / 27.32 | — |
| MFA-Conformer (ASV-pre) | 13M | — | 0.14 / 25.62 | — |
| MFA-Conformer + MFCon loss | — | 2.41† | — | — |
†Best result with combined contrastive supervision and AM-Softmax (Dixit et al., 2024).
On ASV benchmarks, MFA-Conformer with 1/2 subsampling exhibits a 21% relative equal error rate (EER) reduction versus ECAPA-TDNN, while being 32% faster in inference (Zhang et al., 2022). For anti-spoofing, MFA-Conformer with ASR pretraining attains 0.05% EER (SEEN) and 27.32% EER (UNSEEN) on FAD-clean, outperforming baseline SSL and CNN models (Wang et al., 2023).
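The relative figures can be recomputed directly from the table rows for ECAPA-TDNN and MFA-Conformer (1/2). The short check below uses only values from the table; `relative_reduction` is a hypothetical helper name.

```python
def relative_reduction(baseline, improved):
    """Fractional reduction of `improved` relative to `baseline`."""
    return (baseline - improved) / baseline

eer_gain = relative_reduction(0.82, 0.64)        # EER on VoxCeleb1-O, in percent
rtf_gain = relative_reduction(0.0180, 0.0121)    # inference real-time factor

print(f"relative EER reduction: {eer_gain:.1%}")
print(f"relative RTF reduction: {rtf_gain:.1%}")
```

The results are roughly 22% and 33%, so the quoted 21% and 32% appear to be truncated rather than rounded values.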
Ablation studies consistently show that removing multi-scale aggregation or the convolution module causes the greatest EER increase, confirming their essential contributions (Zhang et al., 2022, Wang et al., 2023).
4. Application Domains and Transfer Strategies
While originally proposed for speaker verification, the MFA-Conformer has been adapted for anti-spoofing countermeasures and demonstrates robust cross-lingual generalization (Wang et al., 2023). In this context:
- Pretraining the Conformer encoder with CTC-based ASR or AAM-Softmax-based ASV improves downstream anti-spoofing robustness.
- Fine-tuning proceeds via a two-stage strategy: freezing the encoder to train the classifier head, then joint training, stabilizing convergence for small target datasets.
- The model achieves competitive error rates across multiple corpora, including FAD (Chinese) and ASVspoof-2019/2021 (English) (Wang et al., 2023).
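The two-stage schedule can be illustrated with a deliberately tiny stand-in model: a one-layer tanh "encoder" and a logistic "head" trained with plain gradient descent. Everything here (the toy data, the model sizes, the learning rates, and the `train` helper) is an assumption for illustration and bears no relation to the actual Conformer training recipe.

```python
import numpy as np

def train(x, y, enc_w, head_w, steps, lr, train_encoder):
    """Gradient descent on binary cross-entropy for a tanh encoder + logistic head."""
    enc_w, head_w = enc_w.copy(), head_w.copy()
    for _ in range(steps):
        h = np.tanh(x @ enc_w)                       # encoder features, (N, H)
        p = 1.0 / (1.0 + np.exp(-(h @ head_w)))      # predicted probabilities, (N,)
        dz = (p - y) / len(y)                        # gradient w.r.t. the logits
        grad_head = h.T @ dz
        if train_encoder:                            # stage 2 only: update the encoder
            dh = np.outer(dz, head_w) * (1.0 - h ** 2)
            enc_w -= lr * (x.T @ dh)
        head_w -= lr * grad_head
    return enc_w, head_w

rng = np.random.default_rng(0)
x = rng.standard_normal((64, 10))
y = (x[:, 0] > 0).astype(float)                      # toy binary labels
pretrained_enc = rng.standard_normal((10, 6)) * 0.5  # stands in for a pretrained encoder
head_init = np.zeros(6)

# Stage 1: encoder frozen, only the classifier head is trained.
enc_stage1, head_stage1 = train(x, y, pretrained_enc, head_init,
                                steps=100, lr=0.5, train_encoder=False)
# Stage 2: joint training, here with a smaller learning rate (an assumption).
enc_stage2, head_stage2 = train(x, y, enc_stage1, head_stage1,
                                steps=100, lr=0.1, train_encoder=True)
print(np.array_equal(enc_stage1, pretrained_enc))
```

The point of the first stage is that a randomly initialized head cannot send large, uninformative gradients into the pretrained encoder, which matters most when the target dataset is small.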
5. Recent Extensions: Contrastive Multi-Scale Supervision
Building on the foundation established by multi-scale aggregation, subsequent work has introduced multi-scale feature contrastive (MFCon) loss functions to further enhance intermediate feature discriminability (Dixit et al., 2024). This involves:
- Applying supervised contrastive loss to the blockwise embeddings (one per Conformer block) after pooling.
- The total loss becomes a combination of AM-Softmax applied to the final embedding and averaged supervised contrastive losses across all blocks.
- Ablations show that separate pooling and projection for each block achieves the best EER; sharing parameters degrades performance.
- The combined use of MFCon and AM-Softmax on the concatenated speaker embedding yields an additional 9.05% relative EER improvement compared to MFA-Conformer baseline on VoxCeleb1-O.
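The combined objective described above can be sketched in NumPy as follows. The contrastive term follows the standard supervised contrastive formulation; the temperature of 0.1, the weighting factor `weight`, and the function names are assumptions for illustration, and the AM-Softmax value is passed in as a precomputed number to keep the example short.

```python
import numpy as np

def supervised_contrastive_loss(z, labels, temperature=0.1):
    """Supervised contrastive loss over one batch of embeddings."""
    labels = np.asarray(labels)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    sim = z @ z.T / temperature
    not_self = ~np.eye(len(labels), dtype=bool)
    masked = np.where(not_self, sim, -np.inf)          # exclude each anchor itself
    row_max = masked.max(axis=1, keepdims=True)
    log_denom = row_max + np.log(np.exp(masked - row_max).sum(axis=1, keepdims=True))
    log_prob = sim - log_denom
    positives = (labels[:, None] == labels[None, :]) & not_self
    pos_count = positives.sum(axis=1)
    valid = pos_count > 0                              # skip anchors without positives
    per_anchor = np.where(positives, log_prob, 0.0).sum(axis=1)[valid] / pos_count[valid]
    return -per_anchor.mean()

def mfcon_total_loss(am_softmax_value, block_embeddings, labels, weight=1.0):
    """AM-Softmax on the final embedding plus the block-averaged contrastive loss."""
    contrastive = np.mean([supervised_contrastive_loss(z, labels)
                           for z in block_embeddings])
    return am_softmax_value + weight * contrastive

# Toy check: embeddings that agree with the labels should give a lower loss.
z = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
good = supervised_contrastive_loss(z, [0, 0, 1, 1])
bad = supervised_contrastive_loss(z, [0, 1, 0, 1])
total = mfcon_total_loss(0.5, [z, z], [0, 0, 1, 1])
print(good < bad)
```

Each entry of `block_embeddings` corresponds to one block's pooled and projected output, which matches the reported finding that per-block pooling and projection work better than shared parameters.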
6. Analytic and Ablation Results
Key ablation outcomes on VoxCeleb1-O or FAD-clean:
- Removing multi-scale aggregation nearly doubles EER (+94%).
- Removing the convolution module results in >100% EER increase.
- Substituting Conformer blocks with Transformer blocks (removing convolution) degrades EER from 1.29% to 2.50% on SITW.Dev.
- For anti-spoofing, concatenating all block outputs (16 in total) yields 20–30% lower pooled EER than using only the final block, indicating diverse depthwise representations capture complementary cues (Wang et al., 2023).
7. Future Research Directions
Anticipated advancements include:
- Streaming-capable and causal MFA-Conformer variants for low-latency inference.
- Integration of domain-adversarial adaptation to bolster robustness across deployment conditions.
- Exploration of quantized and lightweight models for edge/on-device deployment in ASV and anti-spoofing.
- Further research on fusion with raw waveform models, leveraging attack-wise error tendency (ET) statistics to guide model combination for forensic or adversarial contexts (Zhang et al., 2022, Wang et al., 2023).
The MFA-Conformer paradigm establishes a principled multi-scale sequence modeling approach for audio, validated across speaker verification and anti-spoofing tasks, and motivates continued investigation into depthwise aggregation and intermediate supervision for sequential embedding extraction (Zhang et al., 2022, Wang et al., 2023, Dixit et al., 2024).