MFA-Conformer: Multi-scale Feature Aggregation

Updated 13 May 2026
  • The paper demonstrates that systematic multi-scale feature aggregation significantly improves ASV and anti-spoofing performance with reduced EER compared to prior models.
  • The methodology integrates a Conformer encoder using convolutional subsampling and concatenates intermediate block outputs to capture both local and global cues.
  • Empirical results reveal up to a 21% relative EER reduction and faster inference, validating the efficiency and robustness of the MFA-Conformer design.

The Multi-scale Feature Aggregation Conformer (MFA-Conformer) is an architectural framework for sequential representation learning that employs Convolution-augmented Transformer (Conformer) blocks with explicit multi-scale feature aggregation. Distinguished by its systematic concatenation of intermediate outputs from all Conformer blocks prior to pooling and downstream classification, the MFA-Conformer delivers empirically validated gains in tasks that require robust extraction of local and global cues, including automatic speaker verification (ASV) and anti-spoofing countermeasures. The model's design facilitates efficient joint modeling of varying temporal context ranges, leading to superior recognition performance and efficiency compared to popular architectures such as ECAPA-TDNN and baseline Conformer systems (Zhang et al., 2022, Wang et al., 2023, Dixit et al., 2024).

1. Core Architectural Components

MFA-Conformer is built on the processing pipeline: input features → convolutional subsampling → a stack of Conformer blocks → multi-scale feature concatenation → pooling → linear projection for embedding.

  • Input Preprocessing: Spoken utterances are transformed into sequences of 80-dimensional log-Mel filterbank (FBANK) features, typically with a 25 ms window and 10 ms hop (Zhang et al., 2022, Dixit et al., 2024, Wang et al., 2023).
  • Convolutional Subsampling Layer: Two-dimensional (or sometimes one-dimensional) strided convolutions reduce temporal and/or frequency resolution to mitigate computational cost. For instance, a (2,2) strided 2D convolution halves the time and frequency dimensions (Zhang et al., 2022).
  • Conformer Encoder Stack: A sequence of $L$ Conformer blocks, each of dimension $d$, jointly models global (self-attention) and local (convolution module) dependencies. Typical MFA-Conformer configurations use $L=6$ (ASV) or $L=16$ (countermeasure) blocks (Zhang et al., 2022, Wang et al., 2023).
  • Multi-scale Feature Aggregation: Instead of outputting only from the last block, MFA-Conformer concatenates all block outputs (post-layernorm) along the feature/channel dimension to form a dense multi-scale representation. This captures feature hierarchies for downstream pooling and projection (Zhang et al., 2022, Wang et al., 2023, Dixit et al., 2024).
  • Pooling and Projection Head: The aggregated feature map undergoes attentive statistics pooling (ASP), yielding the weighted mean and standard deviation across time frames, followed by a fully connected layer (with batch normalization) to realize a fixed-dimensional speaker or artifact embedding (Zhang et al., 2022, Dixit et al., 2024, Wang et al., 2023).
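The pipeline above can be traced as a sequence of tensor shapes. The following is a minimal sketch, not the authors' implementation: the 16 kHz sample rate, the block dimension `d_model=256`, the embedding size `embed_dim=192`, and the padding behavior of the subsampling layer are all assumptions chosen for illustration, and `MFAConfig` and `trace_shapes` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class MFAConfig:
    sample_rate: int = 16000   # assumed; not stated in the text
    win_ms: float = 25.0       # window length from the text
    hop_ms: float = 10.0       # hop length from the text
    n_mels: int = 80           # FBANK dimension from the text
    time_stride: int = 2       # 1/2 temporal subsampling
    d_model: int = 256         # hypothetical block dimension d
    num_blocks: int = 6        # L = 6 (ASV configuration)
    embed_dim: int = 192       # hypothetical embedding size


def trace_shapes(num_samples: int, cfg: MFAConfig) -> dict:
    """Return (channels, frames) shapes after each pipeline stage."""
    win = int(cfg.sample_rate * cfg.win_ms / 1000)
    hop = int(cfg.sample_rate * cfg.hop_ms / 1000)
    frames = 1 + (num_samples - win) // hop        # framing without padding (assumed)
    sub_frames = -(-frames // cfg.time_stride)      # ceil division; padded strided conv assumed
    agg_dim = cfg.d_model * cfg.num_blocks          # concatenation of all L block outputs
    return {
        "fbank": (cfg.n_mels, frames),
        "subsampled": (cfg.d_model, sub_frames),
        "aggregated": (agg_dim, sub_frames),
        "pooled": (2 * agg_dim,),                   # weighted mean and std, stacked
        "embedding": (cfg.embed_dim,),
    }


shapes = trace_shapes(num_samples=48000, cfg=MFAConfig())  # 3 s of audio at 16 kHz
for stage, shape in shapes.items():
    print(f"{stage:>10}: {shape}")
```

Under these assumptions the aggregated feature map grows linearly with the number of blocks, which is why the pooling head operates on a much wider input than a last-block-only design would.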

2. Detailed Module Specifications

Conformer Block Construction

Each Conformer block consists of:
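
In the standard Conformer formulation, assumed here to carry over unchanged, these are two half-step feed-forward modules (FFN) sandwiching a multi-head self-attention module (MHSA) and a convolution module (Conv), each with a residual connection, followed by a final layer normalization. For block input $x_i$, with $\tilde{x}_i$, $x'_i$, and $x''_i$ as illustrative intermediate names:

```latex
\begin{aligned}
\tilde{x}_i &= x_i + \tfrac{1}{2}\,\mathrm{FFN}(x_i) \\
x'_i &= \tilde{x}_i + \mathrm{MHSA}(\tilde{x}_i) \\
x''_i &= x'_i + \mathrm{Conv}(x'_i) \\
h_i &= \mathrm{LayerNorm}\!\left(x''_i + \tfrac{1}{2}\,\mathrm{FFN}(x''_i)\right)
\end{aligned}
```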

Multi-scale Aggregation Formulation

Let $h_i \in \mathbb{R}^{d \times T'}$ denote the output from the $i$-th block. The multi-scale aggregation step is:

$$F_{\text{agg}} = \operatorname{Concat}(h_1, h_2, \dots, h_L) \in \mathbb{R}^{(d \cdot L) \times T'}$$

followed by layer normalization and ASP pooling. Segment-level embedding vectors are produced by concatenating the mean and standard deviation across the time axis:

$$e = [\tilde \mu; \tilde \sigma] \in \mathbb{R}^{2 \cdot (dL)}$$

where $\tilde \mu$ and $\tilde \sigma$ are computed using learned attention weights over frames (Zhang et al., 2022, Dixit et al., 2024).
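
The aggregation and pooling steps can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions rather than the published implementation: the attention parameters `W`, `b`, `v` are random stand-ins for learned weights, the layer normalization has no learned scale or shift, and the sizes (`d=4`, `L=6`, `T=50`, `hidden=8`) are arbitrary.

```python
import numpy as np


def layer_norm(x, eps=1e-5):
    # Normalize each frame (column) over the channel axis; no learned scale/shift.
    mean = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def attentive_stats_pool(feats, W, b, v, eps=1e-8):
    """feats: (C, T'). Returns [weighted mean; weighted std] of shape (2C,)."""
    scores = v @ np.tanh(W @ feats + b[:, None])   # one scalar score per frame, shape (T',)
    scores = scores - scores.max()                  # shift for numerical stability
    alpha = np.exp(scores) / np.exp(scores).sum()   # attention weights over frames
    mu = feats @ alpha                              # weighted mean, shape (C,)
    var = (feats ** 2) @ alpha - mu ** 2            # weighted variance
    sigma = np.sqrt(np.clip(var, eps, None))
    return np.concatenate([mu, sigma])


def mfa_embed(block_outputs, W, b, v):
    # Concatenate all block outputs along the channel axis: (d*L, T').
    f_agg = np.concatenate(block_outputs, axis=0)
    return attentive_stats_pool(layer_norm(f_agg), W, b, v)


rng = np.random.default_rng(0)
d, L, T, hidden = 4, 6, 50, 8                       # hypothetical sizes
blocks = [rng.standard_normal((d, T)) for _ in range(L)]
W = rng.standard_normal((hidden, d * L)) * 0.1
b = np.zeros(hidden)
v = rng.standard_normal(hidden)
e = mfa_embed(blocks, W, b, v)
print(e.shape)  # (48,)
```

With all-zero attention parameters the weights become uniform and the pooled vector reduces to the plain per-channel mean and standard deviation, which is a convenient sanity check.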

Embedding Projection and Loss Functions

The action of the pooling head and projection can be summarized as:
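
A sketch using the standard attentive statistics pooling formulation, assumed here to match the head described in Section 1. The symbols $W$, $b$, and $v$ are illustrative attention parameters, $f_t$ is the $t$-th frame (column) of the normalized $F_{\text{agg}}$, and $\mathrm{Proj}$ stands for the fully connected layer with batch normalization, whose exact ordering is not specified in the text:

```latex
\begin{aligned}
\alpha_t &= \frac{\exp\!\left(v^{\top}\tanh(W f_t + b)\right)}
                 {\sum_{\tau=1}^{T'} \exp\!\left(v^{\top}\tanh(W f_\tau + b)\right)} \\
\tilde{\mu} &= \sum_{t=1}^{T'} \alpha_t\, f_t \\
\tilde{\sigma} &= \sqrt{\sum_{t=1}^{T'} \alpha_t\, f_t \odot f_t \;-\; \tilde{\mu} \odot \tilde{\mu}} \\
e &= [\tilde{\mu}; \tilde{\sigma}] \in \mathbb{R}^{2 \cdot (dL)} \\
x &= \mathrm{Proj}(e)
\end{aligned}
```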

3. Empirical Performance and Model Comparison

Across ASV and anti-spoofing tasks, MFA-Conformer achieves state-of-the-art or near state-of-the-art performance with competitive parameter counts and faster inference than ECAPA-TDNN.

| Model/System | #Params | EER (VoxCeleb1-O, %) | EER (FAD-Clean SEEN/UNSEEN, %) | Inference RT Factor |
|---|---|---|---|---|
| ResNet34 | 23.2M | 1.03 | - | 0.0088 |
| ECAPA-TDNN | 20.8M | 0.82 | - | 0.0180 |
| MFA-Conformer (1/2) | 20.5M | 0.64 | - | 0.0121 |
| MFA-Conformer (ASR-pre) | 13M | - | 0.05 / 27.32 | - |
| MFA-Conformer (ASV-pre) | 13M | - | 0.14 / 25.62 | - |
| MFA-Conformer + MFCon loss | - | 2.41† | - | - |

†Best result with combined contrastive supervision and AM-Softmax (Dixit et al., 2024).

On ASV benchmarks, MFA-Conformer with 1/2 subsampling exhibits a 21% relative equal error rate (EER) reduction versus ECAPA-TDNN, while being 32% faster in inference (Zhang et al., 2022). For anti-spoofing, MFA-Conformer with ASR pretraining attains 0.05% EER (SEEN) and 27.32% EER (UNSEEN) on FAD-clean, outperforming baseline SSL and CNN models (Wang et al., 2023).
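
The relative figures can be checked against the table with a few lines of arithmetic. The sketch below uses only the EER and real-time-factor values listed above; `relative_reduction` is a hypothetical helper name.

```python
def relative_reduction(baseline: float, improved: float) -> float:
    """Fractional reduction of `improved` relative to `baseline`."""
    return (baseline - improved) / baseline


eer_gain = relative_reduction(0.82, 0.64)      # ECAPA-TDNN vs MFA-Conformer (1/2), EER in %
rtf_gain = relative_reduction(0.0180, 0.0121)  # same pair, inference real-time factor
print(f"EER: {eer_gain:.1%}, RTF: {rtf_gain:.1%}")  # EER: 22.0%, RTF: 32.8%
```

The table values give roughly 22% and 33%, close to the 21% and 32% quoted in the text; the small gap presumably reflects rounding in the reported figures.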

Ablation studies consistently show that removing multi-scale aggregation or the convolution module causes the greatest EER increase, confirming their essential contributions (Zhang et al., 2022, Wang et al., 2023).

4. Application Domains and Transfer Strategies

While originally proposed for speaker verification, the MFA-Conformer has been adapted for anti-spoofing countermeasures and demonstrates robust cross-lingual generalization (Wang et al., 2023). In this context:

  • Pretraining the Conformer encoder with CTC-based ASR or AAM-Softmax-based ASV improves downstream anti-spoofing robustness.
  • Fine-tuning proceeds via a two-stage strategy: freezing the encoder to train the classifier head, then joint training, stabilizing convergence for small target datasets.
  • The model achieves competitive error rates across multiple corpora, including FAD (Chinese) and ASVspoof-2019/2021 (English) (Wang et al., 2023).
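
The two-stage fine-tuning strategy amounts to a schedule over which parameter groups are updated. The following framework-free sketch illustrates the idea; the function name, the group names `"encoder"` and `"head"`, and the choice of two head-only epochs are assumptions, not values from the cited work.

```python
def trainable_groups(epoch: int, head_only_epochs: int) -> set:
    """Return which parameter groups receive gradient updates at `epoch` (0-based)."""
    if epoch < head_only_epochs:
        return {"head"}             # stage 1: pretrained encoder frozen
    return {"encoder", "head"}      # stage 2: joint training


for epoch in range(5):
    print(epoch, sorted(trainable_groups(epoch, head_only_epochs=2)))
```

In a PyTorch-style training loop this would typically correspond to toggling `requires_grad` on the encoder parameters at the stage boundary.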

5. Recent Extensions: Contrastive Multi-Scale Supervision

Building on the foundation established by multi-scale aggregation, subsequent work has introduced multi-scale feature contrastive (MFCon) loss functions to further enhance intermediate feature discriminability (Dixit et al., 2024). This involves:

  • Applying supervised contrastive loss to the blockwise embeddings obtained after pooling.
  • The total loss becomes a combination of AM-Softmax applied to the final embedding and averaged supervised contrastive losses across all blocks.
  • Ablations show that separate pooling and projection for each block achieves the best EER; sharing parameters degrades performance.
  • The combined use of MFCon and AM-Softmax on the concatenated speaker embedding yields an additional 9.05% relative EER improvement compared to MFA-Conformer baseline on VoxCeleb1-O.
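
The combined objective can be sketched in NumPy as AM-Softmax on the concatenated embedding plus the average supervised contrastive loss over blockwise embeddings. This is an illustrative sketch, not the published implementation: the weighting factor `con_weight`, the `scale`, `margin`, and `temperature` values, and all tensor sizes are assumptions, and the supervised contrastive term follows the commonly used formulation with in-batch positives.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)


def log_softmax(logits):
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))


def am_softmax_loss(emb, class_weights, labels, scale=30.0, margin=0.2):
    """Additive-margin softmax on cosine logits. emb: (N, D), class_weights: (C, D)."""
    cos = l2_normalize(emb) @ l2_normalize(class_weights).T
    idx = np.arange(len(labels))
    cos[idx, labels] -= margin                       # margin on the target class only
    return -log_softmax(scale * cos)[idx, labels].mean()


def sup_con_loss(emb, labels, temperature=0.1):
    """Supervised contrastive loss over one batch. emb: (N, D), labels: (N,)."""
    z = l2_normalize(emb)
    not_self = ~np.eye(len(labels), dtype=bool)
    positives = (labels[:, None] == labels[None, :]) & not_self
    sim = np.where(not_self, z @ z.T / temperature, -np.inf)  # exclude self-pairs
    logp = log_softmax(sim)
    pos_count = positives.sum(axis=1)
    valid = pos_count > 0                            # anchors with at least one positive
    if not valid.any():
        return 0.0
    per_anchor = -np.where(positives, logp, 0.0).sum(axis=1)
    return (per_anchor[valid] / pos_count[valid]).mean()


def mfcon_total_loss(final_emb, block_embs, class_weights, labels, con_weight=1.0):
    # Average the contrastive term over blocks, then add the classification term.
    con = np.mean([sup_con_loss(e, labels) for e in block_embs])
    return am_softmax_loss(final_emb, class_weights, labels) + con_weight * con


rng = np.random.default_rng(0)
labels = np.array([0, 0, 1, 1, 2, 2])
block_embs = [rng.standard_normal((6, 16)) for _ in range(3)]   # 3 blocks, hypothetical sizes
final_emb = np.concatenate(block_embs, axis=1)                   # concatenated speaker embedding
class_weights = rng.standard_normal((3, final_emb.shape[1]))
loss = mfcon_total_loss(final_emb, block_embs, class_weights, labels)
print(float(loss))
```

Each block embedding here would come from its own pooling and projection head, consistent with the ablation finding that separate heads outperform shared ones.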

6. Analytic and Ablation Results

Key ablation outcomes on VoxCeleb1-O or FAD-clean:

  • Removing multi-scale aggregation nearly doubles EER (+94% relative).
  • Removing the convolution module more than doubles EER (>100% relative increase).
  • Substituting Conformer blocks with Transformer blocks (removing convolution) degrades EER from 1.29% to 2.50% on SITW.Dev.
  • For anti-spoofing, concatenating all block outputs (16 in total) yields 20–30% lower pooled EER than using only the final block, indicating diverse depthwise representations capture complementary cues (Wang et al., 2023).

7. Future Research Directions

Anticipated advancements include:

  • Streaming-capable and causal MFA-Conformer variants for low-latency inference.
  • Integration of domain-adversarial adaptation to bolster robustness across deployment conditions.
  • Exploration of quantized and lightweight models for edge/on-device deployment in ASV and anti-spoofing.
  • Further research on fusion with raw waveform models, leveraging attack-wise error tendency (ET) statistics to guide model combination for forensic or adversarial contexts (Zhang et al., 2022, Wang et al., 2023).

The MFA-Conformer paradigm establishes a principled multi-scale sequence modeling approach for audio, validated across speaker verification and anti-spoofing tasks, and motivates continued investigation into depthwise aggregation and intermediate supervision for sequential embedding extraction (Zhang et al., 2022, Wang et al., 2023, Dixit et al., 2024).
