Conformer Blocks in Deep Neural Networks

Updated 26 May 2026

Conformer blocks are neural network modules that integrate convolutional operations with multi-head self-attention to model both local and global dependencies in speech and audio processing.
They employ macaron-style dual feed-forward networks with pre-layer normalization and residual connections to ensure stable deep stacking.
Their design has led to state-of-the-art gains in ASR, speaker verification, and speech enhancement, balancing efficiency and performance.

A Conformer block is a neural network architecture that integrates convolutional operations with multi-head self-attention in a Transformer-style deep sequence model. Originally introduced for speech processing, the Conformer block enables joint modeling of local (via convolution) and global (via self-attention) dependencies, with macaron-style feed-forward networks providing additional nonlinear transformation. It is the canonical backbone of recent state-of-the-art systems in automatic speech recognition (ASR), speaker verification, speech enhancement, and source separation.

1. Block Composition and Sub-layer Structure

The canonical Conformer block, as established in (Guo et al., 2020, Sinha et al., 2022, Cao et al., 2022), exhibits the following sequential structure, with pre-normalization and residual connections at each sub-layer:

First half-step Feed-Forward (FFN), scaled by ½ and wrapped in a residual connection.
Multi-Head Self-Attention (MHSA), optionally incorporating relative positional encoding.
Convolutional Module: pointwise (1×1) convolution for expansion, Gated Linear Unit (GLU) gating, depthwise convolution (typically kernel size 15–31), batch normalization, nonlinearity (Swish), and pointwise projection back to the original dimension.
Second half-step Feed-Forward (FFN), again scaled by ½ with residual.
Final Layer Normalization.

The mathematical sequence for each block is:

$\begin{aligned} & X_1 = X_0 + \tfrac{1}{2} \cdot \mathrm{Dropout}(\mathrm{FFN}(\mathrm{LayerNorm}(X_0))) \\ & X_2 = X_1 + \mathrm{Dropout}(\mathrm{MHSA}(\mathrm{LayerNorm}(X_1))) \\ & X_3 = X_2 + \mathrm{Dropout}(\mathrm{ConvModule}(\mathrm{LayerNorm}(X_2))) \\ & X_4 = X_3 + \tfrac{1}{2} \cdot \mathrm{Dropout}(\mathrm{FFN}(\mathrm{LayerNorm}(X_3))) \\ & Y = \mathrm{LayerNorm}(X_4) \end{aligned}$

This arrangement enables effective stacking (typical block counts range from 4 to 24) without destabilizing training.
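
The five-step sequence above can be sketched in NumPy as follows. This is a minimal illustration, not a reference implementation: the sub-layers are passed in as plain callables, dropout is omitted (inference-time behaviour), LayerNorm has no learned gain or bias, and the function names `layer_norm` and `conformer_block` are chosen here for illustration.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each frame (last axis) to zero mean and unit variance.
    Learned gain/bias are left out to keep the sketch short."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def conformer_block(x0, ffn1, mhsa, conv_module, ffn2):
    """Apply the pre-norm residual sequence X_0 -> Y. Dropout is omitted."""
    x1 = x0 + 0.5 * ffn1(layer_norm(x0))     # first half-step FFN
    x2 = x1 + mhsa(layer_norm(x1))           # self-attention
    x3 = x2 + conv_module(layer_norm(x2))    # convolution module
    x4 = x3 + 0.5 * ffn2(layer_norm(x3))     # second half-step FFN
    return layer_norm(x4)                    # final LayerNorm

# Usage with a stand-in sub-layer (one shared random linear map, purely illustrative).
rng = np.random.default_rng(0)
T, D = 50, 16                                # frames, model dimension (arbitrary)
W = rng.standard_normal((D, D)) * 0.1
linear = lambda h: h @ W
x = rng.standard_normal((T, D))
y = conformer_block(x, linear, linear, linear, linear)
print(y.shape)  # (50, 16)
```

Because every sub-layer only adds to the residual stream, replacing all four callables with a zero function reduces the block to a single LayerNorm of the input, which is a convenient sanity check when wiring up real sub-layers.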

2. Mathematical Formulation of Core Sub-layers

Feed-Forward Network (FFN):
- Position-wise, two-layer, with Swish activation: $\textrm{Swish}(x) = x \cdot \sigma(x)$.
- Expansion factor typically 4× (e.g., $D=512 \rightarrow D_{\textrm{ff}}=2048$).
Multi-Head Self-Attention (MHSA):
- Computes:
$\textrm{head}_i = \textrm{Softmax} \left( \frac{Q_i K_i^\top}{\sqrt{d_k}} \right) V_i$

with $Q_i = XW^Q_i$, $K_i = XW^K_i$, $V_i = XW^V_i$.
- Output: concatenate heads, project to model dimension via $W^O$.
- Head count commonly $h=4$ or $h=8$.
- Can use relative positional encoding as in (Zhang et al., 2022, Guo et al., 2020).
Convolutional Module:
- Pointwise 1×1 convolution: input $D$ channels, output $2D$ channels, then GLU gating (which halves the channel count back to $D$).
- Depthwise convolution: large kernel (typically 15 or 31), optionally dilated in advanced variants (Koizumi et al., 2021).
- Batch normalization, then nonlinearity (Swish), projection back via 1×1 conv, optional dropout.
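
A minimal NumPy sketch of the convolutional module described in the list above, under stated assumptions: the pointwise expansion doubles the channel count so that GLU returns it to the model dimension, batch normalization is applied in inference mode with given running statistics, dropout is omitted, and all function and parameter names are illustrative rather than taken from any cited implementation. A 1×1 convolution over a (frames, channels) sequence is written as a per-frame matrix product.

```python
import numpy as np

def swish(x):
    return x / (1.0 + np.exp(-x))            # x * sigmoid(x)

def glu(x):
    """Gated Linear Unit: split channels in half, gate the first half
    with the sigmoid of the second half."""
    a, b = np.split(x, 2, axis=-1)
    return a / (1.0 + np.exp(-b))            # a * sigmoid(b)

def depthwise_conv1d(x, kernels):
    """x: (T, D); kernels: (K, D), one filter per channel.
    Symmetric zero padding keeps the length at T (K assumed odd)."""
    K, T = kernels.shape[0], x.shape[0]
    pad = K // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    out = np.zeros_like(x)
    for k in range(K):
        out += xp[k:k + T] * kernels[k]      # each channel uses only its own filter
    return out

def conv_module(x, w_in, dw_kernels, w_out, running_mean, running_var, eps=1e-5):
    h = x @ w_in                             # pointwise conv: D -> 2D
    h = glu(h)                               # gating: 2D -> D
    h = depthwise_conv1d(h, dw_kernels)      # local context along time
    h = (h - running_mean) / np.sqrt(running_var + eps)  # batch norm (inference)
    h = swish(h)
    return h @ w_out                         # pointwise conv: D -> D

rng = np.random.default_rng(0)
T, D, K = 40, 8, 15                          # illustrative sizes
x = rng.standard_normal((T, D))
out = conv_module(
    x,
    w_in=rng.standard_normal((D, 2 * D)) * 0.1,
    dw_kernels=rng.standard_normal((K, D)) * 0.1,
    w_out=rng.standard_normal((D, D)) * 0.1,
    running_mean=np.zeros(D),
    running_var=np.ones(D),
)
print(out.shape)  # (40, 8)
```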

All sub-modules are preceded by layer normalization (pre-norm design) as found necessary for deep stacking and training stability.
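
The attention formula in this section can be sketched in NumPy as below. This is an illustrative sketch only: it uses plain scaled dot-product attention without relative positional encoding or masking, and it stores the per-head projections $W^Q_i$, $W^K_i$, $W^V_i$ as column blocks of single $D \times D$ matrices, which is a common but assumed layout.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)    # subtract row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def mhsa(x, w_q, w_k, w_v, w_o, num_heads):
    """x: (T, D); w_q, w_k, w_v, w_o: (D, D). Returns (T, D)."""
    T, D = x.shape
    d_k = D // num_heads

    def split_heads(m):                      # (T, D) -> (h, T, d_k)
        return m.reshape(T, num_heads, d_k).transpose(1, 0, 2)

    q, k, v = split_heads(x @ w_q), split_heads(x @ w_k), split_heads(x @ w_v)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_k)   # (h, T, T)
    heads = softmax(scores) @ v                         # (h, T, d_k)
    concat = heads.transpose(1, 0, 2).reshape(T, D)     # concatenate heads
    return concat @ w_o                                 # output projection W^O

rng = np.random.default_rng(0)
T, D, h = 30, 16, 4                          # illustrative sizes
x = rng.standard_normal((T, D))
w_q, w_k, w_v, w_o = (rng.standard_normal((D, D)) * 0.1 for _ in range(4))
print(mhsa(x, w_q, w_k, w_v, w_o, h).shape)  # (30, 16)
```

The `(h, T, T)` score tensor is the quadratic-cost term: its size grows with the square of the number of frames.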

3. Design Rationales and Architectural Variants

The Conformer block was designed to capture:

Local Context: Depthwise convolutions efficiently encode short-range, fine-grained sequential dependencies (e.g., phonetics, micro-temporal cues).
Global Context: MHSA modules model long-range structural and semantic dependencies.
Nonlinear Re-mixing: Macaron-style dual FFNs before and after context modules enrich the expressive capacity.
Stable Stacking: LayerNorm and residuals on every sub-layer counteract vanishing/exploding gradients and enable deep networks.

Variants are motivated by efficiency and task fit:

Variant	Modification	Typical Setting
Vanilla Conformer (Guo et al., 2020)	Standard block; softmax MHSA; depthwise conv	ASR, SV, enhancement
DF-Conformer (Koizumi et al., 2021)	FAVOR+ linear-attention; dilated convs	Speech enhancement
Darts-Conformer (Shi et al., 2021)	NAS-discovered cell: composition and connections learned	End-to-end ASR
MFA-Conformer (Zhang et al., 2022)	Convolutional subsampling; multi-scale aggregation	Speaker verification
Deep Sparse Conformer (Wu, 2022)	ProbSparse attention; DeepNorm scaling for up to 100 layers	Long-sequence ASR
Practical Conformer (Botros et al., 2023)	Lower conv-only blocks; RNN-Attention-Performer; downsized	On-device/cloud ASR
CMGAN, DCUC-Net (Cao et al., 2022, Ahmed et al., 2023)	2-stage or multimodal (audio–visual) input handling	TF SE, audio-visual SE

4. Implementation Details and Hyperparameter Choices

Typical configuration parameters found in published work:

Model/attention dimension ($D$): 256–512.
Expansion factor in FFN: 4.
#Heads: 4 (256-dim) or 8 (512-dim); head size $d_k = D/h = 64$.
Convolution kernel size: 15–31 (non-causal for ASR, causal for streaming), depthwise with per-channel separation.
Residual scaling: FFN outputs scaled by ½ for stability.
Dropout: 0.1.
Convolutional subsampling front-end: 2D convolution, stride=2–4, channels=256.
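
As an illustration, the settings listed above could be collected in a small configuration object. The class `ConformerConfig` and its field names are hypothetical, not taken from any cited codebase; the defaults mirror the values in the list, and `num_blocks=12` is an arbitrary pick from the 4 to 24 range mentioned earlier.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ConformerConfig:
    d_model: int = 256               # model/attention dimension D
    num_heads: int = 4               # 4 for D=256, 8 for D=512
    ffn_expansion: int = 4           # D_ff = 4 * D
    conv_kernel_size: int = 31       # depthwise kernel, typically 15-31
    causal_conv: bool = False        # True for streaming use
    ffn_residual_scale: float = 0.5  # half-step FFN residual
    dropout: float = 0.1
    num_blocks: int = 12             # arbitrary choice within the typical 4-24 range

    def __post_init__(self):
        if self.d_model % self.num_heads != 0:
            raise ValueError("d_model must be divisible by num_heads")

    @property
    def head_dim(self) -> int:
        return self.d_model // self.num_heads

    @property
    def d_ff(self) -> int:
        return self.ffn_expansion * self.d_model

small = ConformerConfig()
large = ConformerConfig(d_model=512, num_heads=8)
print(small.head_dim, large.head_dim, large.d_ff)  # 64 64 2048
```

Both listed head counts give the same per-head size of 64, which is why the head count scales with the model dimension.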

Some advanced variants use dilated convs (receptive field grows exponentially, e.g., DF-Conformer), mixtures of conv and attention (Practical Conformer), or replace softmax-based MHSA with linear-complexity approximations (FAVOR+, Performer).
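
The exponential growth of the receptive field with dilation can be checked with a short calculation. For stride-1 convolutions with kernel size $K$ and dilations $d_1, \dots, d_L$, the receptive field is $1 + (K-1)\sum_i d_i$. The kernel size 3 and the power-of-two dilation schedule below are assumptions for illustration, not the published DF-Conformer settings.

```python
def receptive_field(kernel_size, dilations):
    """Receptive field (in frames) of stacked stride-1 1-D convolutions."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)

num_layers, K = 4, 3                                  # illustrative values
plain = receptive_field(K, [1] * num_layers)          # no dilation
dilated = receptive_field(K, [2 ** i for i in range(num_layers)])  # 1, 2, 4, 8
print(plain, dilated)  # 9 31
```

Without dilation the receptive field grows linearly in the number of layers; with doubling dilations it grows as $1 + (K-1)(2^L - 1)$ at the same parameter count.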

5. Application Context and Task-Specific Adaptations

Conformer blocks are extensively applied in tasks with structured sequential data:

Speech Recognition: Main encoder backbone in end-to-end ASR (Guo et al., 2020, Shi et al., 2021, Botros et al., 2023), or a drop-in replacement for specialized encoders (e.g., Conv-TasNet upgrades (Koizumi et al., 2021)).
Speaker Verification/Extraction: Blocks support flexible speaker conditioning at each layer, via embedding concatenation (Sinha et al., 2022, Zhang et al., 2022).
Speech Enhancement/Separation: Used to capture both cross-channel and local context (Cao et al., 2022, Ahmed et al., 2023).
Audio-Visual Modeling: Input fusion and multimodal attention for audio-visual speech enhancement, with concatenated feature streams as conformer input (Ahmed et al., 2023).
Efficient On-Device Deployment: Block variants combine convolution-only layers or linear attention for memory and speed constraints (Botros et al., 2023).

Task-specific architectural variations (e.g., sequence directionality, kernel size, scaling) are adjusted to meet domain and latency requirements.
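
One concrete case of adjusting directionality for latency is the padding of the depthwise convolution. The sketch below, with illustrative names and sizes, contrasts symmetric (non-causal) padding with left-only (causal) padding and checks that a change in future frames leaves earlier causal outputs untouched.

```python
import numpy as np

def depthwise_conv1d(x, kernels, causal=False):
    """x: (T, D); kernels: (K, D). Output length is always T.
    causal=True pads only on the left, so frame t sees frames <= t."""
    K, T = kernels.shape[0], x.shape[0]
    left = K - 1 if causal else (K - 1) // 2
    right = (K - 1) - left
    xp = np.pad(x, ((left, right), (0, 0)))
    return sum(xp[k:k + T] * kernels[k] for k in range(K))

T, D, K = 10, 2, 5                       # illustrative sizes
kernels = np.ones((K, D))
x = np.zeros((T, D))
x_future = x.copy()
x_future[6:] = 1.0                       # perturb frames 6..9 only

# Compare outputs for frames 0..5, which precede the perturbation.
causal_unchanged = np.allclose(
    depthwise_conv1d(x, kernels, causal=True)[:6],
    depthwise_conv1d(x_future, kernels, causal=True)[:6],
)
noncausal_unchanged = np.allclose(
    depthwise_conv1d(x, kernels)[:6],
    depthwise_conv1d(x_future, kernels)[:6],
)
print(causal_unchanged, noncausal_unchanged)  # True False
```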

6. Empirical Performance and Ablation Studies

Conformer blocks consistently yield significant empirical gains over standard Transformer or CNN-only models:

SI-SDR improvement: TCN-Conformer block raised target speaker extraction SI-SDR by +2.64 dB over Conformer-FFN on 2-mix (Sinha et al., 2022).
Speed/accuracy trade-off: Practical Conformer design yielded 6.8× speedup with <1.2% absolute WER increase for on-device ASR (Botros et al., 2023).
Speaker Verification: MFA-Conformer achieved EER=0.64%, outperforming ECAPA-TDNN (Zhang et al., 2022).
Ablations: The convolution module, macaron FFN, and multi-scale aggregation all yielded 10–55% relative improvements in EER, confirming the centrality of each module (Zhang et al., 2022).
Scalability: DeepNorm scaling allowed stacking to 100 layers with stable convergence (Wu, 2022).

7. Efficiency, Scaling, and Recent Innovations

Computational cost: The baseline block is bottlenecked by MHSA ($O(T^2)$ in sequence length $T$); variants with linear-complexity attention (FAVOR+, Performer) yield $O(T)$ with minimal accuracy loss (Koizumi et al., 2021, Botros et al., 2023).
Memory: Largest states are in attention maps; conv-only or downscaled-FFN layers cut parameter count and runtime.
Scaling to depth: DeepNorm and similar strategies rescale residuals and insert layernorms post-residual, which enables stacking to extreme depths (up to 100 layers) with stable gradients and performance (Wu, 2022).
Architecture search: Darts-Conformer leverages NAS/DARTS to optimize sublayer ordering, connection, and parameterization, yielding architectures outperforming hand-crafted conformer stacks (Shi et al., 2021).
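
The idea behind linear-complexity attention can be sketched with a generic kernelized form. Note the assumption: the feature map used here is a simple deterministic one (`elu(x) + 1`), standing in for the random-feature construction of FAVOR+, which is not reproduced. The point is only that reordering the matrix products avoids forming the sequence-by-sequence attention map.

```python
import numpy as np

def feature_map(x):
    """elu(x) + 1: a positive feature map (stand-in, not FAVOR+ random features)."""
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))

def linear_attention(q, k, v):
    """Cost grows linearly with the number of frames T: no (T, T) matrix is formed."""
    phi_q, phi_k = feature_map(q), feature_map(k)
    kv = phi_k.T @ v                     # (d, d_v) summary of keys and values
    z = phi_k.sum(axis=0)                # (d,) normalizer summary
    return (phi_q @ kv) / (phi_q @ z)[:, None]

def quadratic_reference(q, k, v):
    """Same kernel attention computed the quadratic way, for comparison."""
    a = feature_map(q) @ feature_map(k).T            # (T, T) attention weights
    return (a / a.sum(axis=1, keepdims=True)) @ v

rng = np.random.default_rng(0)
T, d = 200, 16                           # illustrative sizes
q, k, v = (rng.standard_normal((T, d)) for _ in range(3))
same = np.allclose(linear_attention(q, k, v), quadratic_reference(q, k, v))
print(same)  # True
```

Both functions compute the same result; they differ only in the order of multiplication, which is what moves the cost from quadratic to linear in the sequence length.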

Taken together, the Conformer block’s modular but highly interleaved construction is central to modern sequence modeling architectures in speech and audio, balancing global and local pattern extraction, and enabling efficient, stable training at scale.