Conformer-Based Architecture

Updated 25 March 2026

A Conformer-based architecture is a hybrid neural network that combines convolutional modules with transformer self-attention to capture both local and global dependencies in sequential data.
Each block integrates a Macaron-style feed-forward network, multi-head self-attention with relative positional encoding, and a convolution module for efficient and stable feature extraction.
Empirical studies report state-of-the-art results in applications such as automatic speech recognition (ASR), translation, and vision, indicating that the design transfers well across domains.


Conformer-based architectures are hybrid neural networks integrating convolutional neural networks (CNNs) with transformer self-attention to model both local and global dependencies in sequential data. First introduced as an encoder for automatic speech recognition (ASR), the Conformer fuses Macaron-style (two half-step) feed-forward networks (FFN), multi-head self-attention (MHSA) with relative positional encoding, and a convolution module within each block, yielding parameter-efficient, robust sequence models. The architecture has since been widely adopted and extended for speech, language, vision, and multimodal applications, with state-of-the-art results reported across diverse tasks; this is attributed to its ability to capture fine-grained local structure and long-range dependencies within the same block (Gulati et al., 2020, Carvalho et al., 2023).

1. Core Architecture and Layer Structure

The canonical Conformer block applies the following sublayers in sequence, each wrapped with pre-norm residual connections:

First Feed-Forward Module (FFN1): A position-wise FFN with Swish or ReLU activation, applied with half-step residual scaling.
Multi-Head Self-Attention (MHSA): Scaled dot-product attention over the sequence with relative positional encoding. Each head computes:

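$Q_h = H W_h^Q, \quad K_h = H W_h^K, \quad V_h = H W_h^V$

$\mathrm{Attn}_h(H) = \mathrm{Softmax}\!\left(\frac{Q_h K_h^\top}{\sqrt{d_k}} + R\right)V_h,$

where $H \in \mathbb{R}^{T \times d}$ holds one row per time step, $d_k$ is the per-head key dimension, and $R \in \mathbb{R}^{T \times T}$ encodes relative positions. Outputs from all heads are concatenated and projected.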

Convolution Module: Incorporates local context through a stack:
- Pointwise conv → GLU → depthwise conv → batch normalization → Swish activation → pointwise conv.
- The GLU split/gate enables channel-wise control.
Second Feed-Forward Module (FFN2): Same as FFN1, again with half-step residual.
Final LayerNorm: Applied at the end of the block before passing to the next block or decoder.
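The per-head attention formula in the MHSA item above can be illustrated with a minimal, dependency-free Python sketch. It assumes the queries, keys, and values are already projected and stored as lists of rows (one row per time step). The helper `relative_bias` is hypothetical: it builds $R$ from a small lookup table indexed by clipped offset $j - i$, which is only one simple way to obtain an additive bias. The original Conformer uses a Transformer-XL-style relative encoding whose position term also depends on the query, which this sketch does not reproduce.

```python
import math

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_head(Q, K, V, R):
    """Compute Softmax(Q K^T / sqrt(d_k) + R) V for a single head.

    Q, K: T x d_k lists of rows; V: T x d_v; R: T x T additive bias.
    """
    T, d_k = len(Q), len(Q[0])
    scale = 1.0 / math.sqrt(d_k)
    out = []
    for i in range(T):
        scores = [
            scale * sum(q * k for q, k in zip(Q[i], K[j])) + R[i][j]
            for j in range(T)
        ]
        w = softmax(scores)
        out.append([sum(w[j] * V[j][c] for j in range(T))
                    for c in range(len(V[0]))])
    return out

def relative_bias(T, table, max_dist):
    """Hypothetical bias: R[i][j] = table[clip(j - i) + max_dist].

    `table` has 2 * max_dist + 1 entries, one per clipped offset.
    """
    return [
        [table[max(-max_dist, min(max_dist, j - i)) + max_dist]
         for j in range(T)]
        for i in range(T)
    ]
```

With an all-zero table the bias vanishes and the head reduces to plain scaled dot-product attention; a table that strongly favors offset 0 makes each position attend almost exclusively to itself.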

In encoder-decoder configurations, the decoder follows the Transformer pattern: masked MHSA over the target sequence, encoder-decoder attention, and two-layer FFN modules (Gulati et al., 2020, Guo et al., 2020, Carvalho et al., 2023).

2. Local and Global Feature Integration

The principal design insight of Conformer is the simultaneous modeling of local and global sequence features:

Self-Attention: Captures global (utterance- or sequence-level) dependencies, essential for context-aware recognition, translation, and generation tasks across arbitrary sequence lengths.
Convolution: Excels at modeling local (short-range, neighborhood) dependencies, enforcing smoothness and local consistency, crucial for capturing phonetic, spectral, or visual details.
Macaron Residuals: The sandwiching of self-attention and convolution between two half-step FFNs ensures both training stability (gradient flow) and representational flexibility (Gulati et al., 2020, Guo et al., 2020, Akram et al., 2025).
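The sandwich structure can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not a reference implementation: the four sublayers are passed in as hypothetical callables that map a sequence (a list of feature vectors) to a sequence of the same shape, dropout is omitted, layer normalization has no learnable gain or bias, and the `final_norm` flag exists only to make the residual arithmetic easy to inspect.

```python
def layer_norm(x, eps=1e-5):
    """Normalize one feature vector to zero mean and unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / (var + eps) ** 0.5 for v in x]

def residual(x, sublayer, scale=1.0):
    """Pre-norm residual: x + scale * sublayer(LayerNorm(x))."""
    y = sublayer([layer_norm(v) for v in x])
    return [[a + scale * b for a, b in zip(xa, yb)] for xa, yb in zip(x, y)]

def conformer_block(x, ffn1, mhsa, conv, ffn2, final_norm=True):
    """Macaron ordering: half-step FFN, MHSA, convolution, half-step FFN."""
    x = residual(x, ffn1, scale=0.5)   # first half-step feed-forward
    x = residual(x, mhsa)              # global context
    x = residual(x, conv)              # local context
    x = residual(x, ffn2, scale=0.5)   # second half-step feed-forward
    return [layer_norm(v) for v in x] if final_norm else x
```

The two 0.5 factors mean the pair of feed-forward modules together contributes one full residual step, which is the Macaron idea described above.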

Empirically, ablations identify the convolution module as critical for performance: removing it degrades ASR accuracy (Gulati et al., 2020).
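To make concrete what the convolution module contributes, the following dependency-free Python sketch implements its core operations: pointwise convolution, GLU gating, depthwise convolution over time, and Swish. It is an illustration under assumptions rather than a faithful reimplementation: sequences are lists of per-frame channel vectors, kernels have odd length with zero padding at the edges, the weight arguments `W_in`, `kernels`, and `W_out` are hypothetical names, and batch normalization and dropout are left out for brevity.

```python
import math

def pointwise(seq, W):
    """1x1 convolution: apply the same linear map W (out x in) per frame."""
    return [[sum(w * v for w, v in zip(row, x)) for row in W] for x in seq]

def glu(seq):
    """Gated linear unit: first half of channels gated by sigmoid of second."""
    half = len(seq[0]) // 2
    return [[a / (1.0 + math.exp(-b)) for a, b in zip(x[:half], x[half:])]
            for x in seq]

def depthwise_conv1d(seq, kernels):
    """Filter each channel over time with its own kernel (zero padded)."""
    T, C, k = len(seq), len(seq[0]), len(kernels[0])
    pad = k // 2
    out = [[0.0] * C for _ in range(T)]
    for t in range(T):
        for c in range(C):
            for i in range(k):
                s = t + i - pad
                if 0 <= s < T:
                    out[t][c] += kernels[c][i] * seq[s][c]
    return out

def swish(seq):
    """Swish activation: x * sigmoid(x)."""
    return [[v / (1.0 + math.exp(-v)) for v in x] for x in seq]

def conv_module(seq, W_in, kernels, W_out):
    """Pointwise conv -> GLU -> depthwise conv -> Swish -> pointwise conv."""
    h = glu(pointwise(seq, W_in))      # W_in maps C -> 2C; GLU returns C
    h = depthwise_conv1d(h, kernels)   # mixes neighboring frames per channel
    h = swish(h)                       # full module applies batch norm first
    return pointwise(h, W_out)
```

Only `depthwise_conv1d` mixes information across neighboring frames; every other step acts on one frame at a time, which is why this module supplies the local context that self-attention alone does not enforce.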

3. Architectural Variants and Extensions

A range of Conformer variants has been proposed to improve efficiency, scalability, and domain adaptation:

Variant	Key Change/Addition	Application Domain/Benefit
Conformer-NTM	External neural Turing machine memory	Long-form ASR, robust long-context modeling (Carvalho et al., 2023)
E-Branchformer	Parallel MHSA–cgMLP branches, conv-based merge	Enhanced stability, cross-task generalization (Peng et al., 2023)
Squeezeformer	Temporal U-Net down/up–sampling, simplified blocks	Fewer FLOPs, improved WER (Kim et al., 2022)
Fast Conformer	Aggressive front-loaded downsampling, linear attention	2.8× speedup, scalable to 1B+ params (Rekesh et al., 2023)
Uconv-Conformer	U-Net–style skip/upsample, 16× sequence reduction	50% faster CPU, stable CTC (Andrusenko et al., 2022)
Skipformer	Dynamic CTC-based skip-and-recover gating	22–31× compression, same or better WER (Zhu et al., 2024)

In domains outside of speech, architectural modifications adapt the convolution module and input preprocessing for computer vision (e.g., Feature Coupling Unit for local-global fusion (Peng et al., 2021)), music (spectrogram and CRF decoding (Akram et al., 2025)), sign language recognition (spatio-temporal keypoints and convolutional subsampling (Elden, 2025)), and noisy sensor data (multi-head convolution-only “ConFormer” (Yella et al., 2021)).

4. Training, Optimization, and Loss Functions

Optimization: Standard training employs Adam or AdamW optimizers, often with learning rate schedules incorporating warmup and decay; additional tricks include SpecAugment, speed perturbation, and label smoothing.
Sequence and CTC loss: A hybrid CTC-attention loss, $\mathcal{L} = \alpha\,\mathcal{L}_{CTC} + (1-\alpha)\,\mathcal{L}_{Attn}$ with $\alpha \in [0, 1]$, balances the monotonic alignment enforced by CTC against the flexible alignment of the attention decoder (Carvalho et al., 2023, Guo et al., 2020).
Stability: Macaron residuals, attentive pooling, and regularization are central to stable training at depth.
Multi-task Losses: In multitask settings, task-specific heads, e.g., for segmentation or translation, are attached after the encoder or CSS blocks (Peng et al., 2023, Carvalho et al., 2023).
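Two of the ingredients listed above lend themselves to a short worked sketch: the hybrid CTC-attention interpolation and a warmup-then-decay learning rate. The code below is illustrative only. The default `alpha=0.3` and the schedule constants (`d_model=256`, `warmup_steps=10000`) are assumed placeholder values, and the schedule shown is the inverse-square-root rule from the original Transformer, which is one common choice rather than the only one used with Conformers.

```python
def hybrid_loss(loss_ctc, loss_attn, alpha=0.3):
    """Interpolate CTC and attention losses: alpha*CTC + (1-alpha)*Attn."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * loss_ctc + (1.0 - alpha) * loss_attn

def warmup_lr(step, d_model=256, warmup_steps=10000, peak_scale=1.0):
    """Linear warmup for `warmup_steps`, then inverse-square-root decay."""
    step = max(step, 1)  # avoid division by zero at step 0
    return peak_scale * d_model ** -0.5 * min(
        step ** -0.5, step * warmup_steps ** -1.5
    )
```

Setting `alpha=1.0` recovers pure CTC training and `alpha=0.0` pure attention training. The learning rate peaks at `step == warmup_steps` and halves each time the step count quadruples afterwards.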

5. Empirical Performance and Applications

Conformer-based networks are reported to outperform transformer-only and CNN-only models across ASR, speech translation (ST), spoken language understanding (SLU), music chord recognition, image, and video tasks:

Speech Recognition: State-of-the-art WER on LibriSpeech test-clean/test-other (e.g., Conformer-L 2.1%/4.3% vs. Transformer 2.4%/5.6%; Squeezeformer further improves with fewer FLOPs) (Gulati et al., 2020, Kim et al., 2022).
Long-form ASR: Memory-augmented variants (e.g., Conformer-NTM) achieve up to 58.1% WER reduction on very long utterances (Carvalho et al., 2023).
Speaker Verification: Multi-scale feature aggregation (MFA-Conformer) and parameter transfer from ASR lead to EER at or below ECAPA-TDNN, with ~32% faster CPU inference (Zhang et al., 2022, Liao et al., 2022).
Translation and SLU: Conformer and E-Branchformer achieve top BLEU and SLU-F1 scores across multiple benchmarks, with stable convergence (Peng et al., 2023).
Speech Separation and Enhancement: Conformer/TasNet hybrids with linear attention outperform TDCN++ both in accuracy (SI-SNRi) and inference speed (Koizumi et al., 2021).
Vision, Music, Multimodal: Hybrid architectures generalize to ImageNet (Conformer-S exceeds DeiT-B by 1.6% top-1 at 2.5× fewer params) and ChordFormer excels at large-vocabulary MIREX triad tasks (Peng et al., 2021, Akram et al., 2025, Elden, 2025).

6. Implementation Strategies and Best Practices

Layer and Kernel Configuration: 12–16 encoder blocks, d_model=256–512, FFN expansion factor $4\times$, depthwise convolution kernel sizes typically 15–31.
Input Preprocessing: Strided convolutional subsampling (common factor: 4–8) to accelerate computation and reduce attention cost.
Parameter Search: Neural architecture search (NAS) with DARTS/Dynamic Search Schedule finds that block-wise hyperparameter diversity (e.g., variable head count, dilation) further improves recognition accuracy (Liu et al., 2021).
Multi-scale and Parallelism: Multi-scale aggregation and branch parallelism (E-Branchformer, multi-head convolution) are effective for robust training, especially in low-resource regimes or noisy settings (Peng et al., 2023, Yella et al., 2021).
Generalization: Attentive pooling and relative positional encoding enhance transfer to unseen tasks and domains (Liao et al., 2022).
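The effect of strided convolutional subsampling on attention cost can be checked with simple arithmetic. The sketch below assumes a front end of unpadded convolutions with kernel size 3 and stride 2 (a common but not universal choice; the function and parameter names are illustrative) and counts query-key pairs as a rough proxy for self-attention cost.

```python
def conv_out_len(T, kernel=3, stride=2):
    """Number of frames after one unpadded strided convolution."""
    return (T - kernel) // stride + 1 if T >= kernel else 0

def subsampled_len(T, num_layers=2, kernel=3, stride=2):
    """Frames remaining after a stack of identical strided convolutions."""
    for _ in range(num_layers):
        T = conv_out_len(T, kernel, stride)
    return T

def attention_pairs(T):
    """Query-key pairs scored by full self-attention over T frames."""
    return T * T

frames = 1000                                      # e.g. 10 s at a 10 ms hop
after_4x = subsampled_len(frames, num_layers=2)    # 249 frames
after_8x = subsampled_len(frames, num_layers=3)    # 124 frames
saving = attention_pairs(frames) / attention_pairs(after_4x)  # about 16x
```

Because attention cost grows quadratically with sequence length, a roughly 4× reduction in frames cuts the number of scored pairs by about 16×, which is why subsampling is applied before the first Conformer block.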

7. Architectural Significance and Ongoing Developments

The Conformer block is now the canonical backbone for end-to-end speech modeling and is increasingly generalized to new modalities. Its architectural principles (joint local-global modeling, stable Macaron residuals, and aggressive subsampling) address the trade-offs between context range, parameter efficiency, and hardware acceleration. Ongoing work includes scaling (Fast Conformer, 1B+ params (Rekesh et al., 2023)), dynamic computation (Skipformer (Zhu et al., 2024)), multimodal adaptation (vision, sign language (Peng et al., 2021, Elden, 2025)), and enhancement with external memory for long-form reasoning (Carvalho et al., 2023). The design space continues to expand, and the accumulated evidence suggests that convolution-attention hybrids are broadly effective for sequence modeling across domains.