Conv-Former: A Convolutional Transformer Hybrid
- Conv-Former is a hybrid design motif that integrates convolution with Transformer or Conformer blocks for diverse applications.
- It uses convolution as a token mixer, feed-forward refiner, and dynamic gating mechanism in vision, speech, recommendation, and medical tasks.
- Empirical results show enhanced efficiency, reduced parameters, and improved performance, making it a versatile architecture across domains.
“Conv-Former” denotes a heterogeneous line of arXiv architectures that combine convolutional operators with Transformer or Conformer-style blocks rather than a single standardized model. Across the papers collected under this label, convolution is used as a token mixer, as a generator of projections, as a feed-forward refinement operator, as a dynamic gate on convolutional weights, or as a replacement for selected attention layers. The resulting models span fine-grained plant classification, point-cloud segmentation and scene flow, sequential recommendation, single-image super-resolution, monocular 3D human pose estimation, medical image segmentation, speech enhancement, and automatic speech recognition (Vaishnav et al., 2022, Wu et al., 2022, Wang et al., 2023, Wu et al., 2024, Diaz-Arias et al., 2023, Lin et al., 2023, Gu et al., 2022, Koizumi et al., 2021, Botros et al., 2023, Prabhu et al., 2024). This suggests that “Conv-Former” is best understood as a recurring hybrid design motif: preserve or reintroduce convolutional inductive bias while retaining some of the long-range modeling, residual structure, and modularity associated with Transformer-family networks.
1. Scope and nomenclature
The term appears in several closely related spellings—“Conviformer,” “ConvFormer,” “PointConvFormer,” “DF-Conformer,” and “Multi-Convformer”—and these names refer to architectures that are similar in intent but not identical in mechanism. In the vision literature, “Conviformers: Convolutionally guided Vision Transformer” introduces a convolutional transformer for fine-grained categorization of plants from herbarium sheets and pairs it with the PreSizer preprocessing method (Vaishnav et al., 2022). In recommendation, “ConvFormer: Revisiting Transformer for Sequential User Modeling” replaces self-attention with a Light Temporal Convolutional Network layer (Wang et al., 2023). In super-resolution, “Transforming Image Super-Resolution: A ConvFormer-based Efficient Approach” uses a large-kernel convolutional feature mixer and an edge-preserving feed-forward network (Wu et al., 2024). In 3D human pose estimation, “ConvFormer: Parameter Reduction in Transformer Models for 3D Human Pose Estimation by Leveraging Dynamic Multi-Headed Convolutional Attention” retains scaled dot-product attention but makes the projections convolutional and multi-scale (Diaz-Arias et al., 2023).
| Variant | Domain | Characteristic mechanism |
|---|---|---|
| “Conviformers: Convolutionally guided Vision Transformer” (Vaishnav et al., 2022) | Fine-grained plant classification | Higher-resolution handling and PreSizer |
| “PointConvFormer: Revenge of the Point-based Convolution” (Wu et al., 2022) | Point-cloud segmentation and scene flow | Point convolution re-weighted by feature-difference attention |
| “ConvFormer: Revisiting Transformer for Sequential User Modeling” (Wang et al., 2023) | Next-item prediction | Depth-wise convolution + channel-wise convolution with FFT acceleration |
| “Transforming Image Super-Resolution: A ConvFormer-based Efficient Approach” (Wu et al., 2024) | Lightweight SISR | Large-kernel mixer + edge-preserving FFN |
| “ConvFormer: Parameter Reduction in Transformer Models for 3D Human Pose Estimation...” (Diaz-Arias et al., 2023) | Monocular 3D pose estimation | Dynamic multi-headed convolutional self-attention |
| “ConvFormer: Plug-and-Play CNN-Style Transformers for Improving Medical Image Segmentation” (Lin et al., 2023) | Medical segmentation | Pooling + CNN-style self-attention + convolutional FFN |
| “ConvFormer: Combining CNN and Transformer for Medical Image Segmentation” (Gu et al., 2022) | Medical segmentation | Hierarchical hybrid stem + Enhanced DeTrans |
| “DF-Conformer” (Koizumi et al., 2021), “Practical Conformer” (Botros et al., 2023), “Multi-Convformer” (Prabhu et al., 2024) | Speech enhancement and ASR | Linear attention, conv-only lower blocks, or multiple convolution kernels |
The nomenclature is therefore broad rather than canonical. A plausible implication is that the name signals a design philosophy—convolution-guided Transformer-family modeling—more than a fixed blueprint.
2. Recurrent architectural motifs
A persistent motif is the replacement or restructuring of the Transformer token mixer while preserving the residual, normalization, and feed-forward scaffolding. In sequential user modeling, ConvFormer removes the projections and Softmax attention entirely and replaces the standard block with a depth-wise convolution (DWC) plus channel-wise convolution (CWC) sublayer, called the Light Temporal Convolutional Network layer (Wang et al., 2023). In lightweight super-resolution, the ConvFormer layer similarly replaces multi-head self-attention with a large-kernel depth-wise convolutional gate, while the FFN is reworked into an edge-preserving module (Wu et al., 2024). These designs make the “mixer + MLP” decomposition explicit without insisting on attention as the mixer.
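The "mixer + MLP" decomposition with a convolutional mixer can be illustrated with a small pure-Python sketch. It is not the authors' implementation: the causal left padding, the omission of normalization, and the function names (`depthwise_temporal_conv`, `channelwise_mix`, `conv_mixer_block`) are assumptions made for illustration, and the channel-wise step is read here as a per-time-step linear map across channels.

```python
def depthwise_temporal_conv(x, kernels):
    """Causal depth-wise convolution along time (assumed padding scheme).

    x: list of L time steps, each a list of D channel values.
    kernels: D per-channel kernels (lists of taps); tap j looks j steps back.
    """
    seq_len, dim = len(x), len(x[0])
    out = [[0.0] * dim for _ in range(seq_len)]
    for d in range(dim):
        taps = kernels[d]
        for t in range(seq_len):
            out[t][d] = sum(
                taps[j] * x[t - j][d] for j in range(len(taps)) if t - j >= 0
            )
    return out


def channelwise_mix(x, weight):
    """Mix channels independently at each time step (weight is D_out x D_in)."""
    return [
        [sum(w_row[i] * step[i] for i in range(len(step))) for w_row in weight]
        for step in x
    ]


def conv_mixer_block(x, kernels, weight):
    """Token mixing by convolution only, wrapped in a residual connection."""
    mixed = channelwise_mix(depthwise_temporal_conv(x, kernels), weight)
    return [[a + b for a, b in zip(row, mrow)] for row, mrow in zip(x, mixed)]


x = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]   # 3 time steps, 2 channels
kernels = [[1.0, 1.0], [1.0, 0.0]]          # one kernel per channel
identity = [[1.0, 0.0], [0.0, 1.0]]
y = conv_mixer_block(x, kernels, identity)
```

Because each tap is tied to a specific time offset, reversing the input sequence does not simply reverse the output, which is one way to see the order sensitivity that an attention mixer without positional information lacks.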
A second motif is to keep attention but make it convolutional in how queries, keys, or values are formed. The 3D pose ConvFormer generates $Q$, $K$, and $V$ via small 1D convolutions along the sequence dimension and aggregates multiple kernel sizes with learned nonnegative weights (Diaz-Arias et al., 2023). The medical segmentation ConvFormer with CNN-style self-attention projects $Q$, $K$, and $V$ by convolutions over 2D feature maps, constructs self-attention matrices as convolution kernels with adaptive sizes, and follows them with a convolutional feed-forward network (Lin et al., 2023). The hierarchical medical segmentation ConvFormer based on Enhanced DeTrans inserts a depth-wise convolution into the Deformable Transformer feed-forward module and combines convolutional and deformable-attention branches in a residual-shaped hybrid stem (Gu et al., 2022).
A third motif reverses the direction of influence: attention modulates convolution rather than replacing it. PointConvFormer preserves the point-convolution operator and uses an attention score based on feature difference between points in the neighborhood to modify the convolutional weights at each point:
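$$X'_p = \sum_{p_i \in \mathcal{N}(p)} w(p_i - p)^{\top}\, \psi(X_{p_i} - X_p)\, X_{p_i},$$
where $\mathcal{N}(p)$ is the neighborhood of point $p$, $w(\cdot)$ produces convolutional weights from relative position, and $\psi(\cdot)$ is the attention score computed from the feature difference.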
In the paper’s interpretation, this preserves the invariances from point convolution while using attention to select relevant points in the neighborhood for convolution (Wu et al., 2022).
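A simplified sketch of attention modulating a point convolution is given below. It is an illustration under stated assumptions, not the paper's operator: the positional weight is reduced to a scalar returned by a caller-supplied function (`pos_weight`, hypothetical), and the gate is a hypothetical exponential of the negative squared feature difference, controlled by a `temperature` parameter that is also an assumption.

```python
import math


def attention_gated_point_conv(center_pos, center_feat, nbr_pos, nbr_feat,
                               pos_weight, temperature=1.0):
    """Sum neighbor features, weighted by position and gated by feature similarity."""
    out = [0.0] * len(center_feat)
    for p_i, x_i in zip(nbr_pos, nbr_feat):
        # Convolutional weight: depends only on the relative position.
        rel = [a - b for a, b in zip(p_i, center_pos)]
        w = pos_weight(rel)
        # Hypothetical gate: close to 1 for similar features, close to 0 otherwise.
        diff2 = sum((a - b) ** 2 for a, b in zip(x_i, center_feat))
        gate = math.exp(-diff2 / temperature)
        for c, value in enumerate(x_i):
            out[c] += w * gate * value
    return out


center_pos, center_feat = (0, 0, 0), [1.0, 0.0]
nbr_pos = [(1, 0, 0), (0, 1, 0)]
nbr_feat = [[1.0, 0.0], [100.0, 0.0]]   # second neighbor has very different features
result = attention_gated_point_conv(
    center_pos, center_feat, nbr_pos, nbr_feat, pos_weight=lambda rel: 1.0
)
```

In this toy example the second neighbor is suppressed by the gate, so only the similar neighbor contributes. Since the positional weight sees only relative positions, shifting every point by the same offset leaves the output unchanged.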
A fourth motif is explicit complexity control. ConvFormer for recommendation accelerates full-sequence DWC using the convolution theorem,
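$$x * k = \mathcal{F}^{-1}\big(\mathcal{F}(x) \odot \mathcal{F}(k)\big),$$
where $\mathcal{F}$ denotes the discrete Fourier transform, $x$ the input sequence, and $k$ the kernel, reducing the cost of the DWC from quadratic to log-linear in the sequence length (Wang et al., 2023). DF-Conformer replaces quadratic self-attention by FAVOR+ linear attention and pairs it with stacked 1-D dilated depthwise convolution layers (Koizumi et al., 2021). The optimized ASR Conv-Former replaces lower Conformer blocks with convolution-only blocks and uses an RNNAttention-Performer to reduce latency (Botros et al., 2023).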
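The convolution-theorem identity behind the FFT acceleration can be checked numerically. The sketch below uses a textbook radix-2 FFT written with the standard library. The function names, the zero padding to a power of two, and the causal truncation to the first $L$ outputs are assumptions for illustration rather than details taken from the paper.

```python
import cmath


def fft(a, invert=False):
    """Recursive radix-2 FFT; len(a) must be a power of two. No 1/n scaling."""
    n = len(a)
    if n == 1:
        return list(a)
    even = fft(a[0::2], invert)
    odd = fft(a[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        tw = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + tw
        out[k + n // 2] = even[k] - tw
    return out


def fft_conv(x, kernel):
    """Causal convolution of one channel via pointwise product in frequency space."""
    n = 1
    while n < len(x) + len(kernel) - 1:   # pad so circular == linear convolution
        n *= 2
    fx = fft([complex(v) for v in x] + [0j] * (n - len(x)))
    fk = fft([complex(v) for v in kernel] + [0j] * (n - len(kernel)))
    y = fft([a * b for a, b in zip(fx, fk)], invert=True)
    return [v.real / n for v in y[:len(x)]]   # apply the 1/n inverse scaling


def direct_conv(x, kernel):
    """Reference implementation with cost proportional to len(x) * len(kernel)."""
    return [
        sum(kernel[j] * x[t - j] for j in range(len(kernel)) if t - j >= 0)
        for t in range(len(x))
    ]


x = [1.0, 2.0, 3.0, 4.0]
kernel = [1.0, 0.0, -1.0, 2.0]   # kernel as long as the sequence
fast, slow = fft_conv(x, kernel), direct_conv(x, kernel)
```

When the kernel spans the full sequence, the direct loop performs on the order of $L^2$ multiplications per channel, while the FFT path needs on the order of $L \log L$, which is where the benefit for long sequences comes from.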
3. Vision, super-resolution, and medical segmentation
In fine-grained plant classification, “Conviformers” begins from the observation that fine-grained tasks require discovery of subtle differences between highly similar sub-classes and that such distinctions are often lost when images are downscaled to save memory and computational cost associated with vision transformers (Vaishnav et al., 2022). The model is presented as a convolutional transformer architecture that, unlike the popular Vision Transformer (ConViT), can handle higher resolution images without exploding memory and computational cost. The same work introduces PreSizer, described as a novel, improved pre-processing technique to resize images better while preserving their original aspect ratios, which proved essential for classifying natural plants, and reports SoTA on Herbarium 202x and iNaturalist 2019 (Vaishnav et al., 2022).
In lightweight single-image super-resolution, ConvFormer is instantiated as the core layer of the CFSR network. The large-kernel mixer uses a large-kernel depth-wise convolution as a gate on the features, with the kernel size chosen as a trade-off between receptive field and cost (Wu et al., 2024). The same paper contrasts global self-attention, local window self-attention, and the large-kernel mixer by complexity, and introduces the edge-preserving feed-forward network, whose depth-wise branch combines a learnable convolution with fixed Sobel and Laplacian filters through learnable Softmax gates. The network is built from residual ConvFormer blocks, each stacking several ConvFormer layers, and is designed to keep parameters and FLOPs low. On Urban100, CFSR achieves higher PSNR than ShuffleMixer while reducing both parameters and FLOPs (Wu et al., 2024).
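The idea of mixing a learnable filter with fixed edge filters through Softmax gates can be sketched on a single channel. This is an illustration only: the 3x3 kernel size, the split of Sobel into horizontal and vertical filters, the zero padding, and all function names are assumptions, and the actual CFSR module may be organized differently.

```python
import math

SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]
LAPLACIAN = [[0, 1, 0], [1, -4, 1], [0, 1, 0]]


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def conv3x3_same(img, kernel):
    """3x3 cross-correlation with zero padding on a single channel."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            acc = 0.0
            for i in range(3):
                for j in range(3):
                    rr, cc = r + i - 1, c + j - 1
                    if 0 <= rr < h and 0 <= cc < w:
                        acc += kernel[i][j] * img[rr][cc]
            out[r][c] = acc
    return out


def edge_preserving_branch(img, learnable_kernel, gate_logits):
    """Gate-weighted sum of a learnable filter and fixed edge filters."""
    gates = softmax(gate_logits)   # nonnegative, sum to one
    kernels = [learnable_kernel, SOBEL_X, SOBEL_Y, LAPLACIAN]
    responses = [conv3x3_same(img, k) for k in kernels]
    h, w = len(img), len(img[0])
    return [
        [sum(g * resp[r][c] for g, resp in zip(gates, responses)) for c in range(w)]
        for r in range(h)
    ]
```

On a flat region the Sobel and Laplacian responses vanish, so only the learnable filter contributes, while at an intensity step the fixed filters add an explicit edge response. Because every branch is linear, the gated sum is equivalent to one convolution with the gate-weighted sum of the four kernels.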
Medical image segmentation contains two distinct ConvFormer lines. The plug-and-play CNN-style Transformer variant operates directly on 2D feature maps through Pooling, CNN-Style Self-Attention (CSA), and a Convolutional Feed-Forward Network (CFFN) (Lin et al., 2023). Its stated motivation is attention collapse: on relatively limited well-annotated medical image data, attention maps can become similar or even identical. CSA forms an unnormalized cosine-similarity map between query and key features, then multiplies it by a learnable Gaussian mask to obtain an adaptive convolution kernel over the value map (Lin et al., 2023). Across SETR, TransUNet, TransFuse, FAT-Net, and Patcher, the module yields consistent performance gains; for example, SETR on ACDC improves in both Dice and HD, and Dice gains and HD reductions are reported across all backbones and datasets (Lin et al., 2023).
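The combination of a cosine-similarity map with a Gaussian mask can be sketched as follows. This is a hedged illustration: the isotropic Gaussian over squared pixel distance, the single scalar `sigma`, the small `eps` stabilizer, and the function names are assumptions, and the paper's exact parameterization may differ.

```python
import math


def cosine_similarity(a, b, eps=1e-8):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + eps)


def csa_sketch(q, k, v, coords, sigma):
    """Aggregate values with cosine similarity damped by spatial distance.

    q, k, v: per-position feature vectors (flattened 2D map).
    coords: (row, col) of each position; sigma: assumed learnable mask width.
    """
    out = []
    for q_i, c_i in zip(q, coords):
        acc = [0.0] * len(v[0])
        for k_j, v_j, c_j in zip(k, v, coords):
            dist2 = (c_i[0] - c_j[0]) ** 2 + (c_i[1] - c_j[1]) ** 2
            mask = math.exp(-dist2 / (2.0 * sigma ** 2))   # Gaussian mask
            weight = cosine_similarity(q_i, k_j) * mask     # no Softmax applied
            acc = [a + weight * x for a, x in zip(acc, v_j)]
        out.append(acc)
    return out
```

A small `sigma` makes the effective kernel nearly local, while a large `sigma` lets distant positions contribute, which is one way to read the "adaptive size" of the resulting kernel.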
The hierarchical ConvFormer for medical image segmentation instead adopts a U-shaped encoder-decoder architecture built from a shallow Conv stem, three residual-shaped hybrid stems, an additional multi-scale Enhanced DeTrans encoder, and a symmetric decoder (Gu et al., 2022). Enhanced DeTrans retains multi-scale deformable self-attention but redesigns the feed-forward module by inserting a depth-wise convolution, while Enhanced Positional Encoding adds a learnable DWConv branch to sinusoidal positional encoding. On MM-WHS CT, this ConvFormer is compared against UNETR and CoTr on MeanDice and parameter count. It is also evaluated on a lymph node ultrasound dataset, where it reports IoU and F1, and on ISIC skin lesion segmentation, where it reports Jaccard and Dice (Gu et al., 2022).
4. Sequential, geometric, and pose modeling
For sequential user modeling in recommender systems, ConvFormer is explicitly derived from an empirical analysis of self-attention in next-item prediction. The paper identifies three essential criteria for an effective token mixer: order sensitivity, large receptive field, and lightweight architecture (Wang et al., 2023). The resulting model is a standard two-tower next-item model in which each Transformer-style block is replaced by a depth-wise convolution along the time axis and a channel-wise convolution per time step. The stacked LighTCN layers produce the final user representation, which is scored against each candidate item, and training uses a pairwise ranking loss (Wang et al., 2023). On four public datasets (Amazon-Beauty, Sports, Toys, and Yelp), ConvFormer consistently achieves the highest Hit@5/10 and MRR, improves MRR relative to both the best Transformer baseline and FMLP-Rec, and, in the FFT-accelerated ConvFormer-F version, gives a speedup over SASRec when the sequence length is large (Wang et al., 2023).
PointConvFormer targets point-cloud segmentation and scene-flow estimation by combining point convolution, where filter weights are only based on relative position, with Transformer-style feature-based attention (Wu et al., 2022). Its theoretical motivation is tied to generalization: attention filters out neighbors whose feature difference is large, while the point-convolution part preserves translation- and rotation-invariant geometric priors. On ScanNet semantic segmentation, PointConvFormer is compared on mIoU, runtime, and parameter count against a PointConv bottleneck re-implementation and MinkowskiNet42. On SemanticKITTI it reports a higher mIoU than RandLA-Net, MinkowskiNet, and SPVNAS. For scene flow, replacing PointConv with PointConvFormer in PointPWC-Net reduces EPE3D on both FlyingThings3D and KITTI (Wu et al., 2022).
In monocular 3D human pose estimation, ConvFormer follows a two-stage lift pipeline: a spatial ConvFormer first models human joint relations within individual frames, then a temporal ConvFormer fuses all frame embeddings across time to predict the 3D pose of the middle frame (Diaz-Arias et al., 2023). The central operator is dynamic multi-headed convolutional self-attention, in which each head forms $Q$, $K$, and $V$ by small 1D convolutions and then aggregates multiple kernel sizes with learned nonnegative weights that sum to one. The temporal version introduces the temporal joints profile, in which each token already “sees” a local temporal neighborhood of joint features before attention weights are computed (Diaz-Arias et al., 2023). Parameter reduction is a primary design goal: at matched sequence lengths, ConvFormer uses fewer parameters than MHFormer and PoseFormer (Diaz-Arias et al., 2023). On Human3.6M with CPN inputs, it reports average MPJPE under Protocol I and Protocol II and a reduction in velocity error under Protocol III versus the previous SOTA. On MPI-INF-3DHP, it reports PCK, AUC, and MPJPE (Diaz-Arias et al., 2023).
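The multi-scale convolutional projections can be sketched in a few lines. The version below is a single-head illustration under several assumptions: the taps are scalars shared across channels, the nonnegative mixing weights are obtained by a Softmax over hypothetical logits, zero "same" padding is used, and the names `multiscale_projection` and `conv_attention` are invented for this example.

```python
import math


def softmax(values):
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def conv1d_same(seq, taps):
    """1D convolution along the sequence; taps has odd length, zero padding."""
    half, dim = len(taps) // 2, len(seq[0])
    out = []
    for t in range(len(seq)):
        acc = [0.0] * dim
        for j, w in enumerate(taps):
            s = t + j - half
            if 0 <= s < len(seq):
                acc = [a + w * x for a, x in zip(acc, seq[s])]
        out.append(acc)
    return out


def multiscale_projection(seq, tap_sets, logits):
    """Mix several kernel sizes with nonnegative weights that sum to one."""
    weights = softmax(logits)
    branches = [conv1d_same(seq, taps) for taps in tap_sets]
    return [
        [sum(w * b[t][c] for w, b in zip(weights, branches))
         for c in range(len(seq[0]))]
        for t in range(len(seq))
    ]


def conv_attention(seq, q_cfg, k_cfg, v_cfg):
    """Scaled dot-product attention over convolutionally formed Q, K, V."""
    q = multiscale_projection(seq, *q_cfg)
    k = multiscale_projection(seq, *k_cfg)
    v = multiscale_projection(seq, *v_cfg)
    scale = math.sqrt(len(seq[0]))
    out = []
    for q_i in q:
        attn = softmax([sum(a * b for a, b in zip(q_i, k_j)) / scale for k_j in k])
        out.append([sum(a * v_j[c] for a, v_j in zip(attn, v))
                    for c in range(len(seq[0]))])
    return out
```

Each `*_cfg` argument is a pair `(tap_sets, logits)`, for example `([[1.0], [0.25, 0.5, 0.25]], [0.0, 0.0])` for an equal mix of kernel sizes 1 and 3. Because the taps already blend neighboring tokens, each query and key summarizes a local window before any attention weight is computed.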
5. Speech enhancement and automatic speech recognition
In speech enhancement, DF-Conformer integrates Conv-TasNet and Conformer by using a Conformer-style mask-prediction network with linear-complexity FAVOR+ attention and 1-D dilated depthwise convolutions (Koizumi et al., 2021). The encoder is a learned 1-D convolutional analysis filterbank, followed by a multi-layer DF-Conformer mask network and an overlap-add synthesis decoder. Each block applies half-step FFN, MHSA_FAVOR, GLU, dilated depthwise convolution, BatchNorm, Swish, pointwise dense projection, dropout, and a second half-step FFN, then LayerNorm (Koizumi et al., 2021). The model was trained on noisy speech data. In the reported comparisons, DF-Conformer-8 is evaluated against TDCN++ on SI-SNRi, ESTOI, and real-time factor (RTF), and the iterative iDF-Conformer-12 variant is evaluated on the same metrics (Koizumi et al., 2021).
The optimized ASR Conv-Former described in “Practical Conformer” is a streamlined, memory-lite Conformer encoder intended for ultra-low-latency, on-device ASR and as the first pass in a two-stage cascaded system (Botros et al., 2023). Its three principal interventions are replacing lower Conformer blocks with convolution-only blocks, strategically downsizing the architecture, and utilizing an RNNAttention-Performer. Relative to a causal Conformer baseline, the optimized Conv-Former reports a smaller model size, fewer FLOPs, and lower TPU latency, at the cost of a higher WER (Botros et al., 2023). In a cascaded encoder design, the first-pass causal Conv-Former produces frame-level embeddings for a low-latency RNN-T decoder, and a second-pass non-causal Conformer operates on those embeddings when more compute is available. The second pass recovers WER to match the large-model pipeline (Botros et al., 2023).
Multi-Convformer revisits the Conformer convolution module itself. Instead of a single depthwise convolution of fixed kernel size, it applies several parallel depthwise convolutions with different kernel sizes and fuses them with gating (Prabhu et al., 2024). The default variant performs best among the sum, weighted-sum, concatenation, and concat-plus-depthwise-conv fusion strategies. In an AED model, Multi-Convformer adds a reported parameter overhead relative to Conformer (Prabhu et al., 2024). Under the AED setting, WER improves on LS-100 Test Clean and Test Other and on TEDLIUM-2, and CER improves on AISHELL Test. The paper summarizes the effect as relative WER improvements while remaining more parameter efficient than existing Conformer variants such as CgMLP and E-Branchformer (Prabhu et al., 2024).
6. Empirical patterns, limitations, and misconceptions
A consistent empirical pattern is that Conv-Former variants are used to reconcile local inductive bias with long-range context while improving the accuracy-speed-parameter trade-off. In recommendation, the claimed benefit arises from order sensitivity, large receptive field, and weight sharing (Wang et al., 2023). In super-resolution and fine-grained classification, large-kernel or convolutionally guided designs are used to retain broad spatial context without the quadratic cost of global self-attention or without downscaling away fine detail (Wu et al., 2024, Vaishnav et al., 2022). In medical segmentation, convolutional structure is explicitly used to mitigate attention collapse on small-scale training data and to preserve 2D feature-map geometry (Lin et al., 2023). In point clouds, the benefit is boundary-aware neighborhood selection while preserving the invariances of continuous point convolution (Wu et al., 2022).
The literature also records nontrivial trade-offs. In the optimized ASR Conv-Former, the small single-pass encoder incurs a WER increase before the cascaded second pass recovers accuracy (Botros et al., 2023). In DF-Conformer, FAVOR+ can slightly degrade ESTOI and SI-SNRi in the smallest setting, even though it enables scaling to large sequence lengths (Koizumi et al., 2021). In Multi-Convformer, adding too many kernels is not uniformly beneficial: a smaller kernel set performs best on the cited ablation, whereas a larger kernel set degrades (Prabhu et al., 2024). The hierarchical medical segmentation ConvFormer also notes increased model complexity and training time, along with potential memory bottlenecks for very high resolutions (Gu et al., 2022).
Several misconceptions are therefore contradicted by the arXiv record. First, Conv-Former is not synonymous with an attention-free network: the 3D pose ConvFormer, CSA-based medical ConvFormer, Enhanced DeTrans ConvFormer, DF-Conformer, and Multi-Convformer all retain attention or Conformer-style attention modules (Diaz-Arias et al., 2023, Lin et al., 2023, Gu et al., 2022, Koizumi et al., 2021, Prabhu et al., 2024). Second, Conv-Former is not restricted to vision; the term is used in recommendation, speech enhancement, and ASR as well (Wang et al., 2023, Koizumi et al., 2021, Botros et al., 2023). Third, parameter reduction is common but not universal: some variants primarily target convergence behavior, attention diversity, or high-resolution handling rather than absolute parameter minimization (Vaishnav et al., 2022, Lin et al., 2023).
Taken together, the cited works support a narrower but more precise interpretation. “Conv-Former” names a family of architectures that treat convolution not as a legacy component to be displaced by attention, but as a first-class mechanism for locality, order sensitivity, multi-scale aggregation, efficiency, and stabilization. The specific implementation differs sharply by domain—large-kernel mixers in SISR, convolutional projections in pose estimation, adaptive attention kernels in medical imaging, feature-difference gating in point clouds, and multi-kernel or linear-attention Conformer modules in speech—but the unifying objective is recurrent: preserve the strengths of convolution while retaining the expressive and modular benefits of Transformer-family design.