
Conv-Former: A Convolutional Transformer Hybrid

Updated 3 July 2026
  • Conv-Former is a hybrid design motif that integrates convolution with Transformer or Conformer blocks for diverse applications.
  • It uses convolution as a token mixer, feed-forward refiner, and dynamic gating mechanism in vision, speech, recommendation, and medical tasks.
  • Empirical results show enhanced efficiency, reduced parameters, and improved performance, making it a versatile architecture across domains.

“Conv-Former” denotes a heterogeneous line of arXiv architectures that combine convolutional operators with Transformer or Conformer-style blocks rather than a single standardized model. Across the papers collected under this label, convolution serves as a token mixer, as a generator of Q/K/V projections, as a feed-forward refinement operator, as an operator whose weights are dynamically gated by attention, or as a replacement for selected attention layers. The resulting models span fine-grained plant classification, point-cloud segmentation and scene flow, sequential recommendation, single-image super-resolution, monocular 3D human pose estimation, medical image segmentation, speech enhancement, and automatic speech recognition (Vaishnav et al., 2022, Wu et al., 2022, Wang et al., 2023, Wu et al., 2024, Diaz-Arias et al., 2023, Lin et al., 2023, Gu et al., 2022, Koizumi et al., 2021, Botros et al., 2023, Prabhu et al., 2024). This suggests that “Conv-Former” is best understood as a recurring hybrid design motif: preserve or reintroduce convolutional inductive bias while retaining some of the long-range modeling, residual structure, and modularity associated with Transformer-family networks.

1. Scope and nomenclature

The term appears in several closely related spellings—“Conviformer,” “ConvFormer,” “PointConvFormer,” “DF-Conformer,” and “Multi-Convformer”—and these names refer to architectures that are similar in intent but not identical in mechanism. In the vision literature, “Conviformers: Convolutionally guided Vision Transformer” introduces a convolutional transformer for fine-grained categorization of plants from herbarium sheets and pairs it with the PreSizer preprocessing method (Vaishnav et al., 2022). In recommendation, “ConvFormer: Revisiting Transformer for Sequential User Modeling” replaces self-attention with a Light Temporal Convolutional Network layer (Wang et al., 2023). In super-resolution, “Transforming Image Super-Resolution: A ConvFormer-based Efficient Approach” uses a large-kernel convolutional feature mixer and an edge-preserving feed-forward network (Wu et al., 2024). In 3D human pose estimation, “ConvFormer: Parameter Reduction in Transformer Models for 3D Human Pose Estimation by Leveraging Dynamic Multi-Headed Convolutional Attention” retains scaled dot-product attention but makes the projections convolutional and multi-scale (Diaz-Arias et al., 2023).

| Variant | Domain | Characteristic mechanism |
| --- | --- | --- |
| “Conviformers: Convolutionally guided Vision Transformer” (Vaishnav et al., 2022) | Fine-grained plant classification | Higher-resolution handling and PreSizer |
| “PointConvFormer: Revenge of the Point-based Convolution” (Wu et al., 2022) | Point-cloud segmentation and scene flow | Point convolution re-weighted by feature-difference attention |
| “ConvFormer: Revisiting Transformer for Sequential User Modeling” (Wang et al., 2023) | Next-item prediction | Depth-wise convolution + channel-wise convolution with FFT acceleration |
| “Transforming Image Super-Resolution: A ConvFormer-based Efficient Approach” (Wu et al., 2024) | Lightweight SISR | Large-kernel mixer + edge-preserving FFN |
| “ConvFormer: Parameter Reduction in Transformer Models for 3D Human Pose Estimation...” (Diaz-Arias et al., 2023) | Monocular 3D pose estimation | Dynamic multi-headed convolutional self-attention |
| “ConvFormer: Plug-and-Play CNN-Style Transformers for Improving Medical Image Segmentation” (Lin et al., 2023) | Medical segmentation | Pooling + CNN-style self-attention + convolutional FFN |
| “ConvFormer: Combining CNN and Transformer for Medical Image Segmentation” (Gu et al., 2022) | Medical segmentation | Hierarchical hybrid stem + Enhanced DeTrans |
| “DF-Conformer” (Koizumi et al., 2021), “Practical Conformer” (Botros et al., 2023), “Multi-Convformer” (Prabhu et al., 2024) | Speech enhancement and ASR | Linear attention, conv-only lower blocks, or multiple convolution kernels |

The nomenclature is therefore broad rather than canonical. A plausible implication is that the name signals a design philosophy—convolution-guided Transformer-family modeling—more than a fixed blueprint.

2. Recurrent architectural motifs

A persistent motif is the replacement or restructuring of the Transformer token mixer while preserving the residual, normalization, and feed-forward scaffolding. In sequential user modeling, ConvFormer removes the Q/K/V projections and softmax attention entirely and replaces the standard block with a depth-wise convolution (DWC) plus channel-wise convolution (CWC) sublayer, called the Light Temporal Convolutional Network (LighTCN) layer (Wang et al., 2023). In lightweight super-resolution, the ConvFormer layer similarly replaces multi-head self-attention with a large-kernel depth-wise convolutional gate, while the FFN is reworked into an edge-preserving module (Wu et al., 2024). These designs make the “mixer + MLP” decomposition explicit without insisting on attention as the mixer.
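To make the attention-free mixer concrete, the following is a minimal, dependency-free sketch of a depth-wise convolution along time followed by a per-time-step channel mix, wrapped in a residual connection. The function names, the causal (left-only) padding, and the plain linear channel mix are assumptions made for illustration; they are not taken from Wang et al. (2023).

```python
def depthwise_conv_time(x, kernels):
    """Depth-wise convolution along the time axis.

    x: list of L time steps, each a list of D channel values.
    kernels: D filters of length K, one per channel (no cross-channel mixing).
    Causal left padding is an assumption of this sketch.
    """
    L, D = len(x), len(x[0])
    K = len(kernels[0])
    out = [[0.0] * D for _ in range(L)]
    for t in range(L):
        for d in range(D):
            acc = 0.0
            for k in range(K):
                src = t - k  # look back k steps
                if src >= 0:
                    acc += kernels[d][k] * x[src][d]
            out[t][d] = acc
    return out


def channel_mix(x, W):
    """Per-time-step linear map over channels; W has shape D_out x D_in."""
    return [[sum(W[o][i] * row[i] for i in range(len(row)))
             for o in range(len(W))] for row in x]


def mixer_block(x, kernels, W):
    """Residual token mixer: x + channel_mix(depthwise_conv_time(x))."""
    mixed = channel_mix(depthwise_conv_time(x, kernels), W)
    return [[a + b for a, b in zip(r, m)] for r, m in zip(x, mixed)]
```

Because each filter tap is tied to a fixed temporal offset, permuting the input sequence changes the output, which is one way to read the order-sensitivity property that the recommendation paper emphasizes.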

A second motif is to keep attention but make it convolutional in how queries, keys, or values are formed. The 3D pose ConvFormer generates $Q$, $K$, and $V$ via small 1D convolutions along the sequence dimension and aggregates multiple kernel sizes with learned nonnegative weights (Diaz-Arias et al., 2023). The medical segmentation ConvFormer with CNN-style self-attention projects $Q$, $K$, and $V$ by $3\times3$ convolutions over 2D feature maps, constructs self-attention matrices as convolution kernels with adaptive sizes, and follows them with a convolutional feed-forward network (Lin et al., 2023). The hierarchical medical segmentation ConvFormer based on Enhanced DeTrans inserts a depth-wise convolution into the Deformable Transformer feed-forward module and combines convolutional and deformable-attention branches in a residual-shaped hybrid stem (Gu et al., 2022).
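A rough sketch of the convolutional-projection idea, assuming scalar tokens for brevity: each of the query, key, and value sequences would be produced by convolving the input at several kernel sizes and mixing the results with nonnegative weights that sum to one. Parameterizing those weights with a softmax over free logits is an assumption of this sketch, as are all function names.

```python
import math


def conv1d_same(seq, kernel):
    """'Same'-padded 1D cross-correlation of a scalar sequence (zero padding)."""
    K = len(kernel)
    pad = K // 2
    out = []
    for t in range(len(seq)):
        acc = 0.0
        for k in range(K):
            src = t + k - pad
            if 0 <= src < len(seq):
                acc += kernel[k] * seq[src]
        out.append(acc)
    return out


def softmax(v):
    m = max(v)
    e = [math.exp(a - m) for a in v]
    s = sum(e)
    return [a / s for a in e]


def multiscale_projection(seq, kernels, logits):
    """Convex combination of convolution outputs at several kernel sizes."""
    w = softmax(logits)  # nonnegative and sums to one (assumed parameterization)
    outs = [conv1d_same(seq, k) for k in kernels]
    return [sum(wi * o[t] for wi, o in zip(w, outs)) for t in range(len(seq))]


def attention(q, k, v):
    """Scaled dot-product attention for scalar tokens (d = 1, so the scale is 1)."""
    out = []
    for qi in q:
        a = softmax([qi * kj for kj in k])
        out.append(sum(ai * vj for ai, vj in zip(a, v)))
    return out
```

In use, `multiscale_projection` would be called three times with separate kernels and logits to obtain the query, key, and value sequences before `attention` is applied.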

A third motif reverses the direction of influence: attention modulates convolution rather than replacing it. PointConvFormer preserves the point-convolution operator and uses an attention score based on feature difference between points in the neighborhood to modify the convolutional weights at each point:

[…]

In the paper’s interpretation, this preserves the invariances from point convolution while using attention to select relevant points in the neighborhood for convolution (Wu et al., 2022).
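The re-weighting idea can be sketched as follows: each neighbor contributes its feature scaled by a position-dependent weight and by an attention score computed from its feature difference to the center point. This is an illustrative toy, not the PointConvFormer operator itself; `position_weight` is a hypothetical stand-in for a learned weight function, and scoring by a softmax over negative squared feature differences is an assumption.

```python
import math


def position_weight(dp):
    """Hypothetical stand-in for a learned function of relative position."""
    return math.exp(-sum(c * c for c in dp))


def feature_attention(x_center, x_neighbors, temperature=1.0):
    """Softmax over negative squared feature differences (assumed scoring rule)."""
    scores = [-sum((a - b) ** 2 for a, b in zip(xn, x_center)) / temperature
              for xn in x_neighbors]
    m = max(scores)
    e = [math.exp(s - m) for s in scores]
    z = sum(e)
    return [v / z for v in e]


def attended_point_conv(p, x_p, neighbors):
    """neighbors: list of (position, feature) pairs in the neighborhood of p."""
    att = feature_attention(x_p, [x for _, x in neighbors])
    out = [0.0] * len(x_p)
    for (q, x_q), a in zip(neighbors, att):
        # Weight depends only on the relative position q - p.
        w = position_weight([qc - pc for qc, pc in zip(q, p)])
        for d in range(len(out)):
            out[d] += w * a * x_q[d]
    return out
```

Since the positional term sees only relative coordinates, translating the whole neighborhood leaves the output unchanged, while the attention term suppresses neighbors whose features differ strongly from the center.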

A fourth motif is explicit complexity control. ConvFormer for recommendation accelerates full-sequence DWC using the convolution theorem,

$$x * w = \mathcal{F}^{-1}\big(\mathcal{F}(x) \odot \mathcal{F}(w)\big),$$

where $x$ is one channel of the input sequence, $w$ is its length-$L$ kernel, and $\mathcal{F}$ is the discrete Fourier transform, reducing the per-channel DWC cost from $O(L^2)$ to $O(L \log L)$ (Wang et al., 2023). DF-Conformer replaces quadratic self-attention by FAVOR+ linear attention and pairs it with stacked 1-D dilated depthwise convolution layers (Koizumi et al., 2021). The optimized ASR Conv-Former replaces lower Conformer blocks with convolution-only blocks and uses an RNNAttention-Performer to reduce latency (Botros et al., 2023).
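The convolution-theorem shortcut can be checked numerically. The sketch below, which assumes power-of-two lengths and circular boundary handling, compares a direct quadratic-time circular convolution with one computed through a small recursive FFT; whether the recommendation model uses circular or zero-padded convolution is not stated here, so that choice is an assumption.

```python
import cmath


def fft(a, invert=False):
    """Recursive radix-2 FFT; len(a) must be a power of two.

    With invert=True the result is the unnormalized inverse transform.
    """
    n = len(a)
    if n == 1:
        return list(a)
    even = fft(a[0::2], invert)
    odd = fft(a[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        tw = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + tw
        out[k + n // 2] = even[k] - tw
    return out


def circular_conv_direct(x, w):
    """O(L^2) circular convolution of two length-L sequences."""
    n = len(x)
    return [sum(x[(t - k) % n] * w[k] for k in range(n)) for t in range(n)]


def circular_conv_fft(x, w):
    """O(L log L) circular convolution via the convolution theorem."""
    n = len(x)
    X, W = fft(x), fft(w)
    y = fft([a * b for a, b in zip(X, W)], invert=True)
    return [v.real / n for v in y]  # normalize the inverse transform
```

For a causal (linear) convolution, both sequences would first be zero-padded to at least twice the sequence length so that the circular wrap-around does not mix the end of the sequence into its beginning.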

3. Vision, super-resolution, and medical segmentation

In fine-grained plant classification, “Conviformers” begins from the observation that fine-grained tasks require discovering subtle differences between highly similar sub-classes, and that such distinctions are often lost when images are downscaled to limit the memory and computational cost of vision transformers (Vaishnav et al., 2022). The model is presented as a convolutional transformer architecture that, unlike the popular Vision Transformer (ConViT), can handle higher-resolution images without exploding memory and computational cost. The same work introduces PreSizer, a pre-processing technique that resizes images while preserving their original aspect ratios, which proved essential for classifying natural plants, and reports state-of-the-art results on Herbarium 202x and iNaturalist 2019 (Vaishnav et al., 2022).

In lightweight single-image super-resolution, ConvFormer is instantiated as the core layer of the CFSR network. The large-kernel mixer computes

[…]

followed by

[…]

with […] chosen as a good trade-off between receptive field and cost (Wu et al., 2024). The same paper contrasts global self-attention, local window self-attention, and the large-kernel mixer by complexity, and introduces the edge-preserving feed-forward network, whose depth-wise branch combines a learnable […] convolution with fixed Sobel and Laplacian filters through learnable softmax gates. With […] residual ConvFormer blocks, channel width […], and […] ConvFormer layers per block, the network has about […] K parameters and […] G FLOPs for […] SR. On Urban100, CFSR achieves PSNR/SSIM […] versus […] of ShuffleMixer, a gain of […] dB, while reducing parameters by […] and FLOPs by […] (Wu et al., 2024).
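The gated combination of a learnable filter with fixed edge filters can be illustrated on a single-channel image. In this sketch the $3\times3$ size of the learnable kernel, the use of only the horizontal Sobel filter, and the identity initialization are all assumptions chosen for brevity; only the general scheme of softmax gates over a learnable branch and fixed Sobel and Laplacian branches comes from the description above.

```python
import math

# Standard fixed edge filters.
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
LAPLACIAN = [[0, 1, 0], [1, -4, 1], [0, 1, 0]]
# Hypothetical initial value for the learnable kernel (identity filter).
IDENTITY = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]


def conv3x3(img, ker):
    """Zero-padded 3x3 cross-correlation on a single-channel image."""
    H, W = len(img), len(img[0])
    out = [[0.0] * W for _ in range(H)]
    for i in range(H):
        for j in range(W):
            acc = 0.0
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    y, x = i + di, j + dj
                    if 0 <= y < H and 0 <= x < W:
                        acc += ker[di + 1][dj + 1] * img[y][x]
            out[i][j] = acc
    return out


def softmax(v):
    m = max(v)
    e = [math.exp(a - m) for a in v]
    s = sum(e)
    return [a / s for a in e]


def edge_preserving_branch(img, learnable_ker, gate_logits):
    """Softmax-gated sum of learnable, Sobel, and Laplacian responses."""
    gates = softmax(gate_logits)
    responses = [conv3x3(img, k) for k in (learnable_ker, SOBEL_X, LAPLACIAN)]
    H, W = len(img), len(img[0])
    return [[sum(g * r[i][j] for g, r in zip(gates, responses))
             for j in range(W)] for i in range(H)]
```

On flat regions the fixed filters respond with zero, so the output there is governed by the learnable branch alone, whereas near edges the Sobel and Laplacian terms inject high-frequency detail in proportion to their gates.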

Medical image segmentation contains two distinct ConvFormer lines. The plug-and-play CNN-style Transformer variant operates directly on 2D feature maps through Pooling, CNN-Style Self-Attention (CSA), and a Convolutional Feed-Forward Network (CFFN) (Lin et al., 2023). Its stated motivation is attention collapse: on relatively limited well-annotated medical image data, attention maps can become similar or even identical. CSA forms an unnormalized cosine-similarity map

[…]

then multiplies it by a learnable Gaussian mask to obtain an adaptive convolution kernel […] over the value map (Lin et al., 2023). Across SETR, TransUNet, TransFuse, FAT-Net, and Patcher, the module yields consistent performance gains; for example, SETR on ACDC improves from Dice […] and HD […] to Dice […] and HD […], and across all backbones and datasets the reported gains are […]–[…] Dice and […]–[…] pt HD reduction (Lin et al., 2023).
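As an illustration of how a similarity map and a Gaussian mask can jointly act like an adaptive convolution kernel, the sketch below computes the output at one position of a small 2D grid. The isotropic mask with a single width `sigma`, the dictionary-based feature maps, and the function names are assumptions of this sketch rather than details from Lin et al. (2023).

```python
import math


def cosine(a, b, eps=1e-8):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb + eps)


def gaussian_mask(p, q, sigma):
    """Isotropic Gaussian weight on the squared grid distance between p and q."""
    d2 = (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2
    return math.exp(-d2 / (2.0 * sigma ** 2))


def csa_at(center, Q, K, V, sigma):
    """Output at `center`; Q, K, V map (row, col) -> feature vector.

    The kernel value at each position is similarity * mask, with no softmax.
    """
    out = [0.0] * len(V[center])
    for pos in K:
        kern = cosine(Q[center], K[pos]) * gaussian_mask(center, pos, sigma)
        for d in range(len(out)):
            out[d] += kern * V[pos][d]
    return out
```

A small `sigma` confines the kernel to the immediate neighborhood, while a large `sigma` lets it cover the whole map, so a trainable width gives each layer control over its effective kernel size.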

The hierarchical ConvFormer for medical image segmentation instead adopts a U-shaped encoder-decoder architecture built from a shallow Conv stem, three residual-shaped hybrid stems, an additional multi-scale Enhanced DeTrans encoder, and a symmetric decoder (Gu et al., 2022). Enhanced DeTrans retains multi-scale deformable self-attention but re-designs the feed-forward module by inserting a […] or […] depth-wise convolution, while Enhanced Positional Encoding adds a learnable DWConv branch to sinusoidal positional encoding. On MM-WHS CT, this ConvFormer reports MeanDice […] with […] M parameters, compared with […] for UNETR and […] for CoTr. On a lymph node ultrasound dataset it reports IoU […] and F1 […], and on ISIC skin lesion segmentation it reports Jaccard […] and Dice […] (Gu et al., 2022).

4. Sequential, geometric, and pose modeling

For sequential user modeling in recommender systems, ConvFormer is explicitly derived from an empirical analysis of self-attention in next-item prediction. The paper identifies three essential criteria for an effective token mixer: order sensitivity, large receptive field, and lightweight architecture (Wang et al., 2023). The resulting model is a standard two-tower next-item model in which each Transformer-style block is replaced by a depth-wise convolution along the time axis and a channel-wise convolution per time step. After […] stacked LighTCN layers, the final user representation is […], the score for a candidate item is […], and training uses a pairwise ranking loss (Wang et al., 2023). On four public datasets (Amazon-Beauty, Sports, Toys, and Yelp) with […], […], and […] blocks, ConvFormer consistently achieves the highest Hit@5/10 and MRR on […]-vs-[…] tests, improves MRR by […]–[…] relative over the best Transformer baseline and by […]–[…] over FMLP-Rec, and, in the FFT-accelerated ConvFormer-F version, gives a […]–[…] speedup over SASRec when […] is large (Wang et al., 2023).
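The scoring and training step of a two-tower next-item model can be sketched in a few lines. Both the inner-product score and the BPR-style form of the pairwise loss, $-\log\sigma(s_{\text{pos}} - s_{\text{neg}})$, are common choices assumed here for illustration; the exact expressions used by Wang et al. (2023) are not reproduced in this article.

```python
import math


def score(user_vec, item_vec):
    """Inner-product relevance score (assumed scoring function)."""
    return sum(u * e for u, e in zip(user_vec, item_vec))


def pairwise_loss(user_vec, pos_item, neg_item):
    """BPR-style loss -log(sigmoid(s_pos - s_neg)), written as log(1 + exp(-diff)).

    The specific loss form is an assumption of this sketch.
    """
    diff = score(user_vec, pos_item) - score(user_vec, neg_item)
    return math.log1p(math.exp(-diff))
```

The loss shrinks toward zero as the observed next item is scored further above the sampled negative, and equals $\log 2$ when the two scores tie.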

PointConvFormer targets point-cloud segmentation and scene-flow estimation by combining point convolution, where filter weights are only based on relative position, with Transformer-style feature-based attention (Wu et al., 2022). Its theoretical motivation is tied to generalization: attention filters out neighbors whose feature difference is large, while the point-convolution part preserves translation- and rotation-invariant geometric priors. On ScanNet semantic segmentation at a […] cm grid, PointConvFormer reports […] mIoU / […] ms / […] M parameters, compared with […] / […] ms / […] M for a PointConv bottleneck re-implementation and […] / […] ms / […] M for MinkowskiNet42. On SemanticKITTI it reports […] mIoU, exceeding RandLA-Net at […], MinkowskiNet at […], and SPVNAS at […]. For scene flow, replacing PointConv with PointConvFormer in PointPWC-Net reduces EPE3D from […] m / […] m to […] m / […] m on FlyingThings3D / KITTI, described as […] error reduction (Wu et al., 2022).

In monocular 3D human pose estimation, ConvFormer follows a two-stage lift pipeline: a spatial ConvFormer first models human joint relations within individual frames, then a temporal ConvFormer fuses all frame embeddings across time to predict the 3D pose of the middle frame (Diaz-Arias et al., 2023). The central operator is dynamic multi-headed convolutional self-attention, in which each head forms $Q$, $K$, and $V$ by small 1D convolutions and then aggregates multiple kernel sizes with learned nonnegative weights that sum to one. The temporal version introduces the temporal joints profile, in which each token already “sees” a […]-sized temporal neighborhood of joint features before attention weights are computed (Diaz-Arias et al., 2023). Parameter reduction is a primary design goal: for a […]-frame sequence and […] joints, ConvFormer uses approximately […] M parameters versus approximately […] M for MHFormer, a […] reduction; the […]-frame variant uses […] M versus […] M, an […] saving; and at […] it uses […] M versus approximately […] M for PoseFormer, a […] cut (Diaz-Arias et al., 2023). On Human3.6M with CPN inputs and […], it reports […] mm average MPJPE under Protocol I, […] mm under Protocol II, and a […] reduction in velocity error under Protocol III versus the previous SOTA. On MPI-INF-3DHP with […], it reports […] PCK, […] AUC, and […] mm MPJPE (Diaz-Arias et al., 2023).

5. Speech enhancement and automatic speech recognition

In speech enhancement, DF-Conformer integrates Conv-TasNet and Conformer by using a Conformer-style mask-prediction network with linear-complexity FAVOR+ attention and 1-D dilated depthwise convolutions (Koizumi et al., 2021). The encoder is a learned 1-D convolutional analysis filterbank with window […] ms, hop […] ms, and output dimension […], followed by an […]-layer DF-Conformer mask network and an overlap-add synthesis decoder. Each block applies half-step FFN, MHSA_FAVOR, GLU, dilated depthwise convolution, BatchNorm, Swish, pointwise dense projection, dropout, and a second half-step FFN, then LayerNorm (Koizumi et al., 2021). The model was trained on […] hours of noisy speech data. In the reported comparisons, DF-Conformer-8 achieves SI-SNRi […] dB and ESTOI […] at real-time factor […], compared with TDCN++ at SI-SNRi […] dB, ESTOI […], and RTF […]. The iterative iDF-Conformer-12 variant reaches SI-SNRi […] dB and ESTOI […] at RTF […] (Koizumi et al., 2021).
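Stacking dilated depthwise convolutions is a standard way to grow the temporal receptive field without enlarging the kernel. The sketch below, which assumes odd kernel sizes, centered (non-causal) padding, and an exponentially growing dilation schedule chosen purely for illustration, computes the receptive field in closed form as $1 + \sum_l (k - 1)\,d_l$ and allows it to be checked with an impulse response.

```python
def dilated_conv(seq, kernel, dilation):
    """Centered dilated 1D convolution with zero padding (odd kernel size)."""
    K = len(kernel)
    half = K // 2
    out = []
    for t in range(len(seq)):
        acc = 0.0
        for k in range(K):
            src = t + (k - half) * dilation  # taps are `dilation` samples apart
            if 0 <= src < len(seq):
                acc += kernel[k] * seq[src]
        out.append(acc)
    return out


def receptive_field(kernel_size, dilations):
    """Receptive field of a stack of dilated convolutions with a shared kernel size."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


# Example: three layers with kernel size 3 and dilations 1, 2, 4 (assumed schedule).
impulse = [0.0] * 31
impulse[15] = 1.0
response = impulse
for d in (1, 2, 4):
    response = dilated_conv(response, [1.0, 1.0, 1.0], d)
support = sum(1 for v in response if v != 0.0)
```

With this schedule the impulse spreads over `support` samples, which should agree with `receptive_field(3, [1, 2, 4])`; doubling the dilation per layer makes the field grow exponentially with depth while the parameter count grows only linearly.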

The optimized ASR Conv-Former described in “Practical Conformer” is a streamlined, memory-lite Conformer encoder intended for ultra-low-latency, on-device ASR and as the first pass in a two-stage cascaded system (Botros et al., 2023). Its three principal interventions are replacing lower Conformer blocks with convolution-only blocks, strategically downsizing the architecture, and utilizing an RNNAttention-Performer. Relative to a […]-layer causal Conformer with WER […], size […] M, FLOPs […] M, and TPU latency […] ms, the optimized Conv-Former reports WER […], size […] M, FLOPs […] M, and TPU latency […] ms, a […] latency reduction (Botros et al., 2023). In a cascaded encoder design, the first-pass causal Conv-Former produces frame-level embeddings for a low-latency RNN-T decoder, and a second-pass non-causal Conformer operates on those embeddings when more compute is available. The second pass recovers WER to […], matching the large-model pipeline (Botros et al., 2023).

Multi-Convformer revisits the Conformer convolution module itself. Instead of a single depthwise convolution of fixed kernel size, it applies […] parallel depthwise convolutions with different kernel sizes and fuses them with gating; the best-performing choice is […] with […] (Prabhu et al., 2024). The default variant, MultiConv[…], performs best among sum, weighted-sum, concatenation, and concat-plus-depthwise-conv fusion strategies. In a […]-layer AED model, Conformer has […] M total parameters and Multi-Convformer[…] has […] M, a reported […] overhead (Prabhu et al., 2024). On LS-100 under the AED setting, WER improves from […] to […] on Test Clean and from […] to […] on Test Other; on TEDLIUM-2 it improves from […] to […]; and on AISHELL Test the CER improves from […] to […]. The paper summarizes the effect as up to […] relative WER improvements while remaining more parameter efficient than existing Conformer variants such as CgMLP and E-Branchformer (Prabhu et al., 2024).

6. Empirical patterns, limitations, and misconceptions

A consistent empirical pattern is that Conv-Former variants are used to reconcile local inductive bias with long-range context while improving the accuracy-speed-parameter trade-off. In recommendation, the claimed benefit arises from order sensitivity, large receptive field, and weight sharing (Wang et al., 2023). In super-resolution and fine-grained classification, large-kernel or convolutionally guided designs are used to retain broad spatial context without the quadratic cost of global self-attention or without downscaling away fine detail (Wu et al., 2024, Vaishnav et al., 2022). In medical segmentation, convolutional structure is explicitly used to mitigate attention collapse on small-scale training data and to preserve 2D feature-map geometry (Lin et al., 2023). In point clouds, the benefit is boundary-aware neighborhood selection while preserving the invariances of continuous point convolution (Wu et al., 2022).

The literature also records nontrivial trade-offs. In the optimized ASR Conv-Former, the small single-pass encoder incurs a WER increase from […] to […] before the cascaded second pass recovers accuracy (Botros et al., 2023). In DF-Conformer, FAVOR+ can slightly degrade ESTOI and SI-SNRi in the smallest setting, even though it enables scaling to large sequence lengths (Koizumi et al., 2021). In Multi-Convformer, adding too many kernels is not uniformly beneficial: the […]-kernel set […] performs best on the cited ablation, whereas the […]-kernel set […] degrades (Prabhu et al., 2024). The hierarchical medical segmentation ConvFormer also notes increased model complexity and training time, approximately […]–[…] GPU-hours, and potential memory bottlenecks for very high resolutions (Gu et al., 2022).

Several misconceptions are therefore contradicted by the arXiv record. First, Conv-Former is not synonymous with an attention-free network: the 3D pose ConvFormer, CSA-based medical ConvFormer, Enhanced DeTrans ConvFormer, DF-Conformer, and Multi-Convformer all retain attention or Conformer-style attention modules (Diaz-Arias et al., 2023, Lin et al., 2023, Gu et al., 2022, Koizumi et al., 2021, Prabhu et al., 2024). Second, Conv-Former is not restricted to vision; the term is used in recommendation, speech enhancement, and ASR as well (Wang et al., 2023, Koizumi et al., 2021, Botros et al., 2023). Third, parameter reduction is common but not universal: some variants primarily target convergence behavior, attention diversity, or high-resolution handling rather than absolute parameter minimization (Vaishnav et al., 2022, Lin et al., 2023).

Taken together, the cited works support a narrower but more precise interpretation. “Conv-Former” names a family of architectures that treat convolution not as a legacy component to be displaced by attention, but as a first-class mechanism for locality, order sensitivity, multi-scale aggregation, efficiency, and stabilization. The specific implementation differs sharply by domain—large-kernel mixers in SISR, convolutional projections in pose estimation, adaptive attention kernels in medical imaging, feature-difference gating in point clouds, and multi-kernel or linear-attention Conformer modules in speech—but the unifying objective is recurrent: preserve the strengths of convolution while retaining the expressive and modular benefits of Transformer-family design.
