Attentive Convolution (ATConv)

Updated 16 May 2026

Attentive Convolution (ATConv) is a neural operator that integrates convolutional efficiency with attention mechanisms to dynamically generate context-sensitive kernels.
It employs adaptive routing through context-to-kernel translation and lateral inhibition via differential kernel modulation to enhance feature discrimination.
ATConv achieves lower computational complexity and memory usage compared to self-attention while delivering state-of-the-art performance in both vision and language tasks.

Attentive Convolution (ATConv) refers to a class of neural operators that combine the strengths of convolutional and attention mechanisms to deliver adaptive, content-aware, and efficient information aggregation in both vision and language domains. Unlike static convolutions, which apply fixed filters across all spatial or temporal positions, ATConv integrates attention-based principles directly into the convolutional operator, enabling dynamic routing and feature competition within local receptive fields. This synthesis yields models with enhanced expressivity and locality, while retaining the computational and memory advantages of convolutional structures (Yu et al., 23 Oct 2025, Andreoli, 2019, Yin et al., 2017).

1. Fundamental Principles: Adaptive Routing and Lateral Inhibition

ATConv is grounded in two core ideas derived from a comparative analysis of standard convolution and self-attention (SA):

Adaptive Routing: Standard convolution applies a spatially-invariant kernel, resulting in a homogenized filtering process. Self-attention, by contrast, computes aggregation weights dynamically as a function of content via query-key interactions, allowing information flow to be routed semantically based on the input. ATConv integrates this adaptivity through mechanisms that generate kernels dependent on local/global context (Yu et al., 23 Oct 2025, Andreoli, 2019).
Lateral Inhibition: The softmax operation in SA naturally induces competition among weighted positions (“lateral inhibition”), suppressing redundancy and enhancing discriminative focus. In contrast, conventional convolutional kernels lack such competitive normalization, leading to over-smoothing and representational redundancy. ATConv explicitly embeds lateral inhibition via parametric, data-dependent kernel modulation that enforces center-surround antagonism in the convolutional response (Yu et al., 23 Oct 2025).

These principles position ATConv as a unification of the structural inductive biases of convolution with the expressivity and adaptivity of self-attention.
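The competitive effect of softmax can be seen in a small numeric sketch. It uses only plain Python; the scores are made-up illustrative values, not taken from any of the cited papers. Raising one score lowers the weights of all other positions, which is the "lateral inhibition" behaviour that a fixed convolution kernel lacks.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Three positions with equal scores share the weight evenly.
before = softmax([1.0, 1.0, 1.0])

# Only the first score is raised; the other two weights shrink in response,
# even though their own scores did not change.
after = softmax([3.0, 1.0, 1.0])

print(before)
print(after)
```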

2. Mathematical Formulation and Operator Construction

The formal structure of ATConv for vision proceeds via three major steps:

Context-to-Kernel Translation (C2K, Adaptive Routing): The kernel weights for a $K\times K$ depthwise convolution are conditioned on the input via a parameter generator. This involves:

$\mathbf{Z} = \text{AdaAvgPool}_{K\times K}(\mathrm{Conv}_{1\times1}(\mathbf{X}))$

$\hat{\mathbf{K}} = \mathbf{W}_\mathrm{gen} \cdot \mathrm{Vec}(\phi(\mathbf{Z}))$

Here, $\phi$ is a nonlinearity (e.g., GELU).

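Differential Kernel Modulation (DKM, Lateral Inhibition): The generated kernel, written $\mathbf{K}_{b,c,u,v}$ per sample $b$ and channel $c$, is modulated by subtracting its per-channel average $\bar{\mathbf{K}}_{b,c}$ scaled by a sigmoid-controlled coefficient $\lambda_c$, which enforces competitive (center-surround) inhibition: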

$\alpha_{b,c,u,v}^{\rm ATConv} = \mathbf{K}_{b,c,u,v} - \lambda_c\,\bar{\mathbf{K}}_{b,c}$

Value Projection and Aggregation: Input values are projected:

$\mathbf{V}_{b,c,h,w} = \sum_{i=1}^C W_{\rm value}^{(c,i)} \mathbf{X}_{b,i,h,w}$

The output is aggregated via depthwise convolution:

$\mathbf{Y}_{b,c,h,w} = \sum_{u=0}^{K-1} \sum_{v=0}^{K-1} \alpha_{b,c,u,v}^{\rm ATConv} \mathbf{V}_{b,c,h+u-p, w+v-p}$

(Yu et al., 23 Oct 2025)
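The three steps above can be sketched end to end in NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: it processes a single sample (no batch axis), uses a hypothetical reduced channel count `C_ctx` for the context branch, requires $K$ to divide $H$ and $W$ so that adaptive pooling reduces to block averaging, takes $\lambda_c$ as the sigmoid of a per-channel scalar named `raw_lambda`, uses zero padding with $p = \lfloor K/2 \rfloor$, and omits any output projection. All function and parameter names are illustrative.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU (the text names GELU only as an example)
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def adaptive_avg_pool(x, k):
    """Average-pool (C, H, W) down to (C, k, k); assumes k divides H and W."""
    c, h, w = x.shape
    return x.reshape(c, k, h // k, k, w // k).mean(axis=(2, 4))

def atconv_forward(x, w_ctx, w_gen, raw_lambda, w_value, k):
    """x: (C, H, W). w_ctx: (C_ctx, C). w_gen: (C*k*k, C_ctx*k*k).
    raw_lambda: (C,). w_value: (C, C). Returns (C, H, W)."""
    c, h, w = x.shape
    # C2K: 1x1 conv -> adaptive average pool -> nonlinearity -> linear generator
    ctx = np.einsum("oc,chw->ohw", w_ctx, x)
    z = adaptive_avg_pool(ctx, k)
    kern = (w_gen @ gelu(z).reshape(-1)).reshape(c, k, k)
    # DKM: subtract the per-channel kernel mean, scaled by lambda_c in (0, 1)
    lam = 1.0 / (1.0 + np.exp(-raw_lambda))
    alpha = kern - lam[:, None, None] * kern.mean(axis=(1, 2), keepdims=True)
    # Value projection: channel mixing at every spatial position
    v = np.einsum("ci,ihw->chw", w_value, x)
    # Depthwise aggregation with zero padding p = k // 2
    p = k // 2
    v_pad = np.pad(v, ((0, 0), (p, p), (p, p)))
    y = np.zeros_like(v)
    for u in range(k):
        for vv in range(k):
            y += alpha[:, u, vv, None, None] * v_pad[:, u:u + h, vv:vv + w]
    return y

rng = np.random.default_rng(0)
C, C_ctx, H, W, K = 4, 2, 6, 6, 3
x = rng.standard_normal((C, H, W))
y = atconv_forward(
    x,
    w_ctx=rng.standard_normal((C_ctx, C)),
    w_gen=rng.standard_normal((C * K * K, C_ctx * K * K)),
    raw_lambda=np.zeros(C),
    w_value=rng.standard_normal((C, C)),
    k=K,
)
print(y.shape)
```

Note that the generated kernel depends on the input sample but is shared across all spatial positions of that sample, so the aggregation step remains an ordinary depthwise convolution.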

In natural language processing, ATConv operators extend 1D convolutions to incorporate an attended, content-driven context:

For each position $i$, compute an attentive summary $c_i$ from either self- or cross-sequence attention.
Update the representation via convolution over $[h_{i-1}, h_i, h_{i+1}, c_i]$ using jointly learned local and attention-driven kernels (Yin et al., 2017).
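The two steps above can be sketched as follows. This is a simplified illustration, not the exact model of Yin et al.: it assumes dot-product attention scores, a window of width three with zero padding at the sequence ends, and a tanh nonlinearity. The names `attentive_conv1d`, `w_local`, and `w_ctx` are hypothetical.

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def attentive_conv1d(h, ctx, w_local, w_ctx, b):
    """h: (n, d) focus sequence; ctx: (m, d) sequence to attend over.
    w_local: (3*d, d_out); w_ctx: (d, d_out); b: (d_out,). Returns (n, d_out)."""
    n, d = h.shape
    # Attentive summary c_i: softmax over context positions of dot-product scores
    c = softmax(h @ ctx.T, axis=1) @ ctx
    # Zero-pad so every position has a left and a right neighbour
    hp = np.vstack([np.zeros((1, d)), h, np.zeros((1, d))])
    # Row i holds [h_{i-1}, h_i, h_{i+1}]
    window = np.hstack([hp[:-2], hp[1:-1], hp[2:]])
    # Local kernel and attention-driven kernel are applied jointly
    return np.tanh(window @ w_local + c @ w_ctx + b)

rng = np.random.default_rng(0)
n, m, d, d_out = 5, 7, 4, 3
h = rng.standard_normal((n, d))
premise = rng.standard_normal((m, d))
w_local = rng.standard_normal((3 * d, d_out))
w_ctx = rng.standard_normal((d, d_out))
b = np.zeros(d_out)

cross = attentive_conv1d(h, premise, w_local, w_ctx, b)   # cross-sequence attention
self_ = attentive_conv1d(h, h, w_local, w_ctx, b)         # self-attention
print(cross.shape, self_.shape)
```

Passing the focus sequence itself as `ctx` gives the self-attention variant; passing a second sequence gives the cross-sequence variant used for tasks such as entailment.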

3. Unified Framework: Convolution as Structured Linear Operator

Any convolution (grid, graph, or attention-based) can be described as a factorization over an input-output-indexed linear map, in which a set of structure tensors encodes the structure: spatial shifts for grids, adjacency for graphs, and, crucially, dynamically computed attention matrices for ATConv. This perspective clarifies the transition from fixed convolutions to fully adaptive, content-aware "convolutions" in both vision and language (Andreoli, 2019). In particular, self-attention is a special case where the structure tensor is computed via a softmax of query-key dot products.
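One way to make this view concrete is to write the operator as $\mathbf{Y} = \sum_p \mathbf{A}_p \mathbf{X} \mathbf{\Theta}_p$, where each $\mathbf{A}_p$ is an $N \times N$ structure matrix and each $\mathbf{\Theta}_p$ is a learned feature map. This form and its symbols are an assumption chosen for illustration and are not necessarily the notation of Andreoli (2019). The NumPy sketch below shows that shift matrices recover an ordinary 1D convolution, while a single softmax attention matrix recovers a self-attention layer.

```python
import numpy as np

def structured_conv(x, structure, theta):
    """Y = sum_p A_p @ X @ Theta_p.
    x: (N, D); structure: (P, N, N); theta: (P, D, D_out)."""
    return sum(a @ x @ t for a, t in zip(structure, theta))

def shift_matrices(n, offsets):
    """Fixed structure: A_p[i, j] = 1 when j == i + offset (zero padding at borders)."""
    return np.stack([np.eye(n, k=o) for o in offsets])

def attention_matrix(x, w_q, w_k):
    """Input-dependent structure: one softmax(Q K^T / sqrt(d)) matrix."""
    q, k = x @ w_q, x @ w_k
    s = q @ k.T / np.sqrt(q.shape[1])
    s = s - s.max(axis=1, keepdims=True)
    e = np.exp(s)
    return (e / e.sum(axis=1, keepdims=True))[None]   # shape (1, N, N)

# Grid case: three shift matrices with scalar weights give a width-3 convolution.
signal = np.arange(5.0).reshape(5, 1)
taps = np.array([0.2, 0.5, 0.3])
grid_out = structured_conv(signal, shift_matrices(5, (-1, 0, 1)), taps.reshape(3, 1, 1))

# Attention case: the structure is computed from the input, theta acts as the value map.
rng = np.random.default_rng(0)
tokens = rng.standard_normal((5, 4))
attn = attention_matrix(tokens, rng.standard_normal((4, 4)), rng.standard_normal((4, 4)))
attn_out = structured_conv(tokens, attn, rng.standard_normal((1, 4, 3)))
print(grid_out.shape, attn_out.shape)
```

The only difference between the two cases is where the structure matrices come from: they are fixed in advance for the grid and computed from the input for attention.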

4. Computational Complexity and Memory Analysis

ATConv retains linear complexity with respect to spatial size, analogous to standard convolutions:

ATConv: compute linear in the number of spatial positions $N = HW$ for a fixed number of channels $C$ and kernel size $K$. Storage per layer also scales linearly in $N$.
Self-Attention (SA): compute and memory quadratic in $N$, which limits scalability at large $N$.
Standard DW+PW Conv: compute and memory linear in $N$.

Empirical benchmarks show that ATConv achieves lower latency and a smaller GPU memory footprint than SA in vision models (Yu et al., 23 Oct 2025).
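The scaling gap can be illustrated with back-of-envelope multiply-accumulate (MAC) counts. The cost model below is an assumption made for illustration and is not taken from the cited paper: self-attention token mixing is counted as $2N^2C$ MACs (the $QK^\top$ product plus the attention-weighted sum of values, ignoring projections), and a depthwise plus pointwise convolution as $NCK^2 + NC^2$. ATConv itself is not modelled separately; the text states only that it stays in the linear regime.

```python
def sa_macs(n, c):
    """Token-mixing MACs of self-attention: QK^T plus the weighted sum of V."""
    return 2 * n * n * c

def dw_pw_macs(n, c, k):
    """Depthwise K x K convolution plus pointwise 1 x 1 convolution."""
    return n * c * k * k + n * c * c

def cost_ratio(side, c=64, k=7):
    """How many times more MACs SA needs than DW+PW on a side x side feature map."""
    n = side * side
    return sa_macs(n, c) / dw_pw_macs(n, c, k)

# Channel count 64 and kernel size 7 are arbitrary illustrative choices.
for side in (14, 28, 56):
    print(side, round(cost_ratio(side), 2))
```

Under this model, doubling the side length quadruples the cost of the convolutional path but multiplies the self-attention cost by sixteen, so the ratio grows linearly with $N$.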

5. Architectural Variants and Drop-in Use

AttNet Family (Vision): ATConv forms the token-mixing operation in pure convolutional backbones, replacing self-attention entirely. Four-stage networks (AttNet-T1/T2/T3/T4) that use only ATConv for token mixing attain state-of-the-art classification accuracy with competitive parameter and FLOPs budgets, e.g., AttNet-T2 (27M params, 5.1G FLOPs, 84.4% ImageNet-1K top-1) (Yu et al., 23 Oct 2025).
Replacement in ViTs: Plugging ATConv in place of window or global attention in PVT and Swin yields consistent accuracy gains (e.g., PVT-Tiny: 75.1%→77.5%) (Yu et al., 23 Oct 2025).
NLP and Multimodal: In sentence-level processing, ATTCONV augments local convolution with attended, nonlocal context, outperforming attentive pooling and recurrent attention models on sentiment analysis (Yelp), entailment (SciTail), and fact verification (FEVER) tasks (Yin et al., 2017).

6. Empirical Results and Comparative Evaluation

The following summarizes key empirical results:

Model	Params	FLOPs	ImageNet-1K Top-1 (%)
AttNet-T1	13.7M	2.4G	82.8
AttNet-T2	27.0M	5.1G	84.4
AttNet-T3	49.1M	9.4G	85.3
AttNet-T4	87.3M	16.7G	85.6

For diffusion-based image generation, replacing ViT-style attention in SiT-XL/2 with ATConv reduces FID from 1.97 to 1.82 (ImageNet 256), with a latency reduction of roughly 20% (Yu et al., 23 Oct 2025).

Ablation studies confirm additive benefits: C2K (+2.88% Top-1), output projection (+0.71%), value projection (+1.52%), and DKM (+1.24%). A kernel-size ablation identifies the kernel size that gives ATConv's best expressivity-cost tradeoff (Yu et al., 23 Oct 2025).

In NLP, ATConv (advanced) achieves 67.36% on Yelp sentiment, surpassing attentive pooling CNNs and attentive-LSTM baselines. For SciTail, ATConv (advanced) attains 79.2% (vs. 74.4% Bi-CNN, 71.5% Attentive-LSTM) (Yin et al., 2017). On multi-evidence fact verification (FEVER), ATConv achieves 62.3% (retrieved), 86.0% (gold evidence).

7. Interpretability, Expressivity, and Theoretical Context

ATConv offers several interpretive and practical advantages:

Parameter and Receptive Field Efficiency: Kernels are not statically tiled but learned via compact generators; attention structures can span arbitrary ranges in the input at modest computational cost (Andreoli, 2019).
Feature Alignment and Competition: Differential modulation sharpens feature selectivity, echoing neurobiological lateral inhibition and mitigating over-smoothing observed in standard CNNs (Yu et al., 23 Oct 2025).
Structural Unification: The tensor-factorization formalism unifies grid, graph, and attention-based convolutions under a single operator family, revealing that attention and structured convolution are architectural siblings rather than orthogonal approaches (Andreoli, 2019).
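The feature-competition point can be illustrated with a small plain-Python sketch of the modulation $\mathbf{K} - \lambda_c \bar{\mathbf{K}}$ on one $3\times3$ kernel. The kernel values and the name `raw_lambda` (an unconstrained scalar passed through a sigmoid) are illustrative assumptions. With $\lambda_c$ close to 1 the modulated kernel sums to nearly zero, so a flat patch produces almost no response while a localized feature still does.

```python
import math

def dkm(kernel, raw_lambda):
    """Subtract the kernel mean, scaled by lambda = sigmoid(raw_lambda)."""
    lam = 1.0 / (1.0 + math.exp(-raw_lambda))
    flat = [w for row in kernel for w in row]
    mean = sum(flat) / len(flat)
    return [[w - lam * mean for w in row] for row in kernel]

def response(kernel, patch):
    """Dot product of a kernel with an equally sized input patch."""
    return sum(k * p for krow, prow in zip(kernel, patch) for k, p in zip(krow, prow))

kernel = [[0.1, 0.1, 0.1],
          [0.1, 0.9, 0.1],
          [0.1, 0.1, 0.1]]
flat_patch = [[1.0] * 3 for _ in range(3)]                    # uniform region
spot_patch = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]  # isolated feature

modulated = dkm(kernel, raw_lambda=20.0)   # lambda very close to 1

print(response(kernel, flat_patch), response(modulated, flat_patch))
print(response(kernel, spot_patch), response(modulated, spot_patch))
```

After modulation the surround weights are negative and the center weight stays positive, which is the center-surround pattern described above.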

A plausible implication is that ATConv bridges the representational gap between local, inductive-bias-rich CNNs and globally-expressive, adaptable transformer-style architectures, enabling both efficient scaling and robust generalization.

References:

(Yu et al., 23 Oct 2025) "Attentive Convolution: Unifying the Expressivity of Self-Attention with Convolutional Efficiency"
(Andreoli, 2019) "Convolution, attention and structure embedding"
(Yin et al., 2017) "Attentive Convolution: Equipping CNNs with RNN-style Attention Mechanisms"