ACNet: Asymmetric Convolutional Networks

Updated 18 March 2026

ACNet is a family of CNN architectures that enhance the central kernel 'skeleton' by combining standard square convolutions with parallel 1D horizontal and vertical filters.
The design fuses these asymmetric branches with batch normalization and additive operations, achieving improved accuracy and feature selectivity in both classification and super-resolution tasks.
Empirical results on benchmarks like CIFAR, ImageNet, and SISR datasets demonstrate ACNet’s robustness to distortions and competitive performance without increasing inference-time computational cost.

Asymmetric Convolutional Networks (ACNet) are a family of convolutional neural network architectures that apply asymmetric convolutions, namely 1D convolutions in the horizontal and vertical directions, in parallel with standard square convolutions. The principal objective is to strengthen the central “skeleton” (the cross formed by the middle row and column) of each convolutional kernel, which has been shown empirically to matter more for discriminative tasks than the remaining weights. ACNet integrates these operations through specialized blocks: Asymmetric Convolution Blocks (ACBs) for classification and Asymmetric Blocks (ABs) for single-image super-resolution (SISR). The resulting architectures show improved accuracy, robustness to input distortions, and enhanced feature selectivity, all with no increase in inference-time computational or memory cost once a model transformation has merged the asymmetric branches into standard square convolutions (Ding et al., 2019; Tian et al., 2021).

1. Architectural Foundations and Motivation

In conventional CNNs, $d\times d$ convolutional kernels uniformly cover spatial neighborhoods without explicit bias toward particular spatial arrangements. However, kernel visualizations reveal that the central row and column (the "skeleton") typically contain higher-magnitude weights and thus bear greater representational load than the corners. ACNet explicitly augments every standard $d\times d$ convolutional layer (for instance, $3\times3$) with two parallel lightweight asymmetric branches: a $1\times d$ (horizontal) and a $d\times1$ (vertical) convolution. Each branch operates on the same input as the square convolution, is followed by its own batch normalization, and is fused with the square branch's output by summation.

In single-image super-resolution (SISR), the observation is similar: standard square convolution treats all pixels equally, but key “power pixels” (usually edges and corners) contribute disproportionately to sharpness and detail. Asymmetric convolutions intensify feature learning along the principal spatial axes, enabling models to better reconstruct high-frequency content (Tian et al., 2021).

2. Mathematical Formalism and Block Design

The Asymmetric Convolution Block (ACB) used in classification comprises three parallel convolution+batch normalization operations:

$d\times d$ convolution (standard kernel)
$1\times d$ convolution (horizontal kernel)
$d\times1$ convolution (vertical kernel)

Each branch computes its own feature map; all three outputs are summed before any activation is applied.

Convolution is additive in the kernel: when kernels with compatible supports are applied to the same input and their outputs are summed, the result equals a single convolution whose kernel is the aligned sum of the individual kernels. For single-channel input $I$ and kernels $K^{(1)}\in\mathbb{R}^{d\times d}$, $K^{(2)}\in\mathbb{R}^{1\times d}$, $K^{(3)}\in\mathbb{R}^{d\times1}$,

$I * K^{(1)} + I * K^{(2)} + I * K^{(3)} = I * \big( K^{(1)} \oplus \mathrm{pad}_{\text{center}}(K^{(2)}) \oplus \mathrm{pad}_{\text{center}}(K^{(3)}) \big)$

where $\oplus$ denotes elementwise addition and $\mathrm{pad}_{\text{center}}$ inserts 1D kernel weights into the center row/column of a $d\times d$ grid.
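The identity above can be checked numerically. The following is a minimal sketch in plain Python, assuming zero "same" padding, stride 1, and the cross-correlation convention that deep learning frameworks call convolution; the helper names (`conv2d_same`, `pad_center`, `add`) and the toy integer values are illustrative assumptions, not taken from the papers.

```python
def conv2d_same(image, kernel):
    """Cross-correlate a 2D image with an odd-sized kernel, zero-padded to keep the size."""
    h, w = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    ph, pw = kh // 2, kw // 2
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0
            for i in range(kh):
                for j in range(kw):
                    yy, xx = y + i - ph, x + j - pw
                    if 0 <= yy < h and 0 <= xx < w:  # outside the image counts as zero
                        acc += image[yy][xx] * kernel[i][j]
            out[y][x] = acc
    return out


def pad_center(kernel, d):
    """Embed a 1 x d or d x 1 kernel into the central row/column of a d x d zero grid."""
    kh, kw = len(kernel), len(kernel[0])
    top, left = (d - kh) // 2, (d - kw) // 2
    grid = [[0] * d for _ in range(d)]
    for i in range(kh):
        for j in range(kw):
            grid[top + i][left + j] = kernel[i][j]
    return grid


def add(a, b):
    """Elementwise sum of two equally sized 2D lists."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


# Toy integer data so that the comparison is exact.
image = [[1, 2, 0, 3], [4, 1, 1, 0], [2, 5, 3, 1], [0, 1, 2, 4]]
k_square = [[1, 0, -1], [2, 1, 0], [0, -1, 1]]   # d x d
k_horizontal = [[1, 2, 3]]                       # 1 x d
k_vertical = [[-1], [4], [2]]                    # d x 1

# Left-hand side: run the three branches separately and sum their outputs.
branch_sum = add(add(conv2d_same(image, k_square),
                     conv2d_same(image, k_horizontal)),
                 conv2d_same(image, k_vertical))

# Right-hand side: add the kernels first, then run one d x d convolution.
merged_kernel = add(add(k_square, pad_center(k_horizontal, 3)),
                    pad_center(k_vertical, 3))
merged_out = conv2d_same(image, merged_kernel)

assert branch_sum == merged_out
```

The equality holds only because all three branches share the same stride and have their padding aligned so that the 1D kernels slide along the central row and column of the square kernel's window.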

During inference, batch normalization parameters are also fused per branch, and the entire block is algebraically collapsed into a single equivalent $d\times d$ convolution.
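The per-branch fusion can be written out explicitly. As a sketch in standard batch-normalization notation (the symbols $\mu_k$, $\sigma_k$, $\gamma_k$, $\beta_k$ for the running mean, standard deviation, scale, and shift of branch $k$ are assumptions introduced here and are not defined elsewhere in this article), each branch output for one output channel is

$\gamma_k \dfrac{I * K^{(k)} - \mu_k}{\sigma_k} + \beta_k = I * \Big( \dfrac{\gamma_k}{\sigma_k} K^{(k)} \Big) + \Big( \beta_k - \dfrac{\mu_k \gamma_k}{\sigma_k} \Big), \quad k = 1, 2, 3.$

Summing the three branches and applying the additivity property gives a single kernel and bias:

$K_{\text{fused}} = \dfrac{\gamma_1}{\sigma_1} K^{(1)} \oplus \mathrm{pad}_{\text{center}}\Big( \dfrac{\gamma_2}{\sigma_2} K^{(2)} \Big) \oplus \mathrm{pad}_{\text{center}}\Big( \dfrac{\gamma_3}{\sigma_3} K^{(3)} \Big), \qquad b_{\text{fused}} = \sum_{k=1}^{3} \Big( \beta_k - \dfrac{\mu_k \gamma_k}{\sigma_k} \Big).$

Here $\sigma_k$ is understood to include the small numerical-stability constant used by the BN implementation, so the folded layer computes $I * K_{\text{fused}} + b_{\text{fused}}$.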

In SISR, the Asymmetric Block (AB) applies the same decomposition in each of its 17 residual layers: three parallel convolutions— $3\times1$ , $3\times3$ , and $1\times3$ —whose results are summed and followed by ReLU activation. This AB is stacked before a Memory Enhancement Block (MEB) and a High-Frequency Feature Enhancement Block (HFFEB) to reconstruct high-resolution images (Tian et al., 2021).

3. Training and Inference Workflow

ACNet introduces no new hyperparameters or special optimization procedures. The architectural substitution involves the following steps:

For each eligible $d\times d$ conv + BN layer in a baseline CNN (e.g., VGG, ResNet, DenseNet), insert an ACB in its place.
Train the modified model end-to-end using the same loss, optimizer, schedule, and augmentations as the original model.
Upon convergence, fuse each ACB into a single $d\times d$ convolution: conduct batchnorm fusion in each branch, then overlay the kernels into one composite kernel as described above.
Finally, reconstruct the original architecture, initializing its standard conv parameters with the merged kernels and biases.
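The fusion step in this workflow can be sketched as follows. This is a minimal illustration in plain Python, assuming a single input and output channel (so each branch's BN reduces to four scalars), inference-mode BN with fixed running statistics, and made-up weights; the function names (`fuse_bn`, `fold_acb`, `response`) are hypothetical and do not come from the authors' code. Equivalence is checked at a single $3\times3$ input patch, where the response of a $1\times3$ or $3\times1$ kernel equals that of its center-padded $3\times3$ version.

```python
import math


def batch_norm(x, gamma, beta, mean, var, eps=1e-5):
    """Inference-mode batch normalization of a scalar activation."""
    return gamma * (x - mean) / math.sqrt(var + eps) + beta


def fuse_bn(kernel, gamma, beta, mean, var, eps=1e-5):
    """Fold a BN layer into the preceding kernel; returns (scaled kernel, bias)."""
    scale = gamma / math.sqrt(var + eps)
    return [[w * scale for w in row] for row in kernel], beta - mean * scale


def pad_center(kernel, d):
    """Embed a 1 x d or d x 1 kernel into the central row/column of a d x d zero grid."""
    kh, kw = len(kernel), len(kernel[0])
    top, left = (d - kh) // 2, (d - kw) // 2
    grid = [[0.0] * d for _ in range(d)]
    for i in range(kh):
        for j in range(kw):
            grid[top + i][left + j] = kernel[i][j]
    return grid


def fold_acb(branches, d):
    """Merge (kernel, bn_params) branches into one d x d kernel and one bias."""
    merged = [[0.0] * d for _ in range(d)]
    bias = 0.0
    for kernel, bn in branches:
        fused, b = fuse_bn(kernel, **bn)      # step 1: BN fusion per branch
        padded = pad_center(fused, d)         # step 2: align on the skeleton
        for i in range(d):
            for j in range(d):
                merged[i][j] += padded[i][j]  # step 3: overlay the kernels
        bias += b
    return merged, bias


def response(patch, kernel):
    """Response of a d x d kernel at one d x d input patch."""
    return sum(p * w for pr, kr in zip(patch, kernel) for p, w in zip(pr, kr))


# Made-up trained weights and BN statistics for the three branches.
branches = [
    ([[0.2, -0.1, 0.0], [0.5, 1.0, -0.3], [0.1, 0.4, -0.2]],
     dict(gamma=1.5, beta=0.1, mean=0.3, var=0.8)),
    ([[0.3, -0.6, 0.9]], dict(gamma=0.7, beta=-0.2, mean=-0.1, var=1.2)),
    ([[0.4], [0.8], [-0.5]], dict(gamma=1.1, beta=0.05, mean=0.2, var=0.5)),
]
patch = [[0.5, -1.0, 2.0], [1.5, 0.25, -0.75], [3.0, 1.0, -2.0]]

# Training-time structure: three conv + BN branches, summed.
unfolded = sum(batch_norm(response(patch, pad_center(kernel, 3)), **bn)
               for kernel, bn in branches)

# Deployment structure: one 3 x 3 convolution with a bias.
merged_kernel, merged_bias = fold_acb(branches, 3)
folded = response(patch, merged_kernel) + merged_bias

assert math.isclose(unfolded, folded, rel_tol=1e-9, abs_tol=1e-12)
```

In a multi-channel layer the same arithmetic would be applied independently to each output channel, since BN statistics are per channel.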

At inference, the architecture and FLOPs are exactly those of the starting network; there is no increase in model size, latency, or memory consumption (Ding et al., 2019).

In the super-resolution domain, ACNet for SISR follows a feed-forward pipeline:

The input low-resolution (LR) image passes through the 17-layer AB.
Low-frequency features are fused via the MEB.
High-frequency fusion and image reconstruction occur in the HFFEB.
The full network is trained with MSE loss, the Adam optimizer, and a staged learning rate schedule (Tian et al., 2021).

4. Empirical Performance and Analytical Insights

On classical image classification benchmarks (CIFAR-10, CIFAR-100, ImageNet), ACNet consistently improves top-1 accuracy across architectures:

Model	Baseline top-1 on CIFAR-10 (%)	ACNet top-1 on CIFAR-10 (%)	Δ top-1 (%)
Cifar-quick	83.13	84.24	+1.11
VGG-16	94.12	94.47	+0.35
ResNet-56	94.31	95.09	+0.78
WRN-16-8	95.56	96.15	+0.59
DenseNet-40	94.29	94.84	+0.55

(Table format and values from Ding et al., 2019.)

Analogous improvements are observed on CIFAR-100 and ImageNet, with gains ranging from +0.2% to +1.5% top-1/top-5 accuracy, depending on the backbone.

Ablation studies demonstrate superior robustness to rotational and flipping distortions: models with both horizontal and vertical asymmetric branches outperform controls across all examined data augmentations. Kernel inspection confirms that the skeleton entries (central row and column) in a folded ACNet have significantly larger $l_1$ magnitudes than corners, and pruning these weights disproportionately degrades accuracy, validating the hypothesis that skeleton strengthening is critical for representational efficacy (Ding et al., 2019).
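The kernel inspection described here can be illustrated with a small helper. This is only a sketch: the function name `skeleton_vs_rest_l1` and the toy kernel are assumptions made for illustration, and the averaging and normalization used in the paper's analysis may differ.

```python
def skeleton_vs_rest_l1(kernel):
    """Mean absolute weight on the skeleton (central row and column) vs. all other entries.

    For a 3 x 3 kernel, the "other entries" are exactly the four corners.
    """
    d = len(kernel)
    c = d // 2
    skeleton, rest = [], []
    for i in range(d):
        for j in range(d):
            target = skeleton if (i == c or j == c) else rest
            target.append(abs(kernel[i][j]))
    return sum(skeleton) / len(skeleton), sum(rest) / len(rest)


# Made-up folded kernel whose weights concentrate on the central cross.
toy_kernel = [[0.1, -0.4, 0.05],
              [0.6, 1.2, -0.5],
              [-0.1, 0.3, 0.02]]

skeleton_mean, corner_mean = skeleton_vs_rest_l1(toy_kernel)
print(f"skeleton: {skeleton_mean:.4f}, corners: {corner_mean:.4f}")
# For this toy kernel the means are approximately 0.6 and 0.0675.
```

Applied to every kernel of a trained layer and averaged, a statistic of this kind would make the skeleton-versus-corner comparison quantitative.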

In SISR, ACNet achieves state-of-the-art or competitive PSNR on Set5, Set14, B100, and Urban100 at upscaling factors of $\times2$, $\times3$, and $\times4$, with speed and parameter count advantages. For instance, on Set14 ($\times3$), ACNet reaches 30.19 dB vs. 30.12 dB for preceding approaches; on Urban100, ACNet-M outperforms alternatives under blind noise settings (Tian et al., 2021).

5. Practical Integration and Limitations

ACNet operates as a plug-and-play design for any CNN with $d\times d$ conv + BN layers:

Integration recipe: Replace each $d\times d$ layer with an ACB, train as usual, then fold the blocks at inference.
No extra hyperparameters: All configurations—branch widths, loss, optimizer—are unaltered from the baseline.
Tasks: Any CNN-based model in classification, detection, or segmentation domains is compatible.

Trade-offs:

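Training cost per affected layer rises because of the parallel branches: each ACB computes and stores three branch outputs, so activation memory grows by roughly $3\times$, while the multiply-accumulate count grows by a factor of $(d+2)/d$ (about $1.67\times$ for $d=3$), since the two 1D kernels add $2d$ weights to the $d^2$ of the square kernel. Inference cost is unchanged.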
For layers without BN or with non-standard dilation/stride, manual adjustments may be necessary for fusion.
Explicit architectural modifications for SISR employ additional memory fusion and high-frequency enhancement modules before final output, yielding end-to-end networks optimized for super-resolving images with high fidelity and robustness (Tian et al., 2021).

6. Extensions, Domain Adaptations, and Open Directions

In classification and dense prediction, folding the asymmetric branches after training introduces no compatibility issues for downstream models, because the folded network has the same structure and parameter shapes as the baseline. In super-resolution, the composite design can potentially be extended by:

Learning kernel lengths or branch weights adaptively.
Employing attention to modulate branch contributions.
Replacing sub-pixel upsampling in the SISR pipeline with deformable convolutions to better manage real-world local distortions.

A plausible implication is that asymmetric convolutional approaches may benefit other low-level vision tasks (e.g., denoising, deblurring) where local salient features have nonuniform spatial distribution (Tian et al., 2021). The explicit focus on the central skeleton further provides a paradigm for analyzing and tailoring kernel structures in more general settings.

Whereas prior “inception”-style designs incorporate multi-scale parallel branches for spatial diversity, ACNet focuses specifically on strengthening the intrinsic skeleton via 1D directional convolutions. The model transformation/folding ensures that all benefits accrued during training are realized at no added cost at deployment, distinguishing ACNet from methods that retain enlarged architectures or auxiliary branches at test time (Ding et al., 2019, Tian et al., 2021). This architecture-neutral principle makes ACNet particularly attractive as a generic upgrade for wide classes of convolutional models.


References

Ding et al. (2019). ACNet: Strengthening the Kernel Skeletons for Powerful CNN via Asymmetric Convolution Blocks.

Tian et al. (2021). Asymmetric CNN for image super-resolution.
