Gated Short Convolution Blocks
- Gated Short Convolution Blocks are architectural units that conditionally activate convolution operations via lightweight binary masks using stochastic methods like the Gumbel-Softmax trick.
- They integrate both spatial and channel-wise gating, enabling selective computation that improves efficiency by reducing FLOPs while maintaining or boosting model accuracy.
- Empirical benchmarks on CIFAR-10, ImageNet, and pose estimation tasks demonstrate significant throughput gains and accuracy improvements over traditional static architectures.
Gated Short Convolution Blocks are architectural primitives for deep neural networks that dynamically control the execution of convolutional operations at the spatial- or channel-level, thereby enabling conditional computation and improving both computational efficiency and modeling performance. Typically implemented as modifications of standard residual blocks, gated short convolution blocks feature lightweight gating units that generate binary masks to selectively activate convolutional kernels on a per-example basis. These masks are induced via stochastic relaxation mechanisms, notably the binary Gumbel-Softmax or Concrete trick, and trained end-to-end to optimize for both accuracy and dynamic sparsity objectives. Such blocks have been demonstrated to yield significant reductions in FLOPs, throughput improvements on modern hardware, and empirically higher accuracy than static compressed architectures of comparable cost (Verelst et al., 2019, Bejnordi et al., 2019).
1. Architectural Designs
Gated short convolution blocks instantiate two principal fine-grained gating paradigms: spatial gating and channel-wise gating.
- Spatial Gating (Dynamic Convolutions): The block input $x$ splits into a main residual path and a gating (mask) branch. The main path computes a residual function $F(x)$, typically a $1\times 1$ convolution followed by a $3\times 3$ depthwise convolution and an optional $1\times 1$ bottleneck projection. The gating branch produces a per-pixel logit map via either a "squeeze unit" (classification) or a small convolutional module (pose estimation). The binary mask $m \in \{0,1\}^{H\times W}$ is sampled using the binary Gumbel-Softmax (see Section 2). The output is $y = x + m \odot F(x)$, with $m$ broadcast across channels (Verelst et al., 2019).
- Channel-wise Gating (Conditional Channel Gated Block): For input $x \in \mathbb{R}^{C\times H\times W}$, a binary mask $g \in \{0,1\}^{C'}$ (one gate per gated channel) is computed by a small gating network operating on globally average-pooled features. The gating vector is inserted between the two convolutional layers of the block, yielding $y = x + W_2 \ast \big(g \odot \delta(W_1 \ast x)\big)$, where $\delta$ is the ReLU nonlinearity and $W_1$, $W_2$ are the block's convolutions (Bejnordi et al., 2019). A minimal sketch of both gating variants follows this list.
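The following PyTorch sketch illustrates the two block designs under simplifying assumptions: the main paths (MobileNetV2-style for spatial gating, a two-convolution residual branch for channel gating), the gating-branch shapes, and the `st_hard_gate` helper (a deterministic straight-through stand-in for the stochastic gate of Section 2) are illustrative choices, not the papers' exact configurations.

```python
import torch
import torch.nn as nn


def st_hard_gate(logits):
    """Deterministic straight-through gate (threshold at 0.5);
    the stochastic Gumbel-Softmax version is sketched in Section 2."""
    soft = torch.sigmoid(logits)
    hard = (soft > 0.5).float()
    return hard + soft - soft.detach()


class SpatialGatedBlock(nn.Module):
    """Spatially gated residual block: one {0,1} gate per spatial position,
    broadcast across channels. Main path assumed MobileNetV2-style."""

    def __init__(self, channels, hidden):
        super().__init__()
        self.main = nn.Sequential(
            nn.Conv2d(channels, hidden, 1, bias=False),
            nn.BatchNorm2d(hidden), nn.ReLU(inplace=True),
            nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden, bias=False),
            nn.BatchNorm2d(hidden), nn.ReLU(inplace=True),
            nn.Conv2d(hidden, channels, 1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.gate = nn.Conv2d(channels, 1, kernel_size=1)  # one logit per position

    def forward(self, x):
        m = st_hard_gate(self.gate(x))      # (N, 1, H, W), values in {0, 1}
        return x + m * self.main(x)         # y = x + m * F(x)


class ChannelGatedBlock(nn.Module):
    """Channel-gated residual block: gate vector inserted between the two
    convolutions, computed from globally average-pooled features."""

    def __init__(self, channels, hidden, reduction=4):
        super().__init__()
        self.conv1 = nn.Sequential(
            nn.Conv2d(channels, hidden, 3, padding=1, bias=False),
            nn.BatchNorm2d(hidden), nn.ReLU(inplace=True))
        self.conv2 = nn.Sequential(
            nn.Conv2d(hidden, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels))
        self.gate = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, hidden))

    def forward(self, x):
        logits = self.gate(x.mean(dim=(2, 3)))        # (N, hidden) channel logits
        g = st_hard_gate(logits)[:, :, None, None]    # one gate per channel
        return x + self.conv2(g * self.conv1(x))      # y = x + W2 * (g * relu(W1 * x))
```

In both variants the gate multiplies the residual branch before the skip addition, so fully gated-off positions or channels contribute nothing beyond the identity path.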
2. Stochastic Gating and Training Mechanisms
Both architectures utilize discrete stochastic relaxation for mask generation:
- Binary Gumbel-Softmax: For spatial gates, the soft gate is computed as $\tilde{m} = \sigma\big((\ell + g_1 - g_0)/\tau\big)$, with gate logit $\ell$, i.i.d. Gumbel(0,1) noise samples $g_0, g_1$, and temperature $\tau$. The hard gate $m = \mathbb{1}[\tilde{m} > 0.5]$ is used in the forward pass, with gradients propagated through the soft relaxation $\tilde{m}$ via the straight-through estimator (Verelst et al., 2019).
- Binary Concrete (Channel Gates): Logits are perturbed by Gumbel noise, divided by the temperature $\tau$, and an argmax is used for hard gating. Backward gradients use the sigmoid relaxation (Bejnordi et al., 2019).
Hyperparameters include the temperature schedule for $\tau$ (fixed or annealed), thresholding at $0.5$, and deterministic gating during late training epochs to fine-tune biases; a minimal gate-sampling sketch follows.
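A minimal sketch of the binary Gumbel-Softmax gate with a straight-through estimator, assuming the sigmoid parameterization above; the function name and the deterministic-evaluation branch are illustrative.

```python
import torch


def binary_gumbel_gate(logits: torch.Tensor, tau: float = 1.0, training: bool = True) -> torch.Tensor:
    """Sample {0,1} gates from logits with a straight-through gradient."""
    if training:
        # Difference of two Gumbel(0,1) samples; clamps avoid log(0).
        u1 = torch.rand_like(logits).clamp_min(1e-9)
        u0 = torch.rand_like(logits).clamp_min(1e-9)
        noise = -torch.log(-torch.log(u1)) + torch.log(-torch.log(u0))
        soft = torch.sigmoid((logits + noise) / tau)
    else:
        # Deterministic gating (e.g., late-epoch fine-tuning or inference).
        soft = torch.sigmoid(logits / tau)
    hard = (soft > 0.5).to(soft.dtype)
    # Forward pass uses the hard gate; backward uses the gradient of the soft gate.
    return hard + soft - soft.detach()
```

During training this can stand in for the deterministic `st_hard_gate` helper used in the Section 1 sketch.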
3. Conditional Sparsity Losses and Training Objectives
Training gated short convolution blocks involves losses that simultaneously optimize task performance and conditional sparsity. The typical objective combines a task loss with one or more gating regularizers, $\mathcal{L} = \mathcal{L}_{\mathrm{task}} + \lambda\,\mathcal{L}_{\mathrm{sparsity}}$:
- Sparsity Losses: Spatial gating applies a sparsity penalty on the active entries of the mask $m$ together with block- and network-level FLOP budget regularization. The FLOP losses penalize deviation from a target budget using network-wise and per-block bounds (see Eq. (8)–(11) in (Verelst et al., 2019)); sketches of a budget loss and of the batch-shaping loss follow this list.
- Batch-Shaping Loss: Channel gating introduces a batch-shaping mechanism matching empirical gate activations to a target Beta distribution via a Cramér–von Mises loss. With a gate's soft activations over a batch sorted as $x_{(1)} \le \dots \le x_{(N)}$ and $F^{*}$ the CDF of the Beta prior, $\mathcal{L}_{\mathrm{BS}} = \tfrac{1}{12N} + \sum_{i=1}^{N}\big[\tfrac{2i-1}{2N} - F^{*}(x_{(i)})\big]^{2}$.
- Final Objective: $\mathcal{L} = \mathcal{L}_{\mathrm{task}} + \lambda_{\mathrm{BS}}\,\mathcal{L}_{\mathrm{BS}} + \gamma\,\mathcal{L}_{0}$, where $\mathcal{L}_{0}$ is a learned penalty enforcing gate sparsity via a reparameterized sigmoid on the gate logits and $\gamma$ is the sparsity weight reported in the results below (Bejnordi et al., 2019).
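The following sketch illustrates both regularizers under stated simplifications: `budget_loss` treats every block as equally expensive (the papers weight blocks by their actual FLOPs), and `batch_shaping_loss` assumes a prior CDF with a closed form, since the Beta prior of Bejnordi et al. (2019) requires a differentiable incomplete-beta function; all names are illustrative.

```python
import torch


def budget_loss(masks, target):
    """Squared deviation of the executed fraction from a target budget,
    applied per block (upper bound) and network-wide (cf. Eq. (8)-(11)
    of Verelst et al., 2019; per-block FLOP weighting omitted)."""
    ratios = torch.stack([m.float().mean() for m in masks])   # executed fraction per block
    per_block = ((ratios - target).clamp(min=0) ** 2).mean()  # penalize exceeding the budget
    network = (ratios.mean() - target) ** 2                   # network-level budget term
    return per_block + network


def batch_shaping_loss(gate_probs, prior_cdf):
    """Cramer-von Mises distance between the empirical distribution of one
    gate's soft activations over the batch and a target prior CDF."""
    x, _ = torch.sort(gate_probs)                             # order statistics x_(1..N)
    n = x.numel()
    i = torch.arange(1, n + 1, dtype=x.dtype, device=x.device)
    return 1.0 / (12 * n) + torch.sum(((2 * i - 1) / (2 * n) - prior_cdf(x)) ** 2)


# Example: push a gate toward a mostly-off Beta(1, 4) prior, whose CDF is closed-form.
probs = torch.rand(256, requires_grad=True)
loss = batch_shaping_loss(probs, lambda x: 1 - (1 - x) ** 4)
```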
4. CUDA Implementations and Sparse Inference
Efficient practical realization of dynamic gating, especially for spatial selection, requires specialized GPU kernels:
- Gather-Scatter CUDA Workflow: Masked positions are gathered into a dense tensor $T$, convolutions are performed only on the activated sites, and the results are scattered back. Pseudocode:

```
# Gather active positions
for n in [0..N−1]:
    for h, w where G[n,h,w] == 1:
        T[idx,:,:,:] ← X[n,:,h,w]
        M[idx] = (n,h,w)
        idx += 1

# 1x1 and sparse 3x3 convolutions (cuDNN)
T′   ← Conv1x1(T)
T′′  ← DW-Conv3x3_sparse(T′, M)
T_out ← Conv1x1(T′′)

# Scatter results
for p in [0..P−1]:
    (n,h,w) = M[p]
    Y[n,:,h,w] += T_out[p,:,:,:]
Y ← ReLU(Y)
```

DW-Conv3x3_sparse leverages the index mapping $M$ to look up neighboring active positions efficiently before applying the kernel (Verelst et al., 2019). A functional illustration of the gather/compute/scatter pattern follows.
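For intuition, here is a plain-PyTorch sketch of the pattern restricted to a 1×1 convolution (a per-position linear map); the function name, tensor layout, and the omission of the depthwise 3×3 step (which needs the neighbor mapping M) are illustrative assumptions, and the actual speedups require the fused CUDA kernels described above.

```python
import torch


def gather_scatter_conv1x1(x, mask, weight, bias=None):
    """Evaluate a 1x1 convolution only at mask-active positions and scatter
    the results back into a dense output. weight has shape (C_out, C_in)."""
    n, c, h, w = x.shape
    idx = mask.reshape(n, h * w).bool()                 # active positions per example
    flat = x.permute(0, 2, 3, 1).reshape(n, h * w, c)   # (N, H*W, C_in)
    gathered = flat[idx]                                # (P, C_in) dense tensor of active sites
    out = gathered @ weight.t()                         # 1x1 conv == per-position linear map
    if bias is not None:
        out = out + bias
    y = torch.zeros(n, h * w, weight.shape[0], dtype=x.dtype, device=x.device)
    y[idx] = out                                        # scatter back to dense layout
    return y.reshape(n, h, w, -1).permute(0, 3, 1, 2)
```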
5. Empirical Results and FLOP Analysis
Empirical studies on CIFAR-10, ImageNet, MPII Pose, and Cityscapes demonstrate:
- FLOP Reductions and Speedups: On MobileNetV2 (224×224 input), 37% fewer MACs yield a 40% throughput gain; ShuffleNetV2 sees a 35% MAC reduction and a 25% throughput gain. Pose estimation with stacked hourglass blocks shows a 75% MAC reduction with +180% images/sec (Verelst et al., 2019).
- Accuracy vs. Efficiency Trade-offs: On CIFAR-10, gated blocks outperform SACT and ConvNet-AIG across all MAC budgets. On ImageNet, a gated ResNet-50 yields 74.60% top-1 accuracy at a MAC count comparable to a baseline ResNet-18 (69.76%). On Cityscapes, Gated-PSPNet50 (pretrained) achieves 74.4% mIoU at 0.76× the baseline's MACs (Bejnordi et al., 2019).
| Model | Accuracy | MACs |
|---|---|---|
| ResNet20 (full) | 91.2% top-1 | 40M |
| ResNet20-BAS + L₀ (γ=0.15) | 91.8% top-1 | 30M |
| ResNet34-BAS + γ=0.1 | 72.55% top-1 | 1.67G |
| ResNet50-BAS + γ=0.15 | 74.60% top-1 | 2.07G |
| ConvNet-AIG-34 | 72.20% top-1 | 1.70G |
| Gated-PSPNet50 (pretrained) | 74.4% mIoU | 0.76× baseline |
6. Gate Usage, Ablation Insights, and Conditionality
Ablation and usage studies reveal key behavioral properties:
- Gate Conditionality: Batch-shaping enables a substantial fraction of gates to operate conditionally, active on some examples and inactive on others; other gates become essential features (always on) or are effectively pruned (always off). L₀ sparsity alone tends to produce coarse, fixed sparsity without fine-grained conditionality (Bejnordi et al., 2019). A measurement sketch follows this list.
- Input Difficulty Modulation: "Hard" ImageNet examples (small objects, fine textures) selectively trigger larger fractions of gates (60–70%), while simple examples utilize less computation (30–40%). Deeper layers specialize gating to class-specific semantics.
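As a simple way to quantify this behavior, the following sketch classifies gates by their firing rates over a validation set; the thresholds (1% / 99%) and the shape of the collected gate tensor are illustrative assumptions.

```python
import torch


def classify_gates(gate_history: torch.Tensor, eps: float = 0.01):
    """gate_history: (num_examples, num_gates) tensor of hard {0,1} gate decisions.
    Returns the fraction of gates that are always-on, always-off, or conditional."""
    rates = gate_history.float().mean(dim=0)           # per-gate firing rate over the data
    always_on = (rates >= 1.0 - eps).float().mean()
    always_off = (rates <= eps).float().mean()
    conditional = 1.0 - always_on - always_off         # fires on some inputs but not others
    return always_on.item(), always_off.item(), conditional.item()
```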
7. Training Recipes and Hyperparameters
Optimal training of gated short convolution blocks involves determination of architectural, stochastic, and loss function parameters:
- Gating Branches: Classification tasks employ SACT-style squeeze units; pose estimation prefers economical conv-based units.
- Stochasticity Schedules: The temperature $\tau$ is fixed at 1 for CIFAR/pose and annealed from 5 to 1 over 100 epochs for ImageNet; the final 20% of epochs use deterministic gates to fine-tune biases (see the schedule sketch after this list).
- Sparsity and Batch-Shaping: The batch-shaping regularization weight is linearly annealed; the sparsity penalty ($\gamma \in [0.01, 0.4]$) is introduced after the batch-shaping warm-up.
- Optimization: Nesterov SGD with momentum 0.9 and weight decay; batch size 256 for CIFAR-10/ImageNet; a poly learning-rate schedule for Cityscapes.
- Loss Weights: The sparsity-loss weights are set separately for classification and for pose estimation; see (Verelst et al., 2019) for the specific values.
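A minimal sketch of the temperature and deterministic-gating schedule described above; the linear annealing form, the epoch-count arguments, and the function names are illustrative assumptions rather than the papers' exact schedules.

```python
def gate_temperature(epoch: int, total_epochs: int,
                     tau_start: float = 5.0, tau_end: float = 1.0) -> float:
    """Linearly anneal tau over the first 100 epochs (ImageNet-style recipe),
    then hold it fixed; for CIFAR/pose, call with tau_start == tau_end == 1."""
    anneal_epochs = min(100, total_epochs)
    t = min(epoch, anneal_epochs) / anneal_epochs
    return tau_start + t * (tau_end - tau_start)


def use_deterministic_gates(epoch: int, total_epochs: int) -> bool:
    """Switch to deterministic gating for the final 20% of epochs to fine-tune biases."""
    return epoch >= int(0.8 * total_epochs)
```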
These hyperparameter schedules and architectural choices collectively achieve end-to-end trainable, fine-grained, conditional gating that efficiently leverages model capacity for challenging inputs, achieving practical speedups substantiated by both FLOP count and hardware throughput measurements (Verelst et al., 2019, Bejnordi et al., 2019).