
Gated Short Convolution Blocks

Updated 2 December 2025
  • Gated Short Convolution Blocks are architectural units that conditionally activate convolution operations through lightweight binary masks, generated with stochastic methods such as the Gumbel-Softmax trick.
  • They integrate both spatial and channel-wise gating, enabling selective computation that improves efficiency by reducing FLOPs while maintaining or boosting model accuracy.
  • Empirical benchmarks on CIFAR-10, ImageNet, and pose estimation tasks demonstrate significant throughput gains and accuracy improvements over traditional static architectures.

Gated Short Convolution Blocks are architectural primitives for deep neural networks that dynamically control the execution of convolutional operations at the spatial or channel level, thereby enabling conditional computation and improving both computational efficiency and modeling performance. Typically implemented as modifications of standard residual blocks, gated short convolution blocks feature lightweight gating units that generate binary masks to selectively activate convolutional kernels on a per-example basis. These masks are generated via stochastic relaxation mechanisms, notably the binary Gumbel-Softmax or Concrete trick, and trained end-to-end to optimize both accuracy and dynamic sparsity objectives. Such blocks have been shown to yield significant reductions in FLOPs, throughput improvements on modern hardware, and empirically higher accuracy than static compressed architectures of comparable cost (Verelst et al., 2019, Bejnordi et al., 2019).

1. Architectural Designs

Gated short convolution blocks instantiate two principal fine-grained gating paradigms: spatial gating and channel-wise gating.

  • Spatial Gating (Dynamic Convolutions): The block input $X_n \in \mathbb{R}^{C \times H \times W}$ splits into a main residual path and a gating (mask) branch. The main path computes

    $$F(X_n) = \mathrm{Conv}_2(\mathrm{BN}(\mathrm{ReLU}(\mathrm{Conv}_1(X_n))))$$

    Typically, $\mathrm{Conv}_1$ is $1\times 1$, followed by a depthwise $3\times 3$ convolution in $\mathrm{Conv}_2$ and an optional $1\times 1$ bottleneck. The gating branch produces a per-pixel logit map $m \in \mathbb{R}^{H\times W}$ via either a "squeeze unit" (classification) or a $1\times 1$ conv module (pose estimation). The binary mask $G \in \{0,1\}^{H\times W}$ is sampled using the binary Gumbel-Softmax (see Section 2). The output is

    $$X_{n+1} = \mathrm{ReLU}(F(X_n) \odot G + X_n)$$

    with $G$ broadcast across channels (Verelst et al., 2019).

  • Channel-wise Gating (Conditional Channel Gated Block): For input $x_\ell \in \mathbb{R}^{C_{\ell} \times H_{\ell} \times W_{\ell}}$, a binary mask $G(x_\ell) = [g_1, \ldots, g_{C_\mathrm{out}}]^{\top}$ with $g_k \in \{0,1\}$ is computed by a small gating network operating on globally average-pooled features. The gating vector is inserted between the two convolutional layers, yielding

    $$x_{\ell+1} = r\big( W_2 \ast (G(x_\ell) \odot r(W_1 \ast x_\ell)) + x_\ell \big)$$

    where $r$ is the ReLU nonlinearity (Bejnordi et al., 2019). A PyTorch-style sketch of both gating variants is given after this list.
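
The following PyTorch-style sketch illustrates how the two gating paradigms wire into a residual block. It is a minimal sketch, not either paper's reference implementation: the module names (SpatialGatedBlock, ChannelGatedBlock, hard_gate), channel widths, and the plain straight-through threshold used in place of full Gumbel-Softmax sampling (see Section 2) are assumptions made for brevity.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    def hard_gate(logits):
        # Straight-through binary gate: hard threshold in the forward pass,
        # sigmoid gradient in the backward pass (Gumbel noise omitted here).
        soft = torch.sigmoid(logits)
        hard = (soft > 0.5).float()
        return soft + (hard - soft).detach()

    class SpatialGatedBlock(nn.Module):
        # Residual block with a per-pixel mask G in {0,1}^{HxW} (Verelst et al., 2019).
        def __init__(self, channels, hidden):
            super().__init__()
            self.conv1 = nn.Conv2d(channels, hidden, 1, bias=False)             # 1x1
            self.bn = nn.BatchNorm2d(hidden)
            self.conv2 = nn.Conv2d(hidden, hidden, 3, padding=1,
                                   groups=hidden, bias=False)                   # depthwise 3x3
            self.conv3 = nn.Conv2d(hidden, channels, 1, bias=False)             # 1x1 bottleneck
            self.gate = nn.Conv2d(channels, 1, 1)                               # logit map m

        def forward(self, x):
            f = self.conv3(self.conv2(self.bn(F.relu(self.conv1(x)))))          # F(X_n)
            g = hard_gate(self.gate(x))                                         # (N, 1, H, W)
            return F.relu(f * g + x)                                            # G broadcast over C

    class ChannelGatedBlock(nn.Module):
        # Residual block gating the intermediate channels (Bejnordi et al., 2019).
        def __init__(self, channels, hidden):
            super().__init__()
            self.conv1 = nn.Conv2d(channels, hidden, 3, padding=1, bias=False)  # W1
            self.conv2 = nn.Conv2d(hidden, channels, 3, padding=1, bias=False)  # W2
            self.gate = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                      nn.Linear(channels, hidden))              # gating network

        def forward(self, x):
            g = hard_gate(self.gate(x))[:, :, None, None]                       # (N, hidden, 1, 1)
            return F.relu(self.conv2(g * F.relu(self.conv1(x))) + x)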

2. Stochastic Gating and Training Mechanisms

Both architectures utilize discrete stochastic relaxation for mask generation:

  • Binary Gumbel-Softmax: For spatial gating, the soft gate is computed as

    $$y_1 = \sigma\left( \frac{m + g_1 - g_2}{\tau} \right)$$

    with $g_1, g_2 \sim \mathrm{Gumbel}(0,1)$ and temperature $\tau$. The hard gate is $G = \mathbb{1}[y_1 > 0.5]$ in the forward pass, with gradients propagated through $y_1$ (straight-through estimator) (Verelst et al., 2019).

  • Binary Concrete (Channel Gates): Logits are perturbed with Gumbel noise, scaled by temperature $\tau$, and the hard gate is obtained by argmax in the forward pass; backward gradients use the sigmoid relaxation (Bejnordi et al., 2019).

Hyperparameters include the $\tau$ schedule (fixed or annealed), thresholding at $0.5$, and deterministic gating during late training epochs to fine-tune biases.
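
A minimal, self-contained sketch of the binary Gumbel-Softmax gate described above, assuming PyTorch; the helper name `gumbel_gate` and its signature are hypothetical. The forward pass is thresholded at $0.5$ while gradients flow through the soft relaxation, and the deterministic mode corresponds to setting $g_1 = g_2 = 0$ late in training.

    import torch

    def gumbel_gate(logits, tau=1.0, stochastic=True, eps=1e-6):
        # Binary Gumbel-Softmax gate: y1 = sigmoid((m + g1 - g2) / tau),
        # hard-thresholded at 0.5 in the forward pass, with gradients
        # propagated through y1 (straight-through estimator).
        if stochastic:
            u1 = torch.rand_like(logits).clamp(eps, 1 - eps)
            u2 = torch.rand_like(logits).clamp(eps, 1 - eps)
            g1 = -torch.log(-torch.log(u1))      # g1 ~ Gumbel(0, 1)
            g2 = -torch.log(-torch.log(u2))      # g2 ~ Gumbel(0, 1)
        else:
            g1 = g2 = torch.zeros_like(logits)   # deterministic late-training mode
        soft = torch.sigmoid((logits + g1 - g2) / tau)
        hard = (soft > 0.5).float()
        return soft + (hard - soft).detach()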

3. Conditional Sparsity Losses and Training Objectives

Training gated short convolution blocks involves losses that jointly optimize task performance and conditional sparsity. Typical objective terms include:

  • Sparsity Losses: Spatial gating uses an $\ell_1$ norm on $G$ together with block- and network-level FLOP budget regularization,

    $$L_\mathrm{sparsity} = \lambda \sum_n \| G^{(n)} \|_1$$

    FLOP losses penalize deviation from a budget $\theta$ using network-wide and per-block bounds (see Eqs. (8)–(11) in Verelst et al., 2019).

  • Batch-Shaping Loss: Channel gating introduces a batch-shaping mechanism matching empirical gate activations to a target Beta distribution via a Cramér–von Mises loss:

    $$S(x^*,\lambda) = \frac{\lambda}{N} \sum_{i=1}^N \left( \frac{i}{N+1} - I_{x^*_{(i)}}(a,b) \right)^2$$

    where $x^*_{(i)}$ are the sorted gate activations across the batch and $I_x(a,b)$ is the regularized incomplete beta function, i.e. the CDF of the target Beta distribution.

  • Final Objective:

    $$L_\mathrm{total} = L_\mathrm{task} + S_\text{batch-shaping} + L_0$$

    $L_0$ is a learned penalty enforcing gate sparsity via a reparameterized sigmoid on the logits (Bejnordi et al., 2019). A sketch of the sparsity and batch-shaping terms is given after this list.
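
The sketch below, assuming PyTorch/SciPy and hypothetical function names, illustrates the $\ell_1$ sparsity term, a simplified stand-in for the FLOP budget penalty, and the Cramér–von Mises batch-shaping statistic from the equations above. The batch-shaping version simply evaluates the statistic with SciPy's Beta CDF for clarity; the trainable loss of Bejnordi et al. (2019) backpropagates through the CDF, the Beta parameters used here are illustrative, and the quadratic budget term is not Eqs. (8)–(11) of Verelst et al. (2019).

    import numpy as np
    import torch
    from scipy.special import betainc

    def sparsity_loss(masks, lam):
        # L_sparsity = lambda * sum_n ||G^(n)||_1 over the blocks' (relaxed) masks.
        return lam * sum(m.abs().sum() for m in masks)

    def budget_loss(active_fraction, theta):
        # Simplified stand-in for the FLOP budget bounds: penalize deviation of
        # the executed fraction of positions from the target budget theta.
        return (active_fraction - theta) ** 2

    def batch_shaping_statistic(gate_acts, a=0.6, b=0.4, lam=0.75):
        # Cramér–von Mises statistic S(x*, lambda): compare the empirical CDF of
        # sorted gate activations x*_(i) with the Beta(a, b) CDF I_x(a, b).
        # Non-differentiable evaluation, for illustration only.
        x = np.sort(gate_acts.detach().flatten().cpu().numpy())   # x*_(i)
        n = x.shape[0]
        empirical = np.arange(1, n + 1) / (n + 1)                 # i / (N + 1)
        cdf = betainc(a, b, np.clip(x, 0.0, 1.0))                 # I_{x*_(i)}(a, b)
        return lam / n * float(((empirical - cdf) ** 2).sum())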

4. CUDA Implementations and Sparse Inference

Efficient practical realization of dynamic gating, especially for spatial selection, requires specialized GPU kernels:

  • Gather-Scatter CUDA Workflow: Masked positions are gathered into a dense tensor $T$, convolutions are performed only on activated sites, and the results are scattered back. Pseudocode:
    
    # Gather active positions into a dense tensor T, recording the mapping M
    for n in [0..N-1]:
        for (h, w) where G[n,h,w] == 1:
            T[idx,:,:,:] ← X[n,:,h,w]
            M[idx] = (n,h,w)
            idx += 1
    # 1x1 and sparse depthwise 3x3 convolutions (cuDNN)
    T ← Conv1x1(T)
    T ← DW-Conv3x3_sparse(T, M)
    T_out ← Conv1x1(T)
    # Scatter results back into the output feature map Y
    for p in [0..P-1]:
        (n,h,w) = M[p]
        Y[n,:,h,w] += T_out[p,:,:,:]
    Y ← ReLU(Y)
    DW-Conv3x3_sparse uses the mapping $M$ to efficiently look up each active site's neighbors before applying the kernel (Verelst et al., 2019). A PyTorch emulation of this gather-scatter flow is sketched below.
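
As a functional reference, the following PyTorch snippet emulates the gather-scatter flow with standard dense operators; the weight tensors `w1`, `wd`, `w2` are hypothetical names. Unlike the CUDA kernels, it evaluates the first 1×1 convolution densely so that the depthwise 3×3 can fetch neighbourhoods via `F.unfold` instead of the mapping $M$; only the gather/scatter bookkeeping and per-position arithmetic are meant to mirror the pseudocode.

    import torch
    import torch.nn.functional as F

    def gather_scatter_block(x, mask, w1, wd, w2):
        # x: (N, C, H, W); mask: (N, H, W) in {0,1}
        # w1: (Cm, C, 1, 1) pointwise; wd: (Cm, 1, 3, 3) depthwise; w2: (C, Cm, 1, 1)
        N, C, H, W = x.shape
        Cm = w1.shape[0]
        # Gather: indices (n, h, w) of the active spatial positions.
        idx = mask.nonzero(as_tuple=False)
        n, h, w = idx[:, 0], idx[:, 1], idx[:, 2]
        # First 1x1 conv (dense here for readability; the CUDA kernel evaluates
        # it only on the gathered positions).
        u = F.conv2d(x, w1)                                      # (N, Cm, H, W)
        # Depthwise 3x3 at active positions: look up each site's 3x3 neighbourhood
        # (the role of the mapping M) and contract with the depthwise kernel.
        patches = F.unfold(u, kernel_size=3, padding=1)          # (N, Cm*9, H*W)
        patches = patches.view(N, Cm, 9, H, W)
        t = (patches[n, :, :, h, w] * wd.view(Cm, 9)).sum(-1)    # (P, Cm)
        # Second 1x1 conv as a per-position linear map on the gathered features.
        t = t @ w2.view(C, Cm).t()                               # (P, C)
        # Scatter: residual add at the active sites only, then ReLU.
        y = x.clone()
        y[n, :, h, w] += t
        return F.relu(y)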

5. Empirical Results and FLOP Analysis

Empirical studies on CIFAR-10, ImageNet, MPII Pose, and Cityscapes demonstrate:

  • FLOP Reductions and Speedups: On MobileNetV2 (224×224, $\theta=0.25$), 37% fewer MACs yield a 40% throughput gain; ShuffleNetV2 sees a 35% MAC reduction with a 25% throughput gain. Pose estimation with stacked hourglass blocks achieves a 75% MAC reduction with +180% images/sec (Verelst et al., 2019).
  • Accuracy vs. Efficiency Trade-offs: On CIFAR-10, gated blocks outperform SACT and ConvNet-AIG across all MAC budgets. On ImageNet, a gated ResNet-50 yields 74.60% top-1 accuracy at a MAC count similar to the ResNet-18 baseline (69.76%). On Cityscapes, a pretrained Gated-PSPNet50 achieves 74.4% mIoU at 0.76× the MACs (Bejnordi et al., 2019).
Model                         Top-1 / mIoU   MACs
ResNet20 (full)               91.2%          40M
ResNet20-BAS + L₀ (γ=0.15)    91.8%          30M
ResNet34-BAS (γ=0.1)          72.55%         1.67G
ResNet50-BAS (γ=0.15)         74.60%         2.07G
ConvNet-AIG-34                72.20%         1.70G
Gated-PSPNet50 (pretrained)   74.4% (mIoU)   0.76× baseline

6. Gate Usage, Ablation Insights, and Conditionality

Ablation and usage studies reveal key behavioral properties:

  • Gate Conditionality: Batch-shaping enables more than 70% of gates to operate conditionally, active on some examples and inactive on others; 20% become essential features (always on) and 10% are always off. An $L_0$ sparsity penalty alone tends to produce coarse, fixed sparsity without fine-grained conditionality (Bejnordi et al., 2019).
  • Input Difficulty Modulation: "Hard" ImageNet examples (small objects, fine textures) selectively trigger larger fractions of gates (~60–70%), while simple examples use less computation (~30–40%). Deeper layers specialize gating to class-specific semantics.

7. Training Recipes and Hyperparameters

Optimal training of gated short convolution blocks involves choices of architectural, stochastic, and loss-function parameters:

  • Gating Branches: Classification tasks employ SACT-style squeeze units; pose estimation prefers economical $1\times 1$ conv-based gating units.
  • Stochasticity Schedules: $\tau$ is fixed at 1 for CIFAR-10 and pose estimation, and annealed from 5 to 1 over 100 epochs for ImageNet. The final 20% of epochs use deterministic gates ($g_1=g_2=0$) to fine-tune biases.
  • Sparsity and Batch-Shaping: The batch-shaping regularization ($\lambda=0.75$) is linearly annealed; the $L_0$ sparsity penalty ($\gamma \in [0.01, 0.4]$) is introduced after the batch-shaping warmup.
  • Optimization: Nesterov SGD with momentum 0.9, weight decay $5 \times 10^{-4}$, and batch size 256 for CIFAR-10/ImageNet; a poly learning-rate schedule for Cityscapes.
  • Loss Weights: Classification tasks set $\alpha=10$; pose estimation uses $\alpha=0.01$ (Verelst et al., 2019). A minimal schedule sketch is given after this list.
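
A minimal sketch of the stochasticity schedule described above, assuming hypothetical helper names; the numeric values are those quoted in the recipes, while the function signatures are illustrative assumptions.

    def tau_schedule(epoch, dataset):
        # tau fixed at 1 for CIFAR-10 and pose estimation; annealed from 5 to 1
        # over the first 100 epochs for ImageNet, then held at 1.
        if dataset != "imagenet":
            return 1.0
        return 5.0 - 4.0 * min(epoch, 100) / 100.0

    def stochastic_gates(epoch, total_epochs):
        # Stochastic Gumbel sampling for the first 80% of epochs; deterministic
        # gates (g1 = g2 = 0) during the final 20% to fine-tune biases.
        return epoch < 0.8 * total_epochs

In a training loop, these values would be passed to the gate each epoch, e.g. gumbel_gate(m, tau=tau_schedule(epoch, "imagenet"), stochastic=stochastic_gates(epoch, total_epochs)) with the sketch from Section 2.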

These hyperparameter schedules and architectural choices together yield end-to-end trainable, fine-grained conditional gating that concentrates computation on challenging inputs, with practical speedups substantiated by both FLOP counts and hardware throughput measurements (Verelst et al., 2019, Bejnordi et al., 2019).
