
Shake-Shake Regularization in Residual Networks

Updated 7 June 2026
  • Shake-shake regularization is a stochastic technique for multi-branch deep neural networks that replaces fixed summation with random affine combinations to boost generalization.
  • It introduces independent random coefficients in both the forward and backward passes, effectively decorrelating branch outputs and reducing overfitting.
  • Empirical evidence on CIFAR-10 and CIFAR-100 demonstrates significant test error reduction and showcases the method’s robustness even in architectures without BatchNorm or skip connections.

Shake-shake regularization is a stochastic regularization technique for multi-branch deep neural networks, especially residual architectures. It mitigates overfitting by replacing the deterministic summation of parallel residual branches with a randomly sampled affine combination during training. Introduced in the context of 3-branch networks, it improves test error on standard benchmarks, outperforming prominent contemporary architectures, and remains effective even in the absence of architectural components such as skip connections or Batch Normalization (Gastaldi, 2017).

1. Mathematical Formulation

Shake-shake regularization modifies the standard aggregation in multi-branch residual network blocks. Given a block input tensor $x_i$ and $B$ parallel branches with residual functions $F_k(x_i; W_i^k)$, the classical update is

$$x_{i+1} = x_i + \sum_{k=1}^{B} F_k(x_i; W_i^k).$$

In shake-shake regularization, the output is instead

$$x_{i+1} = x_i + \alpha_i F_1(x_i; W_i^1) + \beta_i F_2(x_i; W_i^2) + \gamma_i F_3(x_i; W_i^3),$$

under the constraint $\alpha_i + \beta_i + \gamma_i = 1$ for the 3-branch case. The coefficients $(\alpha_i, \beta_i, \gamma_i)$ are sampled from the simplex, either by drawing independently from Uniform(0,1) and normalizing or by using a $\mathrm{Dirichlet}(1,1,1)$ distribution; only the Dirichlet draw is exactly uniform over the simplex. Immediately before the backward pass, the coefficients are re-sampled independently, introducing stochasticity into both activations and gradients.
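As a minimal sketch of the two sampling options described above, the NumPy snippet below draws coefficients with `Generator.dirichlet` and, for comparison, by normalizing independent uniforms. The function names and the seed are illustrative assumptions and are not taken from the original implementation.

```python
import numpy as np

def sample_dirichlet(rng, num_branches=3, size=None):
    """Draw coefficients from Dirichlet(1, ..., 1), i.e. uniformly on the simplex."""
    return rng.dirichlet(np.ones(num_branches), size=size)

def sample_normalized_uniform(rng, num_branches=3, size=None):
    """Draw Uniform(0, 1) values and rescale them so they sum to 1."""
    shape = (num_branches,) if size is None else (size, num_branches)
    u = rng.uniform(0.0, 1.0, size=shape)
    return u / u.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)   # seed chosen only for reproducibility
coeffs = sample_dirichlet(rng)   # shape (3,): (alpha_i, beta_i, gamma_i)
```

Both helpers return non-negative coefficients that sum to 1, so either can be plugged into the block update; they differ only in how the probability mass is spread over the simplex.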

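For the gradient computation, if $L$ is the loss and $\delta_i = \partial L / \partial x_{i+1}$, the gradient w.r.t. the $k$-th residual output is scaled by the corresponding backward coefficient. Denoting the backward coefficients by primes,

$$\frac{\partial L}{\partial F_1} = \alpha_i' \, \delta_i, \qquad \frac{\partial L}{\partial F_2} = \beta_i' \, \delta_i, \qquad \frac{\partial L}{\partial F_3} = \gamma_i' \, \delta_i,$$

where the backward coefficients $(\alpha_i', \beta_i', \gamma_i')$ are independently drawn, causing the effective gradient path to differ from the forward affine combination. At inference, coefficients are set deterministically to their mean values ($1/3$ each for the 3-branch block) to recover standard ensemble-like deterministic behavior.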
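The forward combination, the independently scaled backward pass, and the deterministic inference rule can be sketched with plain NumPy arrays. This is a hedged illustration only: the branch outputs are random stand-ins for the residual functions, the upstream gradient is a placeholder, and the helper names `shake_forward` and `shake_backward` are assumptions rather than part of any published code.

```python
import numpy as np

def shake_forward(x, branch_outputs, coeffs):
    """Block output: x plus the coefficient-weighted sum of branch outputs."""
    out = x.copy()
    for c, f in zip(coeffs, branch_outputs):
        out = out + c * f
    return out

def shake_backward(delta, backward_coeffs):
    """Gradient reaching each branch output: backward coefficient times delta."""
    return [c * delta for c in backward_coeffs]

rng = np.random.default_rng(0)
x = rng.normal(size=(4,))
branches = [rng.normal(size=(4,)) for _ in range(3)]  # stand-ins for F_1..F_3

fwd = rng.dirichlet(np.ones(3))  # forward coefficients
bwd = rng.dirichlet(np.ones(3))  # independently re-sampled backward coefficients

y_train = shake_forward(x, branches, fwd)
delta = np.ones_like(y_train)    # placeholder for dL/dx_{i+1}
branch_grads = shake_backward(delta, bwd)

# Inference: every coefficient is replaced by its mean, 1/3.
y_eval = shake_forward(x, branches, np.full(3, 1.0 / 3.0))
```

Because the backward scaling is not the derivative of the forward combination, an automatic-differentiation framework would typically need a custom gradient rule to reproduce this behavior.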

2. Algorithmic Implementation

A single training iteration for a residual block using shake-shake proceeds as follows:

  1. Initialization: Initialize the weights $W_i^k$, set the learning rate, and select the batch size.
  2. Forward Pass: For each block $i$, sample $(\alpha_i, \beta_i, \gamma_i) \sim \mathrm{Dirichlet}(1,1,1)$ (either at batch-level or image-level). Compute the block output as a weighted affine sum of branch outputs using these coefficients.
  3. Loss Computation: Calculate the training loss $L$.
  4. Backward Pass: For each block $i$, independently sample backward coefficients $(\alpha_i', \beta_i', \gamma_i') \sim \mathrm{Dirichlet}(1,1,1)$ and perform gradient back-propagation, scaling the gradient to each branch by its sampled backward coefficient.
  5. Parameter Update: Update the weights with a gradient step on $L$ (typically using SGD with Nesterov momentum).

Coefficients can be sampled per block for the entire batch (batch-level) or per image (image-level), with image-level sampling providing stronger regularization.
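The difference between batch-level and image-level sampling comes down to the shape of the coefficient draw. The following sketch assumes inputs laid out as (N, C, H, W) and uses hypothetical helper names (`sample_coeffs`, `combine`); it is an illustration of the idea, not the reference implementation.

```python
import numpy as np

def sample_coeffs(rng, batch_size, num_branches=3, level="image"):
    """Return coefficients of shape (batch_size, num_branches).

    level="batch": one draw shared by every image in the mini-batch.
    level="image": an independent draw for each image.
    """
    alpha = np.ones(num_branches)
    if level == "batch":
        c = rng.dirichlet(alpha)  # shape (num_branches,)
        return np.broadcast_to(c, (batch_size, num_branches)).copy()
    if level == "image":
        return rng.dirichlet(alpha, size=batch_size)
    raise ValueError(f"unknown level: {level!r}")

def combine(x, branch_outputs, coeffs):
    """x and each branch output: (N, C, H, W); coeffs: (N, num_branches)."""
    out = x.copy()
    for k, f in enumerate(branch_outputs):
        # Reshape so each image's coefficient broadcasts over C, H, W.
        out = out + coeffs[:, k].reshape(-1, 1, 1, 1) * f
    return out
```

With `level="image"`, each row of the coefficient matrix differs, so every image in the mini-batch sees its own blend of branch features.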

3. Network Architectures and Hyperparameters

Shake-shake regularization is instantiated in both pre-activation 3-branch ResNet variants and ResNeXt-style models:

  • CIFAR-10 Setup: 26-layer pre-activation 3-branch ResNet; each branch consists of the sequence ReLU, Conv3×3, BatchNorm applied twice; feature map sizes progress as 32 → 16 → 8, with width doubling at each downsampling; variant labels such as "2×32d", "2×64d", "2×96d" denote initial filter counts.
  • CIFAR-100 Setup: 29-layer ResNeXt-style network with 2 residual branches (each with 4 grouped convolutions, 64 channels per group; 34.4M parameters in total), without pre-activation.
  • Training Regimen: 1800 epochs; SGD with Nesterov momentum; initial learning rate 0.2, annealed to 0 via cosine schedule; batch sizes 128 (CIFAR-10) and 32 (CIFAR-100); standard data augmentation of random translations and horizontal flips.
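The cosine annealing in the training regimen can be written in a few lines. This sketch assumes a single half-cosine without restarts that is evaluated once per epoch; the source states only the initial rate, the epoch count, and that the rate is annealed to 0, so the per-epoch granularity and the function name are assumptions.

```python
import math

def cosine_lr(epoch, total_epochs=1800, base_lr=0.2):
    """Learning rate annealed from base_lr at epoch 0 down to 0 at total_epochs."""
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * epoch / total_epochs))

# Learning rate at a few points of the schedule.
schedule = [cosine_lr(e) for e in (0, 450, 900, 1350, 1800)]
```

The rate decays slowly at the start and end of training and fastest around the midpoint, where it has dropped to half of the initial value.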

4. Empirical Evaluation

Shake-shake regularization achieves state-of-the-art or near state-of-the-art results on CIFAR-10 and CIFAR-100. Key findings:

  • CIFAR-10 Test Error: Baseline 26-2×32d ResNet achieves 4.27%. Shake-Shake-Image on 26-2×96d achieves 2.86% (average over 5 runs), outperforming DenseNet-BC ($k=40$) at 3.46% and ResNeXt-29,16×64d at 3.58%.
  • CIFAR-100 Test Error: Baseline ResNeXt-29,2×4×64d records 16.34%. Shake-Even-Image variant achieves 15.85% (average over 3 runs).

| Method | Depth | Params | CIFAR-10 | CIFAR-100 |
|---|---|---|---|---|
| Wide ResNet | 28 | 36.5M | 3.80% | 18.30% |
| ResNeXt-29,16×64d | 29 | 68.1M | 3.58% | 17.31% |
| DenseNet-BC ($k=40$) | 190 | 25.6M | 3.46% | 17.18% |
| Shake-Shake-Image (C10) | 26 | 26.2M | 2.86% | - |
| Shake-Even-Image (C100) | 29 | 34.4M | - | 15.85% |

Ablation studies reveal that the most effective configuration applies "shake" separately in both the forward and backward passes at the image level ("Shake-Shake-Image"). Correlation analysis shows that shake-shake decorrelates branch outputs, promoting feature diversity. Removing skip connections still yields improvements (e.g., 4.05% with shake-shake versus 4.84% baseline), indicating the method's applicability beyond standard ResNet topologies. For architectures without Batch Normalization, restricting the range from which coefficients are sampled stabilizes training, whereas excessive randomness may cause divergence.
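One way such a correlation analysis could be carried out is to flatten the outputs of two branches for the same inputs and compute their Pearson correlation. The sketch below uses `np.corrcoef` on synthetic arrays; the helper name and the data are illustrative assumptions, and the exact procedure used in the original study may differ.

```python
import numpy as np

def branch_correlation(f1, f2):
    """Pearson correlation between two flattened branch output tensors."""
    return float(np.corrcoef(f1.ravel(), f2.ravel())[0, 1])

rng = np.random.default_rng(0)
shared = rng.normal(size=(16, 8, 4, 4))             # component common to both branches
f1 = shared + 0.1 * rng.normal(size=shared.shape)   # strongly correlated pair
f2 = shared + 0.1 * rng.normal(size=shared.shape)
g2 = rng.normal(size=shared.shape)                  # independent of f1

high = branch_correlation(f1, f2)  # branches that learned similar features
low = branch_correlation(f1, g2)   # branches that learned unrelated features
```

A lower correlation between the residual branches of a trained block would be consistent with the decorrelation effect described above.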

5. Theoretical Implications and Analysis

Shake-shake regularization introduces stochasticity via dual randomization: in the forward pass, each example is subjected to a random affine blend of branch features, and in the backward pass, the gradient directions are perturbed through independently resampled coefficients. This mechanism is analogous to internal data augmentation and implicit gradient noise, both of which are associated with improved generalization. Decorrelating the branches enforces specialization and robustness in the learned representations.

Empirically, shake-shake regularization consistently reduces test error against comparably parameterized baselines and competes favorably with substantially larger models. Open research questions remain regarding the precise impact of branch alignment dynamics, optimal coefficient distributions, and scalability to extremely deep or domain-shifted models.

6. Implementation Considerations

Shake-shake regularization is distributed under an open-source license at https://github.com/xgastaldi/shake-shake, building on fb.resnet.torch. Key aspects:

  • Coefficients are refreshed immediately preceding each forward and backward pass.
  • Image-level coefficient sampling exerts stronger regularization than batch-level and is recommended for smaller datasets.
  • At inference, coefficients revert to deterministic expectations.
  • Training without BatchNorm or skip connections requires careful hyperparameter tuning, notably narrowing coefficient ranges and appropriately selecting batch size and learning rate schedule.
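The text does not specify how the coefficient range is narrowed. One hypothetical way to do it while keeping the sum-to-one constraint is to pull each Dirichlet draw toward the mean; the `shrink` parameter and the function name below are assumptions made for illustration only.

```python
import numpy as np

def sample_narrowed(rng, num_branches=3, shrink=0.2, size=None):
    """Dirichlet draw pulled toward the mean 1/num_branches.

    shrink=1.0 keeps the full simplex; shrink=0.0 yields the deterministic
    inference-time coefficients. `shrink` is a hypothetical tuning knob.
    """
    if not 0.0 <= shrink <= 1.0:
        raise ValueError("shrink must lie in [0, 1]")
    mean = 1.0 / num_branches
    d = rng.dirichlet(np.ones(num_branches), size=size)
    # Linear interpolation between the mean and the raw draw keeps the sum at 1.
    return mean + shrink * (d - mean)
```

With this parameterization each coefficient stays within `[(1 - shrink) / num_branches, (1 - shrink) / num_branches + shrink]`, so smaller values of `shrink` inject less randomness.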

A plausible implication is that the method's lightweight, modular nature facilitates application in diverse multi-branch architectures, potentially extending to tasks beyond standard image classification (Gastaldi, 2017).

References (1)
