Shake-Shake Regularization in Residual Networks

Updated 7 June 2026

Shake-shake regularization is a stochastic technique for multi-branch deep neural networks that replaces fixed summation with random affine combinations to boost generalization.
It introduces independent random coefficients in both the forward and backward passes, effectively decorrelating branch outputs and reducing overfitting.
Experiments on CIFAR-10 and CIFAR-100 show lower test error than comparable baselines, and the method remains effective in architectures without BatchNorm or skip connections.

Shake-shake regularization is a stochastic regularization technique for multi-branch deep neural networks, especially residual architectures. It mitigates overfitting by replacing the deterministic summation of parallel residual branches with a randomly sampled affine combination during training. Introduced in the context of 3-branch networks, it improves test error on CIFAR-10 and CIFAR-100, outperforms contemporary architectures such as DenseNet-BC and ResNeXt-29, and remains effective even without skip connections or Batch Normalization (Gastaldi, 2017).

1. Mathematical Formulation

Shake-shake regularization modifies the standard aggregation in multi-branch residual network blocks. Given a block input tensor $x_i$ and $B$ parallel branches with residual functions $F_k(x_i; W_i^k)$, the classical update is

$x_{i+1} = x_i + \sum_{k=1}^{B} F_k(x_i; W_i^k).$

In shake-shake regularization, the output is instead

$x_{i+1} = x_i + \alpha_i F_1(x_i; W_i^1) + \beta_i F_2(x_i; W_i^2) + \gamma_i F_3(x_i; W_i^3),$

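under the constraint $\alpha_i + \beta_i + \gamma_i = 1$ for the 3-branch case. The coefficients $(\alpha_i, \beta_i, \gamma_i)$ are sampled from the probability simplex, either uniformly via a Dirichlet $(1,1,1)$ distribution or by drawing independently from Uniform(0,1) and normalizing (the latter is not exactly uniform on the simplex). With two residual branches, as in the configurations of Gastaldi (2017), this reduces to coefficients $\alpha_i$ and $1 - \alpha_i$ with $\alpha_i \sim$ Uniform(0,1). Immediately before the backward pass, the coefficients are re-sampled independently, introducing stochasticity into both activations and gradients.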
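As a minimal sketch of the sampling and forward combination, assuming the Dirichlet $(1,1,1)$ option, the coefficients can be drawn by normalizing independent Gamma(1, 1) variates, a standard way to sample a Dirichlet distribution. The helper names `sample_simplex` and `shake_forward` are illustrative and not taken from the reference implementation, and plain Python lists stand in for tensors.

```python
import random

def sample_simplex(num_branches, rng=random):
    """Draw coefficients uniformly from the probability simplex (Dirichlet(1, ..., 1))."""
    # Normalizing independent Gamma(1, 1) draws yields a Dirichlet(1, ..., 1) sample.
    draws = [rng.gammavariate(1.0, 1.0) for _ in range(num_branches)]
    total = sum(draws)
    return [d / total for d in draws]

def shake_forward(x, branch_outputs, coeffs):
    """Return x + sum_k coeffs[k] * branch_outputs[k], computed elementwise."""
    out = list(x)
    for c, f in zip(coeffs, branch_outputs):
        for j, v in enumerate(f):
            out[j] += c * v
    return out

rng = random.Random(0)                            # fixed seed for reproducibility
coeffs = sample_simplex(3, rng)                   # (alpha_i, beta_i, gamma_i)
x = [1.0, 2.0]                                    # toy block input
branches = [[0.5, 0.5], [1.0, -1.0], [0.0, 2.0]]  # toy stand-ins for F_1, F_2, F_3
y = shake_forward(x, branches, coeffs)
```

Because the coefficients sum to one, three identical branch outputs would give the same result as a single branch added to the skip path.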

For the gradient computation, if $L$ is the loss and $\delta_i = \partial L / \partial x_{i+1}$, the gradient w.r.t. the $k$-th residual output is scaled by the corresponding backward coefficient:

$\frac{\partial L}{\partial F_1} = \alpha'_i \, \delta_i, \quad \frac{\partial L}{\partial F_2} = \beta'_i \, \delta_i, \quad \frac{\partial L}{\partial F_3} = \gamma'_i \, \delta_i,$

where the backward coefficients $(\alpha'_i, \beta'_i, \gamma'_i)$ are drawn independently of the forward coefficients, so the effective gradient path differs from the forward affine combination. At inference, the coefficients are set deterministically to their mean value, $1/3$ each for the 3-branch block, which recovers standard ensemble-like deterministic behavior.
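The three behaviors described here (random forward coefficients, independently resampled backward coefficients, and mean coefficients at inference) can be illustrated with a small list-based sketch. The class name `ShakeCombine` and its methods are hypothetical; a real implementation would hook into a framework's automatic differentiation rather than compute gradients by hand.

```python
import random

class ShakeCombine:
    """Combine branch outputs with shake-shake coefficients (toy, list-based)."""

    def __init__(self, num_branches, seed=0):
        self.num_branches = num_branches
        self.rng = random.Random(seed)

    def _sample(self):
        # Dirichlet(1, ..., 1) sample via normalized Gamma(1, 1) draws.
        draws = [self.rng.gammavariate(1.0, 1.0) for _ in range(self.num_branches)]
        total = sum(draws)
        return [d / total for d in draws]

    def forward(self, x, branch_outputs, training=True):
        # Training: random forward coefficients. Inference: the mean value 1/B.
        if training:
            coeffs = self._sample()
        else:
            coeffs = [1.0 / self.num_branches] * self.num_branches
        self.forward_coeffs = coeffs
        return [xi + sum(c * f[j] for c, f in zip(coeffs, branch_outputs))
                for j, xi in enumerate(x)]

    def backward(self, delta):
        # Fresh coefficients, independent of the forward ones, scale each branch gradient.
        self.backward_coeffs = self._sample()
        branch_grads = [[c * d for d in delta] for c in self.backward_coeffs]
        skip_grad = list(delta)  # the identity path passes the gradient unchanged
        return skip_grad, branch_grads

block = ShakeCombine(num_branches=3)
x = [1.0, 2.0]
branches = [[3.0, 0.0], [0.0, 3.0], [3.0, 3.0]]
y_train = block.forward(x, branches, training=True)
skip_grad, branch_grads = block.backward([1.0, -1.0])
y_eval = block.forward(x, branches, training=False)  # approximately [3.0, 4.0]
```

In evaluation mode the output is the skip input plus the plain average of the branch outputs, so repeated calls give identical results.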

2. Algorithmic Implementation

A single training iteration for a residual block using shake-shake proceeds as follows:

Initialization: Initialize weights $W$, set the learning rate $\eta$, and select a batch size.
Forward Pass: For each block $i$, sample $(\alpha_i, \beta_i, \gamma_i) \sim$ Dirichlet $(1,1,1)$ (either at batch-level or image-level). Compute the block output as a weighted affine sum of branch outputs using these coefficients.
Loss Computation: Calculate the training loss $L$.
Backward Pass: For each block $i$, independently sample backward coefficients $(\alpha'_i, \beta'_i, \gamma'_i) \sim$ Dirichlet $(1,1,1)$ and perform gradient back-propagation, scaling the gradient to each branch by its sampled backward coefficient.
Parameter Update: Update weights via $W \leftarrow W - \eta \nabla_W L$ (typically using SGD with Nesterov momentum).
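The steps above can be sketched end to end on a toy block whose branches are scalar maps $F_k(x) = w_k x$ with a squared-error loss. This is only an illustration under stated assumptions: the function names are hypothetical, the toy branches and loss are not from the source, and plain SGD replaces Nesterov momentum to keep the example short.

```python
import random

def dirichlet_ones(n, rng):
    """Dirichlet(1, ..., 1) sample via normalized Gamma(1, 1) draws."""
    draws = [rng.gammavariate(1.0, 1.0) for _ in range(n)]
    total = sum(draws)
    return [d / total for d in draws]

def train_step(weights, x, target, lr, rng):
    """One shake-shake iteration for a toy block with branches F_k(x) = w_k * x."""
    # Forward pass: random affine combination of branch outputs plus the skip path.
    fwd = dirichlet_ones(len(weights), rng)
    branch_out = [w * x for w in weights]
    out = x + sum(a * f for a, f in zip(fwd, branch_out))
    # Loss computation: squared error, so delta = dL/d(out) = out - target.
    loss = 0.5 * (out - target) ** 2
    delta = out - target
    # Backward pass: independently resampled coefficients scale each branch gradient.
    bwd = dirichlet_ones(len(weights), rng)
    grads = [b * delta * x for b in bwd]  # chain rule with dF_k/dw_k = x
    # Parameter update: plain SGD (momentum omitted in this sketch).
    new_weights = [w - lr * g for w, g in zip(weights, grads)]
    return new_weights, loss

rng = random.Random(0)
weights = [0.0, 0.0, 0.0]
losses = []
for _ in range(200):
    weights, loss = train_step(weights, x=1.0, target=3.0, lr=0.2, rng=rng)
    losses.append(loss)
```

Because the forward and backward coefficients keep changing, the training loss in this toy setting fluctuates rather than settling exactly at zero, which reflects the noise the method injects by design.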

Coefficients can be sampled per block for the entire batch (batch-level) or per image (image-level), with image-level sampling providing stronger regularization.
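The difference between the two sampling levels comes down to how many independent draws are made per block. The sketch below, with a hypothetical helper `sample_block_coeffs`, returns one coefficient vector per image in both modes so that downstream code can treat them uniformly.

```python
import random

def sample_block_coeffs(batch_size, num_branches, level, rng):
    """Return one coefficient vector per image in the mini-batch."""
    def draw():
        g = [rng.gammavariate(1.0, 1.0) for _ in range(num_branches)]
        s = sum(g)
        return [v / s for v in g]

    if level == "batch":
        shared = draw()  # a single draw reused for every image in the batch
        return [list(shared) for _ in range(batch_size)]
    if level == "image":
        return [draw() for _ in range(batch_size)]  # an independent draw per image
    raise ValueError("level must be 'batch' or 'image'")

rng = random.Random(0)
batch_coeffs = sample_block_coeffs(4, 3, "batch", rng)
image_coeffs = sample_block_coeffs(4, 3, "image", rng)
```

With image-level sampling every example in a mini-batch sees a different blend of branches, which is consistent with the stronger regularization described above.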

3. Network Architectures and Hyperparameters

Shake-shake regularization is instantiated in both pre-activation 3-branch ResNet variants and ResNeXt-style models:

CIFAR-10 Setup: 26-layer pre-activation 3-branch ResNet; each residual branch applies the sequence ReLU, Conv $3 \times 3$, BatchNorm twice; feature map sizes progress as $32 \to 16 \to 8$ with width doubling at each downsampling; variant labels such as "2×32d", "2×64d", "2×96d" denote two residual branches with an initial width of 32, 64, or 96 filters.
CIFAR-100 Setup: 29-layer, 2 residual branch ResNeXt-style (each with 4 grouped convolutions, 64 channels per group, 34.4M parameters), without pre-activation.
Training Regimen: 1800 epochs; SGD with Nesterov momentum; initial learning rate 0.2, annealed to 0 via cosine schedule; batch sizes 128 (CIFAR-10) and 32 (CIFAR-100); standard data augmentation of random translations and horizontal flips.
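The cosine schedule in the training regimen can be written as a one-line function. This sketch assumes annealing without restarts and a per-epoch update; whether the reference code updates the rate per epoch or per iteration is not stated here, and the function name `cosine_lr` is illustrative.

```python
import math

def cosine_lr(epoch, total_epochs=1800, base_lr=0.2):
    """Cosine annealing from base_lr at epoch 0 down to 0 at total_epochs (no restarts)."""
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * epoch / total_epochs))

start_lr = cosine_lr(0)      # equals base_lr
mid_lr = cosine_lr(900)      # roughly half of base_lr
final_lr = cosine_lr(1800)   # reaches zero at the end of training
```

The rate decays slowly at first, fastest around the midpoint, and slowly again near the end of training.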

4. Empirical Evaluation

Shake-shake regularization achieves state-of-the-art or near state-of-the-art results on CIFAR-10 and CIFAR-100. Key findings:

CIFAR-10 Test Error: Baseline 26-2×32d ResNet achieves 4.27%. Shake-Shake-Image on 26-2×96d achieves 2.86% (average over 5 runs), outperforming DenseNet-BC ($k=40$) at 3.46% and ResNeXt-29,16×64d at 3.58%.
CIFAR-100 Test Error: Baseline ResNeXt-29,2×4×64d records 16.34%. Shake-Even-Image variant achieves 15.85% (average over 3 runs).

Method	Depth	Params	CIFAR-10	CIFAR-100
Wide ResNet	28	36.5 M	3.80%	18.30%
ResNeXt-29,16×64d	29	68.1 M	3.58%	17.31%
DenseNet-BC ($k=40$)	190	25.6 M	3.46%	17.18%
Shake-Shake-Image (C10)	26	26.2 M	2.86%	–
Shake-Even-Image (C100)	29	34.4 M	–	15.85%

Ablation studies reveal that the most effective configuration applies "shake" separately in both the forward and backward passes at the image level (“Shake-Shake-Image”). Correlation analysis shows that shake-shake decorrelates branch outputs, promoting feature diversity. Removing skip connections still yields improvements (e.g., 4.05% with shake-shake versus 4.84% baseline), indicating the method's applicability beyond standard ResNet topologies. For architectures without Batch Normalization, restricting the sampling range of the coefficients (e.g., to a narrower interval than the full $[0, 1]$) stabilizes training, whereas excessive randomness may cause divergence.
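One simple way to restrict the coefficient range, offered here only as an assumption and not as the procedure used in the paper, is to shrink each simplex sample toward its mean $1/B$ by a factor `strength`. The parameter name, the helper `restricted_coeffs`, and the value 0.2 below are all hypothetical.

```python
import random

def restricted_coeffs(num_branches, strength, rng):
    """Shrink a simplex sample toward the mean 1/B.

    strength=1.0 leaves the sample unrestricted; strength=0.0 is deterministic.
    """
    g = [rng.gammavariate(1.0, 1.0) for _ in range(num_branches)]
    s = sum(g)
    mean = 1.0 / num_branches
    # Interpolating toward the mean keeps the coefficients summing to one.
    return [mean + strength * (v / s - mean) for v in g]

rng = random.Random(0)
narrow = restricted_coeffs(2, 0.2, rng)   # two branches: each value stays in [0.4, 0.6]
fixed = restricted_coeffs(2, 0.0, rng)    # no randomness: [0.5, 0.5]
```

For two branches the unrestricted coefficient is uniform on $[0, 1]$, so a strength of 0.2 confines it to $[0.4, 0.6]$ while preserving the mean used at inference.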

5. Theoretical Implications and Analysis

Shake-shake regularization introduces stochasticity via dual randomization: in the forward pass, each example is subjected to a random affine blend of branch features, and in the backward pass, the gradient directions are perturbed through independently resampled coefficients. This mechanism is analogous to internal data augmentation and implicit gradient noise, both of which are associated with improved generalization. Decorrelating the branches enforces specialization and robustness in the learned representations.

Empirically, shake-shake regularization consistently reduces test error against comparably parameterized baselines and competes favorably with substantially larger models. Open research questions remain regarding the precise impact of branch alignment dynamics, optimal coefficient distributions, and scalability to extremely deep or domain-shifted models.

6. Implementation Considerations

A reference implementation is distributed under an open-source license at [https://github.com/xgastaldi/shake-shake], building on fb.resnet.torch. Key aspects:

Coefficients are refreshed immediately preceding each forward and backward pass.
Image-level coefficient sampling exerts stronger regularization than batch-level and is recommended for smaller datasets.
At inference, coefficients revert to deterministic expectations.
Training without BatchNorm or skip connections requires careful hyperparameter tuning, notably narrowing coefficient ranges and appropriately selecting batch size and learning rate schedule.

A plausible implication is that the method's lightweight, modular nature facilitates application in diverse multi-branch architectures, potentially extending to tasks beyond standard image classification (Gastaldi, 2017).


References

Gastaldi, X. (2017). Shake-Shake regularization. arXiv:1705.07485.
