
Randomized Leaky ReLU (RReLU) Explained

Updated 26 March 2026
  • Randomized Leaky ReLU is a stochastic activation function that uses randomly sampled negative slopes to regularize neural network training and mitigate overfitting.
  • It samples slopes from a uniform distribution during training and switches to a deterministic expected value during inference, ensuring stability.
  • Empirical benchmarks show RReLU performs competitively with ReLU, Leaky ReLU, and PReLU, especially on small- to medium-scale vision datasets.

Randomized Leaky Rectified Linear Unit (RReLU) is a stochastic activation function for neural networks, specifically convolutional networks, designed to improve regularization and reduce overfitting. Unlike standard rectified linear unit (ReLU), which sets all negative inputs to zero, or Leaky ReLU (LReLU) and Parametric ReLU (PReLU), which use deterministic or learned negative slopes, RReLU introduces per-example, per-unit randomization of the negative slope during training. This stochasticity leverages noise as a form of regularization, particularly benefiting small- and medium-scale vision tasks by mitigating overfitting associated with fixed or freely learned negative slopes (Xu et al., 2015).

1. Mathematical Formulation

Let x_{ji} denote the pre-activation of the i-th feature map (channel) for the j-th example. The RReLU function is defined as:

  • Training (Forward pass):

f(x_{ji}) = \begin{cases} x_{ji}, & \text{if } x_{ji} \geq 0 \\ a_{ji} \, x_{ji}, & \text{if } x_{ji} < 0 \end{cases}

where each a_{ji} is sampled independently from a uniform distribution:

a_{ji} \sim \mathrm{Uniform}(l, u), \quad 0 \leq l < u < 1

  • Test (Inference):

f_{\text{test}}(x_{ji}) = \begin{cases} x_{ji}, & \text{if } x_{ji} \geq 0 \\ \mathbb{E}[a_{ji}] \, x_{ji}, & \text{if } x_{ji} < 0 \end{cases}

where \mathbb{E}[a_{ji}] = (l+u)/2.

The derivative with respect to x_{ji} is:

\frac{\partial f}{\partial x_{ji}} = \begin{cases} 1, & x_{ji} \geq 0 \\ a_{ji}, & x_{ji} < 0 \end{cases}

During training, the same a_{ji} drawn in the forward pass is reused to scale gradients in the backward pass. At test time, a_{ji} is fixed to (l+u)/2 for negative inputs (Xu et al., 2015).
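The formulas above can be sketched directly in code. This is an illustrative pure-Python sketch (not the authors' implementation) of the scalar forward and backward rules, with the default bounds l = 0.3 and u = 0.8 discussed later in the article:

```python
import random

def rrelu_forward(x, l=0.3, u=0.8, training=True):
    """Return (output, slope) for a single pre-activation x.

    During training the negative slope is sampled from Uniform(l, u);
    at test time it is fixed to its expectation (l + u) / 2.
    """
    if x >= 0:
        return x, 1.0
    a = random.uniform(l, u) if training else (l + u) / 2
    return a * x, a

def rrelu_backward(grad_out, x, a):
    """Gradient w.r.t. x: reuse the slope a sampled in the forward pass."""
    return grad_out * (1.0 if x >= 0 else a)
```

Positive inputs pass through with slope 1; a negative input is scaled by a slope that is random in training mode and deterministic (0.55 for the default bounds) in test mode.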

2. Training and Test-Time Behavior

RReLU's operation differs between training and testing phases, embedding stochasticity only in the former:

  • Training: For every mini-batch, negative slopes a_{ji} are sampled independently for each unit or channel. This introduces per-activation noise, compelling the network to develop robustness against variations in activation shape.
  • Testing: The negative slope is fixed to its expected value (l+u)/2. This makes RReLU deterministic at inference, analogous to the relationship between Dropout during training and its deterministic test-time scaling (Xu et al., 2015).

This dual behavior allows RReLU to act as a regularizer during learning and as a stable function during inference.
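The train/test contrast can be seen in a few lines. A minimal sketch, assuming the default bounds and a toy input of -1.0: repeated training-mode calls on the same input yield different outputs, while test-mode calls always return x * (l + u)/2.

```python
import random

L, U = 0.3, 0.8  # default bounds from the article

def rrelu(x, training):
    if x >= 0:
        return x
    a = random.uniform(L, U) if training else (L + U) / 2
    return a * x

# Training: repeated calls on the same negative input give varying outputs.
train_outs = {round(rrelu(-1.0, training=True), 6) for _ in range(5)}

# Testing: the output is always x * (L + U) / 2, i.e. deterministic.
test_outs = {rrelu(-1.0, training=False) for _ in range(5)}
assert len(test_outs) == 1  # a single deterministic value
```

This mirrors Dropout's behavior: stochastic during learning, replaced by a fixed expected transformation at inference.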

3. Hyperparameter Recommendations

The core hyperparameters are the lower and upper bounds, l and u, of the uniform distribution from which a_{ji} is sampled. Practical recommendations include:

  • Default settings: l = 0.3 and u = 0.8, yielding a test-time slope of (l+u)/2 = 0.55.
  • Empirical foundation: These values correspond to those used in the winning solution of the Kaggle National Data Science Bowl and were found effective across CIFAR-10, CIFAR-100, and the NDSB plankton classification task.
  • Tuning: (l, u) can be adjusted to the dataset and model to control the degree of “leakiness”; narrower intervals (e.g., [0.4, 0.6]) or broader ones (e.g., [0.2, 0.9]) can be explored.
  • No additional learnable parameters: In contrast to PReLU, RReLU does not require optimization of any extra parameters, maintaining model simplicity (Xu et al., 2015).
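The effect of an interval choice on the test-time slope is simple arithmetic. A small sketch over the intervals mentioned above (the labels are illustrative, not tuned results):

```python
# Expected test-time slope (l + u) / 2 for the candidate intervals
# discussed in the text.
candidates = {
    "default (Kaggle NDSB winner)": (0.3, 0.8),
    "narrower": (0.4, 0.6),
    "broader": (0.2, 0.9),
}

for name, (l, u) in candidates.items():
    print(f"{name}: l={l}, u={u}, test-time slope={(l + u) / 2:.3f}")
```

Note that intervals with the same midpoint (e.g., [0.3, 0.8] and [0.2, 0.9]) share the same test-time slope of 0.55 but inject different amounts of training noise.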

4. Theoretical Rationale

Several mechanisms underlie the improved generalization conferred by RReLU:

  • Implicit regularization: The noise injected into the negative slope prevents the network from over-relying on a single activation configuration, decorrelating feature detectors in a manner similar to Dropout or randomized pooling.
  • Resistance to overfitting: On small-scale datasets (such as CIFAR with 50K samples), Leaky ReLU and learned PReLU can overfit by locking onto fixed or overly adaptive slopes. RReLU perturbs the decision boundaries per example, discouraging co-adaptation and overfitting.
  • Non-zero gradients for x < 0: RReLU, like Leaky ReLU and PReLU, avoids the “dead-unit” phenomenon of ReLU by ensuring negative activations propagate nonzero gradients. The additional randomness further enhances robustness (Xu et al., 2015).
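The dead-unit point can be made concrete by comparing gradients at a negative pre-activation. A sketch (the input value -2.5 and the Leaky ReLU slope 0.01 are illustrative choices; RReLU uses its default test-time slope):

```python
# Gradients w.r.t. x for each activation; only ReLU's vanishes for x < 0.
def grad_relu(x):
    return 1.0 if x >= 0 else 0.0

def grad_leaky(x, a=0.01):
    return 1.0 if x >= 0 else a

def grad_rrelu_test(x, l=0.3, u=0.8):
    return 1.0 if x >= 0 else (l + u) / 2

x = -2.5  # a hypothetical negative pre-activation
print(grad_relu(x), grad_leaky(x), grad_rrelu_test(x))  # 0.0 0.01 0.55
```

A unit stuck with ReLU's zero gradient receives no learning signal for negative inputs, whereas the leaky variants always propagate some gradient.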

5. Empirical Benchmarking and Comparisons

Extensive benchmarking on standardized vision datasets reveals the effectiveness of RReLU relative to other rectified activations. The following table summarizes core empirical results from (Xu et al., 2015):

Activation (CIFAR-100)        Test Error (lower is better)   Test Accuracy
ReLU                          42.95%                         57.05%
Leaky ReLU (a = 0.01)         ≈42.05%                        ≈57.95%
Leaky ReLU (a = 0.055)        ≈40.42%                        ≈59.58%
PReLU                         ≈41.63%                        ≈58.37%
RReLU (l = 0.3, u = 0.8)      40.25%                         ≈59.75%

On a Batch-Norm Inception-style architecture (beginning at inception-3a), RReLU achieves 75.68% test accuracy on CIFAR-100 without ensemble or multi-view testing. On CIFAR-10, test errors are: ReLU 12.45%, Leaky ReLU (0.055) 11.20%, PReLU 11.79%, and RReLU 11.19%. The improvements on CIFAR-100 are more pronounced, indicating greater benefit in more difficult or data-scarce regimes (Xu et al., 2015).

6. Implementation Strategies and Best Practices

Key recommendations for adopting RReLU in practice include:

  • Dataset suitability: RReLU is particularly effective for small- to medium-scale datasets susceptible to overfitting.
  • Integration: During training, construct a mask of slopes a_{ji} by sampling per unit from Uniform(l, u) and multiply each negative x_{ji} by its corresponding a_{ji}. For backpropagation, reuse the sampled a_{ji} for gradient scaling.
  • Inference: Set a_{ji} ≡ (l+u)/2 for deterministic evaluation, with no random sampling.
  • Batch-Norm compatibility: RReLU functions seamlessly alongside Batch-Normalization, with activation-induced noise showing no destabilizing impact.
  • Parameter efficiency: The method incurs no parameter count increase or optimization overhead relative to fixed- or learned-slope alternatives.
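The practices above can be collected into a minimal layer-style sketch. This is illustrative only: pure-Python lists stand in for tensors, and the class name and attributes are assumptions, not a framework API. Per-unit slopes are sampled on each training forward pass, cached for the backward pass, and frozen to (l + u)/2 in evaluation mode.

```python
import random

class RReLU:
    """Minimal RReLU sketch: stochastic in training, deterministic in eval."""

    def __init__(self, l=0.3, u=0.8):
        self.l, self.u = l, u
        self.training = True
        self._slopes = None  # cached slopes for the backward pass

    def forward(self, xs):
        test_a = (self.l + self.u) / 2
        # Slope mask: 1 for non-negative units; sampled (train) or
        # expected (eval) slope for negative units.
        self._slopes = [
            1.0 if x >= 0
            else (random.uniform(self.l, self.u) if self.training else test_a)
            for x in xs
        ]
        return [a * x for a, x in zip(self._slopes, xs)]

    def backward(self, grads):
        # Reuse the slopes sampled in the forward pass.
        return [a * g for a, g in zip(self._slopes, grads)]

layer = RReLU()
out = layer.forward([1.5, -1.0, -2.0])   # negatives get random slopes
back = layer.backward([1.0, 1.0, 1.0])   # same slopes scale the gradient
layer.training = False
det = layer.forward([-2.0])              # deterministic: -2.0 * 0.55
```

Note the layer adds no learnable parameters: the only state is the cached slope mask, which is resampled each training step.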

7. Comparative Perspective and Significance

RReLU challenges the assertion that activation sparsity, as enforced by ReLU, is a principal contributor to improved network performance. Empirical evidence demonstrates that, on several benchmarks, introducing a nonzero, randomized negative slope improves generalization, particularly when deterministic or parameterized counterparts are prone to overfitting. This suggests that scheduled or stochastic relaxation of the activation nonlinearity, especially in data-limited domains, offers a favorable regularization effect with negligible computational or architectural complexity (Xu et al., 2015).
