Randomized Leaky ReLU (RReLU) Explained
- Randomized Leaky ReLU is a stochastic activation function that uses randomly sampled negative slopes to regularize neural network training and mitigate overfitting.
- It samples slopes from a uniform distribution during training and switches to a deterministic expected value during inference, ensuring stability.
- Empirical benchmarks show RReLU performs competitively with ReLU, Leaky ReLU, and PReLU, especially on small- to medium-scale vision datasets.
Randomized Leaky Rectified Linear Unit (RReLU) is a stochastic activation function for neural networks, particularly convolutional networks, designed to improve regularization and reduce overfitting. Unlike the standard rectified linear unit (ReLU), which sets all negative inputs to zero, or Leaky ReLU (LReLU) and Parametric ReLU (PReLU), which use deterministic or learned negative slopes, RReLU randomizes the negative slope per example and per unit during training. This stochasticity leverages noise as a form of regularization, particularly benefiting small- and medium-scale vision tasks by mitigating the overfitting associated with fixed or freely learned negative slopes (Xu et al., 2015).
1. Mathematical Formulation
Let $x_{ji}$ denote the pre-activation of the $i$-th channel (feature map) for the $j$-th example. The RReLU function is defined as:
- Training (forward pass):

$$y_{ji} = \begin{cases} x_{ji}, & \text{if } x_{ji} \ge 0 \\ a_{ji}\, x_{ji}, & \text{if } x_{ji} < 0 \end{cases}$$

where each $a_{ji}$ is sampled independently from a uniform distribution:

$$a_{ji} \sim U(l, u), \qquad 0 \le l < u < 1$$

- Test (inference):

$$y_{ji} = \begin{cases} x_{ji}, & \text{if } x_{ji} \ge 0 \\ \frac{l+u}{2}\, x_{ji}, & \text{if } x_{ji} < 0 \end{cases}$$

where $\frac{l+u}{2} = \mathbb{E}[a_{ji}]$.
The derivative with respect to $x_{ji}$ is:

$$\frac{\partial y_{ji}}{\partial x_{ji}} = \begin{cases} 1, & \text{if } x_{ji} \ge 0 \\ a_{ji}, & \text{if } x_{ji} < 0 \end{cases}$$

During training, the same $a_{ji}$ used in the forward pass is reused for the backward-pass gradients. At test time, $a_{ji}$ is set to $\frac{l+u}{2}$ for negative inputs (Xu et al., 2015).
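As a concrete illustration, the forward and backward rules above can be sketched in NumPy. The slope bounds $l = 1/8$ and $u = 1/3$ used here are one common choice (the reciprocals of the $U(3, 8)$ divisor range discussed below), not a value mandated by the formulation:

```python
import numpy as np

def rrelu_forward(x, l=1/8, u=1/3, training=True, rng=None):
    """RReLU forward pass; returns the output and the local slopes for backprop."""
    if rng is None:
        rng = np.random.default_rng(0)
    if training:
        # One independently sampled slope per (example, unit), as in the formulation above.
        a = rng.uniform(l, u, size=x.shape)
    else:
        # Deterministic test-time slope: the mean of the training distribution.
        a = np.full_like(x, (l + u) / 2)
    slopes = np.where(x >= 0, 1.0, a)   # effective local gradient dy/dx
    return x * slopes, slopes

def rrelu_backward(grad_out, slopes):
    """Backward pass: reuse the slopes sampled in the forward pass."""
    return grad_out * slopes

x = np.array([-2.0, -0.5, 0.0, 1.5])
y, slopes = rrelu_forward(x, training=False)   # negatives scaled by (1/8 + 1/3)/2 = 11/48
```

Note that the returned `slopes` array is exactly the derivative mask, so the backward pass is a single elementwise multiply.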
2. Training and Test-Time Behavior
RReLU's operation differs between training and testing phases, embedding stochasticity only in the former:
- Training: For every mini-batch, negative slopes are sampled independently for each unit or channel. This introduces per-activation noise, compelling the network to develop robustness against variations in activation shape.
- Testing: The negative slope is fixed to its expected value $\frac{l+u}{2}$. This transforms RReLU into a deterministic variant, analogous to the relationship between Dropout during training and its deterministic test-time scaling (Xu et al., 2015).
This dual behavior allows RReLU to act as a regularizer during learning and as a stable function during inference.
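A quick numerical check of this dual behavior (assuming slope bounds $l = 1/8$, $u = 1/3$): averaging many stochastic training-time outputs for a fixed negative input converges to the deterministic test-time output, mirroring Dropout's train/test relationship:

```python
import numpy as np

rng = np.random.default_rng(42)
l, u = 1 / 8, 1 / 3      # assumed slope bounds (reciprocals of the paper's U(3, 8))
x = -1.0                 # a fixed negative pre-activation

# Training: output a * x with a ~ U(l, u), resampled on every forward pass.
train_outputs = rng.uniform(l, u, size=100_000) * x

# Testing: deterministic output using the expected slope (l + u) / 2.
test_output = (l + u) / 2 * x

print(train_outputs.mean())   # close to test_output
print(test_output)            # -11/48 ≈ -0.229
```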
3. Hyperparameter Recommendations
The core hyperparameters are the lower and upper bounds, $l$ and $u$, of the uniform distribution from which $a_{ji}$ is sampled. Practical recommendations include:
- Default settings: $a_{ji} \sim U(3, 8)$ applied as a divisor ($y_{ji} = x_{ji}/a_{ji}$ for negative inputs), corresponding to negative slopes between $1/8$ and $1/3$ and yielding a test-time slope of $\frac{2}{l+u} = \frac{2}{11} \approx 0.18$.
- Empirical foundation: These values correspond to those used in the winning solution of the Kaggle National Data Science Bowl and were found effective across CIFAR-10, CIFAR-100, and the NDSB plankton classification task.
- Tuning: $(l, u)$ can be adjusted to the dataset and model characteristics to control the degree of “leakiness”: narrower intervals inject less noise, while broader intervals strengthen the regularization effect.
- No additional learnable parameters: In contrast to PReLU, RReLU does not require optimization of any extra parameters, maintaining model simplicity (Xu et al., 2015).
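One subtlety worth noting in code: Xu et al. apply the random variable as a divisor ($y = x/a$, $a \sim U(3, 8)$) and divide by the mean divisor at test time, which is not identical to averaging the slope itself. A small sketch of the two resulting test-time slopes:

```python
# Two ways to fix the test-time negative slope, per the defaults above.

# Divisor convention (paper / NDSB winner): y = x / a with a ~ U(3, 8);
# at test time x is divided by the mean divisor (3 + 8) / 2 = 5.5.
l_div, u_div = 3.0, 8.0
slope_divisor = 1.0 / ((l_div + u_div) / 2)   # 2/11 ≈ 0.182

# Slope convention: sample the slope itself in [1/8, 1/3] and average it.
l_slope, u_slope = 1.0 / u_div, 1.0 / l_div
slope_mean = (l_slope + u_slope) / 2          # 11/48 ≈ 0.229

print(round(slope_divisor, 3), round(slope_mean, 3))
```

Either convention is a valid RReLU; what matters is using the same convention consistently in training and inference.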
4. Theoretical Rationale
Several mechanisms underlie the improved generalization conferred by RReLU:
- Implicit regularization: The noise injected into the negative activation slope prevents the network from over-relying on a single activation configuration, decorrelating feature detectors in a manner similar to Dropout or randomized pooling.
- Resistance to overfitting: On small-scale datasets (such as CIFAR with 50K samples), Leaky ReLU and learned PReLU can overfit by locking onto fixed or overly adaptive slopes. RReLU perturbs the decision boundaries per example, discouraging co-adaptation and overfitting.
- Non-zero gradients for $x < 0$: RReLU, like Leaky ReLU and PReLU, avoids the “dead-unit” phenomenon of ReLU by ensuring negative activations propagate non-zero gradients. The additional randomness further enhances robustness (Xu et al., 2015).
5. Empirical Benchmarking and Comparisons
Extensive benchmarking on standardized vision datasets reveals the effectiveness of RReLU relative to other rectified activations. The following table summarizes core empirical results from (Xu et al., 2015):
| Activation (CIFAR-100) | Test Error (lower is better) | Test Accuracy |
|---|---|---|
| ReLU | 42.95% | 57.05% |
| Leaky ReLU (slope $= 0.01$) | ≈42.05% | ≈57.95% |
| Leaky ReLU ($a = 5.5$, slope $\approx 0.18$) | ≈40.42% | ≈59.58% |
| PReLU | ≈41.63% | ≈58.37% |
| RReLU ($a_{ji} \sim U(3, 8)$) | 40.25% | ≈59.75% |
On a Batch-Norm Inception-style architecture (with Batch Normalization beginning at inception-3a), RReLU achieves 75.68% test accuracy on CIFAR-100 without ensembling or multi-view testing. On CIFAR-10, test errors are: ReLU 12.45%, Leaky ReLU ($a = 5.5$) 11.20%, PReLU 11.79%, and RReLU 11.19%. The improvements on CIFAR-100 are more pronounced, indicating greater benefit in harder or more data-scarce regimes (Xu et al., 2015).
6. Implementation Strategies and Best Practices
Key recommendations for adopting RReLU in practice include:
- Dataset suitability: RReLU is particularly effective for small- to medium-scale datasets susceptible to overfitting.
- Integration: During training, construct a mask of slopes by sampling each $a_{ji}$ from $U(l, u)$ per unit, and multiply negative pre-activations $x_{ji}$ by their corresponding $a_{ji}$. For backpropagation, reuse the sampled $a_{ji}$ for gradient scaling.
- Inference: Set $a_{ji} = \frac{l+u}{2}$ for deterministic evaluation, with no random sampling.
- Batch-Norm compatibility: RReLU functions seamlessly alongside Batch-Normalization, with activation-induced noise showing no destabilizing impact.
- Parameter efficiency: The method incurs no parameter count increase or optimization overhead relative to fixed- or learned-slope alternatives.
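The integration and inference recommendations above can be folded into a minimal layer object. This NumPy sketch (class name and API are illustrative, not from the paper) caches the sampled slopes for gradient reuse and switches to the expected slope in evaluation mode:

```python
import numpy as np

class RReLULayer:
    """Minimal RReLU layer sketch (illustrative API, not from the paper)."""
    def __init__(self, lower=1/8, upper=1/3, seed=0):
        self.lower, self.upper = lower, upper
        self.rng = np.random.default_rng(seed)
        self.training = True
        self._grad_mask = None                # dy/dx cached by the last forward pass

    def forward(self, x):
        if self.training:
            # Per-unit random slopes, resampled on every forward pass.
            a = self.rng.uniform(self.lower, self.upper, size=x.shape)
        else:
            # Deterministic expected slope at inference; no sampling.
            a = (self.lower + self.upper) / 2
        self._grad_mask = np.where(x >= 0, 1.0, a)
        return x * self._grad_mask

    def backward(self, grad_out):
        # No learnable parameters: just rescale by the cached forward slopes.
        return grad_out * self._grad_mask

layer = RReLULayer()
layer.training = False
y = layer.forward(np.array([-4.8, 2.0]))      # -4.8 * 11/48 = -1.1; positives pass through
```

For production use, framework-native implementations (e.g., PyTorch's `torch.nn.RReLU`, whose default bounds are also $1/8$ and $1/3$) handle the train/eval switch automatically.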
7. Comparative Perspective and Significance
RReLU challenges the assertion that activation sparsity, as enforced by ReLU, is a principal contributor to improved network performance. Empirical evidence demonstrates that, on several benchmarks, introducing a nonzero, randomized negative slope improves generalization, particularly when deterministic or parameterized counterparts are prone to overfitting. This suggests that scheduled or stochastic relaxation of the activation nonlinearity, especially in data-limited domains, offers a favorable regularization effect with negligible computational or architectural complexity (Xu et al., 2015).