DEEP R Algorithm (Sparse Training & RL)
- The name "DEEP R" is shared by two unrelated frameworks: Deep Rewiring for sparse neural network training via continual rewiring, and deep R-learning for average-reward reinforcement learning with dueling networks.
- In sparse training, Deep Rewiring maintains a fixed number of active connections, using noisy stochastic updates with a Bayesian interpretation to reinforce task-relevant links.
- The reinforcement-learning variant employs differential TD updates and a dueling architecture to robustly approximate average rewards in high-dimensional environments.
The term "DEEP R algorithm" encompasses two unrelated frameworks in machine learning and optimization: (1) Deep Rewiring (DEEP R) for sparse neural network training, and (2) Deep R-learning (particularly in the “dueling deep R-network”) for average-reward reinforcement learning. Each represents a distinct methodology, application domain, and theoretical foundation.
1. Deep Rewiring (DEEP R) for Sparse Neural Network Training
DEEP R, introduced by Bellec, Kappel, Maass, and Legenstein, addresses the challenge of training deep networks under strict connectivity constraints. The algorithm enforces a fixed and exactly bounded number of active connections, maintaining high performance at extreme sparsity without requiring a dense model at any point (Bellec et al., 2017).
1.1 Model Definition and Parametrization
A neural network is parameterized by $\theta = (\theta_1, \dots, \theta_M)$, one parameter $\theta_k$ per potential connection, each with a fixed sign $s_k \in \{-1, +1\}$. The actual connection weight is defined as:

$$w_k = \begin{cases} s_k \theta_k & \text{if } \theta_k \ge 0 \text{ (active)} \\ 0 & \text{if } \theta_k < 0 \text{ (dormant)} \end{cases}$$

A strict budget is enforced: exactly $K$ active connections throughout training.
1.2 Training Procedure
At each iteration:
- Gradient and Noise Update on Active Connections: For all active $k$ (i.e., $\theta_k \ge 0$), an update is performed:

$$\theta_k \leftarrow \theta_k - \eta \frac{\partial E_{X,Y}(\theta)}{\partial \theta_k} - \eta \alpha + \sqrt{2 \eta T}\, \nu_k$$

where:
  - $E_{X,Y}$: regularized loss (e.g., cross-entropy + $\ell_1$)
  - $\eta$: learning rate
  - $\alpha$: $\ell_1$-regularization coefficient
  - $T$: temperature controlling noise magnitude
  - $\nu_k \sim \mathcal{N}(0, 1)$ (independent)
- Deactivate Dormant Connections: If $\theta_k < 0$ post-update, set $w_k = 0$; the connection becomes dormant.
- Re-activate to Maintain Exact Sparsity: If the active count drops below $K$, uniformly sample dormant indices $k'$ and set $\theta_{k'} = 0$ (newly activated) until $K$ connections are active.
This delete-then-regrow operation instantly adapts the network topology, targeting task-relevant connectivity.
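The delete-then-regrow loop can be sketched in a few lines. The following is a minimal NumPy illustration, not the authors' code: the function name `deep_r_step`, the initialization, and all hyperparameter values are assumptions for this example.

```python
# Illustrative sketch of one DEEP R iteration (NumPy); hyperparameters and
# the flat parameter layout are assumptions, not the reference implementation.
import numpy as np

rng = np.random.default_rng(0)

def deep_r_step(theta, sign, grad, K, eta=0.05, alpha=1e-4, T=1e-3):
    """One rewiring step: noisy gradient update on active connections,
    deactivation of negative parameters, regrowth until exactly K are active."""
    active = theta >= 0.0
    noise = rng.standard_normal(theta.shape)
    # Gradient + L1 term + Gaussian noise, applied only to active connections.
    theta = np.where(active,
                     theta - eta * grad - eta * alpha + np.sqrt(2 * eta * T) * noise,
                     theta)
    # Connections whose parameter crossed zero become dormant.
    active = theta >= 0.0
    # Regrow: reactivate uniformly sampled dormant connections at theta = 0.
    deficit = K - int(active.sum())
    if deficit > 0:
        dormant_idx = np.flatnonzero(~active)
        revive = rng.choice(dormant_idx, size=deficit, replace=False)
        theta[revive] = 0.0
        active[revive] = True
    # Effective weights: w_k = s_k * theta_k for active connections, else 0.
    w = np.where(active, sign * theta, 0.0)
    return theta, w

# Usage: 10 potential connections, budget K = 4 (first 4 start active).
theta = -np.abs(rng.standard_normal(10))      # all dormant
theta[:4] = np.abs(rng.standard_normal(4))    # exactly K active
sign = rng.choice([-1.0, 1.0], size=10)
grad = rng.standard_normal(10)
theta, w = deep_r_step(theta, sign, grad, K=4)
```

Because the update never activates a dormant connection and regrowth exactly restores the deficit, the active count after every step is exactly $K$.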
1.3 Bayesian Formulation and Theoretical Guarantees
DEEP R is grounded in a Bayesian framework: parameters and connectivity are treated as a sample from a posterior

$$p^*(\theta, c) \propto p(X, Y \mid \theta, c)\, p(\theta)\, p(c),$$

enforcing:
- an $\ell_1$ (Laplace) prior $p(\theta_k) \propto e^{-\alpha |\theta_k|}$ on each $\theta_k$,
- a uniform prior on all binary masks $c \in \{0, 1\}^M$ such that $\sum_k c_k = K$,
- the constraint that $c_k = 0$ implies $w_k = 0$.
Two convergence results are proven:
- Soft-DEEP R samples from the tempered posterior without a hard budget (Theorem 1).
- DEEP R (with hard $K$-budget): the joint Markov chain over $(\theta, c)$ is shown to have a unique invariant distribution exactly matching the constrained Bayesian posterior (Theorem 2).
1.4 Algorithmic Summary (Pseudocode)
- For each $k$ with $\theta_k \ge 0$, update $\theta_k$ as above
- If $\theta_k < 0$, deactivate (set $w_k = 0$)
- While the number of active connections is $< K$: select a dormant $k'$ uniformly at random, set $\theta_{k'} = 0$, and activate it
1.5 Enforcing and Preserving Exact Sparsity
Sparsity is preserved by an implicit mask: $c_k = 1$ iff $\theta_k \ge 0$, so $w_k = c_k\, s_k\, \theta_k$. Any time $\theta_k$ crosses zero, the connection immediately becomes dormant and a randomly chosen dormant connection is reactivated, strictly maintaining $\sum_k c_k = K$.
1.6 Hyperparameterization
- $\eta$ (learning rate): e.g., 0.05 for MNIST; the Adam optimizer for the TIMIT LSTM
- $\alpha$ ($\ell_1$-regularization coefficient)
- $T$ (“temperature”): $T \to 0$ is almost deterministic; $T > 0$ maintains Bayesian exploration
- $K$: enforced sparsity budget

Tuning $\alpha$ and $K$ for the desired sparsity and hardware constraints is standard. $\eta$ is robust; $T$ can be annealed or held constant.
1.7 Empirical Results and Comparison
Extensive experiments show DEEP R matches or outperforms post-hoc pruning, $\ell_1$-shrinkage, and fixed-mask approaches at strict sparsity budgets, with particularly strong performance in the highly sparse regime and for recurrent networks. Key results (see (Bellec et al., 2017), table below):
| Method | MNIST (1%) | MNIST (10%) | CIFAR-10 (5%) | CIFAR-10 (20%) | TIMIT LSTM (10%) | TIMIT LSTM (20%) |
|---|---|---|---|---|---|---|
| Fully connected | 98.2% | 98.2% | 86.5% | 86.5% | 28.3% | 28.3% |
| Post-hoc pruning | 96.1% | 97.5% | 84.0% | 86.0% | 29.0% | 28.5% |
| $\ell_1$-shrinkage | 95.8% | 97.2% | 83.8% | 85.8% | 29.3% | 28.9% |
| Fixed random mask | 90.2% | 96.0% | 80.1% | 85.5% | 30.1% | 28.9% |
| DEEP R | 96.3% | 97.8% | 84.1% | 86.3% | 27.9% | 28.4% |
Significant findings:
- Only DEEP R and soft-DEEP R maintain performance as the connectivity budget $K$ decreases toward extreme sparsity; pruning and shrinkage degrade sharply
- For LSTM recurrent networks, DEEP R avoids large error jumps observed in pruning
- Continual rewiring supports online re-tasking and feature transfer across tasks (Bellec et al., 2017)
2. Deep R-Learning and Dueling Deep R-Networks in Reinforcement Learning
A separate line of research refers to “Deep R-network” or “dueling deep R-network” (DDR), combining R-learning with deep neural function approximators in average-reward reinforcement learning (Xu et al., 2021).
2.1 MDP and Average-Reward Problem Structure
The framework targets continuing, undiscounted MDPs $(\mathcal{S}, \mathcal{A}, P, r)$:
- State: Summary of Age-of-Information (AoI) at sensors and users
- Action: Sensor activation, subject to a constraint on the number of sensors that may update simultaneously
- Reward: Negative of weighted AoI and energy cost
- Objective: Maximize the long-run average reward $\rho^\pi = \lim_{T \to \infty} \frac{1}{T}\, \mathbb{E}\big[\sum_{t=1}^{T} r(s_t, a_t)\big]$
2.2 R-Learning Objective and Update Equations
For the undiscounted average-reward setting, R-learning uses the differential action-value:

$$R^\pi(s, a) = \mathbb{E}\left[\sum_{t=0}^{\infty} \big( r(s_t, a_t) - \rho^\pi \big) \,\Big|\, s_0 = s,\ a_0 = a \right]$$

The temporal-difference update is:

$$R(s, a) \leftarrow R(s, a) + \beta \left( r - \rho + \max_{a'} R(s', a') - R(s, a) \right)$$

$\rho$ is a running estimate of the average reward (the gain), updated via minibatch-accumulated TD-errors.
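The underlying tabular update (Schwartz's R-learning, which the deep variant approximates with a network) can be sketched as follows. The names `beta`, `kappa`, and the dictionary-based table are assumptions for this illustration, and for simplicity the average-reward estimate is updated from every TD-error rather than only on greedy actions, as in the classical rule.

```python
# Tabular R-learning sketch; parameter names and the simplified rho update
# are assumptions for illustration, not a specific published implementation.
def r_learning_update(R, rho, s, a, r, s_next, actions, beta=0.1, kappa=0.01):
    """Differential TD update for average-reward RL.
    R: dict mapping (state, action) -> differential action-value."""
    best_next = max(R.get((s_next, b), 0.0) for b in actions)
    td_error = r - rho + best_next - R.get((s, a), 0.0)
    R[(s, a)] = R.get((s, a), 0.0) + beta * td_error
    rho = rho + kappa * td_error  # running average-reward (gain) estimate
    return R, rho

# Usage: a single update starting from an empty table.
R, rho = {}, 0.0
R, rho = r_learning_update(R, rho, s=0, a=1, r=1.0, s_next=0, actions=[0, 1])
```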
2.3 Dueling Deep R-Network Structure
Function approximation is handled as in dueling DQNs: two separate streams learn a state-value $V(s; \theta_V)$ and per-action advantages $A(s, a; \theta_A)$, combined (with mean-centered advantages, as in standard dueling networks) as

$$R(s, a; \theta) = V(s; \theta_V) + A(s, a; \theta_A) - \frac{1}{|\mathcal{A}|} \sum_{a'} A(s, a'; \theta_A),$$

which stabilizes and accelerates learning.
A target network and experience replay buffer are maintained for stable deep RL training.
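The value-advantage aggregation itself is a one-liner. A minimal NumPy sketch, assuming mean-centered advantages as in standard dueling networks (`dueling_r_values` is a hypothetical name standing in for the combination step after the two network heads):

```python
# Dueling aggregation sketch (NumPy); the two head outputs are given as
# plain arrays here rather than produced by an actual network.
import numpy as np

def dueling_r_values(value, advantages):
    """Combine a scalar state-value and per-action advantages into
    differential action-values, centering advantages for identifiability."""
    return value + (advantages - advantages.mean())

adv = np.array([1.0, 3.0, 2.0])
r_vals = dueling_r_values(value=5.0, advantages=adv)
# Greedy action is the argmax over the combined values.
greedy = int(np.argmax(r_vals))
```

Centering the advantages removes the degree of freedom that would otherwise let constant shifts move between the two streams.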
2.4 Pseudocode Outline
Key steps:
- Initialize the experience replay buffer, network parameters $\theta = (\theta_V, \theta_A)$, target network $\theta^-$, and average-reward estimate $\rho$
- For each step $t$:
  - With probability $\epsilon$, choose a random action; otherwise choose $a = \arg\max_a R(s, a; \theta)$
  - Execute the action; observe the next state and compute the reward (negative cost)
  - Store the transition in the replay buffer
  - Once the buffer is populated, sample a minibatch, compute the TD-errors, update the average-reward estimate $\rho$, and perform gradient descent on the MSE loss
  - Update the target network periodically
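The action-selection and replay steps above can be sketched as follows. This is a hedged skeleton, not the authors' implementation: `Transition`, `select_action`, and `store_and_sample` are assumed names, and the dueling network's forward pass is abstracted as an `r_values` callable.

```python
# Skeleton of the epsilon-greedy / replay-buffer machinery in the DDR loop;
# all names are illustrative placeholders, not the paper's code.
import random
from collections import deque, namedtuple

Transition = namedtuple("Transition", "s a r s_next")
buffer = deque(maxlen=10_000)  # experience replay buffer

def select_action(state, actions, r_values, epsilon, rng):
    """Epsilon-greedy over differential action-values R(s, a)."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    return max(actions, key=lambda a: r_values(state, a))

def store_and_sample(transition, batch_size, rng):
    """Append a transition; return a minibatch once the buffer is populated."""
    buffer.append(transition)
    if len(buffer) < batch_size:
        return None
    return rng.sample(list(buffer), batch_size)
```

Each sampled minibatch would then feed the TD-error computation and the average-reward update from Section 2.2.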
2.5 Addressing High Dimensionality and Unknown Dynamics
DDR addresses exponential state-action complexity via:
- Deep neural generalization
- Experience replay buffer
- Target network stabilization
- Dueling value-advantage decomposition
- Model-free R-learning
2.6 Empirical Performance on IoT Status Updates
On the status update optimization task (8 sensors, up to 48 users), DDR outperforms DR-DSU, dueling DQNs, vanilla DQNs, and heuristic/random policies. It achieves higher mean average reward and faster convergence. Sample table (N=24 users) (Xu et al., 2021):
| Algorithm | Mean Avg. Reward | Std. Dev. |
|---|---|---|
| DDR-DSU | -36.59 | 0.19 |
| DR-DSU | -36.67 | 0.17 |
| DDQ-DSU | -38.38 | 0.34 |
| DQ-DSU | -38.35 | 0.30 |
DDR-based policies are robust to state-action space explosion and unknown environment dynamics.
3. Comparative Summary and Nomenclature
Despite similar names, DEEP R for sparse neural network training (Bellec et al., 2017) and deep R-learning for RL (Xu et al., 2021) are unrelated algorithmic paradigms:
- DEEP R (Bellec et al., 2017): Enforces strict network sparsity during supervised training via continual stochastic rewiring and Bayesian posterior sampling.
- Dueling Deep R-Network (DDR): Solves average-reward reinforcement learning via differential TD updates, deep function approximation, and dueling architecture.
Both algorithms are strongly supported by theoretical and empirical analysis in their respective domains, but they should not be conflated: beyond stochastic optimization and the letter "R" in their names, they share no methodology. Each addresses a different class of modern machine-learning optimization problems.