DEEP R Algorithm (Sparse Training & RL)
- The name "DEEP R" is shared by two unrelated frameworks: Deep Rewiring for sparse neural network training via continual rewiring, and deep R-learning for average-reward reinforcement learning with dueling networks.
- In sparse training, Deep Rewiring maintains a fixed number of active connections, using noisy stochastic updates with a Bayesian interpretation to reinforce task-relevant links.
- The reinforcement-learning variant employs differential TD updates and a dueling architecture to robustly approximate average rewards in high-dimensional environments.
The term "DEEP R algorithm" encompasses two unrelated frameworks in machine learning and optimization: (1) Deep Rewiring (DEEP R) for sparse neural network training, and (2) Deep R-learning (particularly in the “dueling deep R-network”) for average-reward reinforcement learning. Each represents a distinct methodology, application domain, and theoretical foundation.
1. Deep Rewiring (DEEP R) for Sparse Neural Network Training
DEEP R, introduced by Bellec, Kappel, Maass, and Legenstein, addresses the challenge of training deep networks under strict connectivity constraints. The algorithm enforces a fixed and exactly bounded number of active connections, maintaining high performance at extreme sparsity without requiring a dense model at any point (Bellec et al., 2017).
1.1 Model Definition and Parametrization
A neural network is parameterized by $\theta = (\theta_1, \dots, \theta_M)$, one parameter $\theta_k$ per potential connection, each with a fixed sign $s_k \in \{-1, +1\}$. The actual connection weight is defined as:

$$w_k = \begin{cases} s_k \theta_k & \text{if } \theta_k \ge 0 \text{ (active)} \\ 0 & \text{if } \theta_k < 0 \text{ (dormant)} \end{cases}$$

A strict budget is enforced: exactly $K$ active connections throughout training.
1.2 Training Procedure
At each iteration:
- Gradient and Noise Update on Active Connections: For all active $k$ (i.e., $\theta_k \ge 0$), an update is performed:

$$\theta_k \leftarrow \theta_k - \eta \frac{\partial E_{X,Y}(\theta)}{\partial \theta_k} - \eta \alpha + \sqrt{2 \eta T}\, \nu_k$$

where:
  - $E_{X,Y}$: regularized loss (e.g., cross-entropy + $\ell_1$)
  - $\eta$: learning rate
  - $\alpha$: $\ell_1$-regularization coefficient
  - $T$: temperature controlling noise magnitude
  - $\nu_k \sim \mathcal{N}(0, 1)$ (independent)
- Deactivate Dormant Connections: If $\theta_k < 0$ post-update, set $w_k = 0$; the connection becomes dormant.
- Re-activate to Maintain Exact Sparsity: If the active count drops below $K$, uniformly sample dormant indices $k'$ and set $\theta_{k'} = 0$ (newly activated) until $K$ connections are active.
This delete-then-regrow operation instantly adapts the network topology, targeting task-relevant connectivity.
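The delete-then-regrow loop can be sketched in a few lines. The following is a minimal NumPy illustration, not the authors' code: the function name `deep_r_step`, the initialization, and all hyperparameter values are assumptions for this example.

```python
# Illustrative sketch of one DEEP R iteration (NumPy); hyperparameters and
# the flat parameter layout are assumptions, not the reference implementation.
import numpy as np

rng = np.random.default_rng(0)

def deep_r_step(theta, sign, grad, K, eta=0.05, alpha=1e-4, T=1e-3):
    """One rewiring step: noisy gradient update on active connections,
    deactivation of negative parameters, regrowth until exactly K are active."""
    active = theta >= 0.0
    noise = rng.standard_normal(theta.shape)
    # Gradient + L1 term + Gaussian noise, applied only to active connections.
    theta = np.where(active,
                     theta - eta * grad - eta * alpha + np.sqrt(2 * eta * T) * noise,
                     theta)
    # Connections whose parameter crossed zero become dormant.
    active = theta >= 0.0
    # Regrow: reactivate uniformly sampled dormant connections at theta = 0.
    deficit = K - int(active.sum())
    if deficit > 0:
        dormant_idx = np.flatnonzero(~active)
        revive = rng.choice(dormant_idx, size=deficit, replace=False)
        theta[revive] = 0.0
        active[revive] = True
    # Effective weights: w_k = s_k * theta_k for active connections, else 0.
    w = np.where(active, sign * theta, 0.0)
    return theta, w

# Usage: 10 potential connections, budget K = 4 (first 4 start active).
theta = -np.abs(rng.standard_normal(10))      # all dormant
theta[:4] = np.abs(rng.standard_normal(4))    # exactly K active
sign = rng.choice([-1.0, 1.0], size=10)
grad = rng.standard_normal(10)
theta, w = deep_r_step(theta, sign, grad, K=4)
```

Because the update never activates a dormant connection and regrowth exactly restores the deficit, the active count after every step is exactly $K$.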
1.3 Bayesian Formulation and Theoretical Guarantees
DEEP R is grounded in a Bayesian framework: parameters and connectivity are treated as a sample from a posterior

$$p^*(\theta, c) \propto p(X, Y \mid \theta, c)\, p(\theta)\, p(c),$$

enforcing:
- an $\ell_1$ (Laplace) prior $p(\theta_k) \propto e^{-\alpha |\theta_k|}$ on each $\theta_k$,
- a uniform prior on all binary masks $c \in \{0, 1\}^M$ such that $\sum_k c_k = K$,
- the constraint that $c_k = 0$ implies $w_k = 0$.
Two convergence results are proven:
- Soft-DEEP R samples from the tempered posterior without a hard budget (Theorem 1).
- DEEP R (with hard $K$-budget): the joint Markov chain over $(\theta, c)$ is shown to have a unique invariant distribution exactly matching the constrained Bayesian posterior (Theorem 2).
1.4 Algorithmic Summary (Pseudocode)
- For each $k$ with $\theta_k \ge 0$, update $\theta_k$ as above
- If $\theta_k < 0$, deactivate (set $w_k = 0$)
- While the number of active connections is $< K$: select a dormant $k'$ uniformly at random, set $\theta_{k'} = 0$, and activate it
1.5 Enforcing and Preserving Exact Sparsity
Sparsity is preserved by an implicit mask: $c_k = 1$ iff $\theta_k \ge 0$, so $w_k = c_k\, s_k\, \theta_k$. Any time $\theta_k$ crosses zero, the connection immediately becomes dormant and a randomly chosen dormant connection is reactivated, strictly maintaining $\sum_k c_k = K$.
1.6 Hyperparameterization
- $\eta$ (learning rate): e.g., 0.05 for MNIST; the Adam optimizer for the TIMIT LSTM
- $\alpha$ ($\ell_1$-regularization coefficient)
- $T$ (“temperature”): $T \to 0$ is almost deterministic; $T > 0$ maintains Bayesian exploration
- $K$: enforced sparsity budget

Tuning $\alpha$ and $K$ for the desired sparsity and hardware constraints is standard. $\eta$ is robust; $T$ can be annealed or held constant.
1.7 Empirical Results and Comparison
Extensive experiments show DEEP R matches or outperforms post-hoc pruning, $\ell_1$-shrinkage, and fixed-mask approaches at strict sparsity budgets, with particularly strong performance in the highly sparse regime and for recurrent networks. Key results (see (Bellec et al., 2017), table below):
| Method | MNIST (1%) | MNIST (10%) | CIFAR-10 (5%) | CIFAR-10 (20%) | TIMIT LSTM (10%) | TIMIT LSTM (20%) |
|---|---|---|---|---|---|---|
| Fully connected | 98.2% | 98.2% | 86.5% | 86.5% | 28.3% | 28.3% |
| Post-hoc pruning | 96.1% | 97.5% | 84.0% | 86.0% | 29.0% | 28.5% |
| $\ell_1$-shrinkage | 95.8% | 97.2% | 83.8% | 85.8% | 29.3% | 28.9% |
| Fixed random mask | 90.2% | 96.0% | 80.1% | 85.5% | 30.1% | 28.9% |
| DEEP R | 96.3% | 97.8% | 84.1% | 86.3% | 27.9% | 28.4% |
Significant findings:
- Only DEEP R and soft-DEEP R maintain performance as the connectivity budget $K$ decreases toward extreme sparsity; pruning and shrinkage degrade sharply
- For LSTM recurrent networks, DEEP R avoids large error jumps observed in pruning
- Continual rewiring supports online re-tasking and feature transfer across tasks (Bellec et al., 2017)
2. Deep R-Learning and Dueling Deep R-Networks in Reinforcement Learning
A separate line of research refers to “Deep R-network” or “dueling deep R-network” (DDR), combining R-learning with deep neural function approximators in average-reward reinforcement learning (Xu et al., 2021).
2.1 MDP and Average-Reward Problem Structure
The framework targets continuing, undiscounted MDPs $(\mathcal{S}, \mathcal{A}, P, r)$:
- State: Summary of Age-of-Information (AoI) at sensors and users
- Action: Sensor activation, subject to a constraint on the number of sensors that may update simultaneously
- Reward: Negative of weighted AoI and energy cost
- Objective: Maximize the long-run average reward $\rho^\pi = \lim_{T \to \infty} \frac{1}{T}\, \mathbb{E}\big[\sum_{t=1}^{T} r(s_t, a_t)\big]$
2.2 R-Learning Objective and Update Equations
For the undiscounted average-reward setting, R-learning uses the differential action-value:

$$R^\pi(s, a) = \mathbb{E}\left[\sum_{t=0}^{\infty} \big( r(s_t, a_t) - \rho^\pi \big) \,\Big|\, s_0 = s,\ a_0 = a \right]$$

The temporal-difference update is:

$$R(s, a) \leftarrow R(s, a) + \beta \left( r - \rho + \max_{a'} R(s', a') - R(s, a) \right)$$

$\rho$ is a running estimate of the average reward (the gain), updated via minibatch-accumulated TD-errors.
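The underlying tabular update (Schwartz's R-learning, which the deep variant approximates with a network) can be sketched as follows. The names `beta`, `kappa`, and the dictionary-based table are assumptions for this illustration, and for simplicity the average-reward estimate is updated from every TD-error rather than only on greedy actions, as in the classical rule.

```python
# Tabular R-learning sketch; parameter names and the simplified rho update
# are assumptions for illustration, not a specific published implementation.
def r_learning_update(R, rho, s, a, r, s_next, actions, beta=0.1, kappa=0.01):
    """Differential TD update for average-reward RL.
    R: dict mapping (state, action) -> differential action-value."""
    best_next = max(R.get((s_next, b), 0.0) for b in actions)
    td_error = r - rho + best_next - R.get((s, a), 0.0)
    R[(s, a)] = R.get((s, a), 0.0) + beta * td_error
    rho = rho + kappa * td_error  # running average-reward (gain) estimate
    return R, rho

# Usage: a single update starting from an empty table.
R, rho = {}, 0.0
R, rho = r_learning_update(R, rho, s=0, a=1, r=1.0, s_next=0, actions=[0, 1])
```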
2.3 Dueling Deep R-Network Structure
Function approximation is handled as in dueling DQNs: two separate streams learn a state-value $V(s; \theta_V)$ and per-action advantages $A(s, a; \theta_A)$, combined (with mean-centered advantages, as in standard dueling networks) as

$$R(s, a; \theta) = V(s; \theta_V) + A(s, a; \theta_A) - \frac{1}{|\mathcal{A}|} \sum_{a'} A(s, a'; \theta_A),$$

which stabilizes and accelerates learning.
A target network and experience replay buffer are maintained for stable deep RL training.
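The value-advantage aggregation itself is a one-liner. A minimal NumPy sketch, assuming mean-centered advantages as in standard dueling networks (`dueling_r_values` is a hypothetical name standing in for the combination step after the two network heads):

```python
# Dueling aggregation sketch (NumPy); the two head outputs are given as
# plain arrays here rather than produced by an actual network.
import numpy as np

def dueling_r_values(value, advantages):
    """Combine a scalar state-value and per-action advantages into
    differential action-values, centering advantages for identifiability."""
    return value + (advantages - advantages.mean())

adv = np.array([1.0, 3.0, 2.0])
r_vals = dueling_r_values(value=5.0, advantages=adv)
# Greedy action is the argmax over the combined values.
greedy = int(np.argmax(r_vals))
```

Centering the advantages removes the degree of freedom that would otherwise let constant shifts move between the two streams.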
2.4 Pseudocode Outline
Key steps:
- Initialize the experience replay buffer, network parameters $\theta = (\theta_V, \theta_A)$, target network $\theta^-$, and average-reward estimate $\rho$
- For each step $t$:
  - With probability $\epsilon$, choose a random action; otherwise choose $a = \arg\max_a R(s, a; \theta)$
  - Execute the action; observe the next state and compute the reward (negative cost)
  - Store the transition in the replay buffer
  - Once the buffer is populated, sample a minibatch, compute the TD-errors, update the average-reward estimate $\rho$, and perform gradient descent on the MSE loss
  - Update the target network periodically
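The action-selection and replay steps above can be sketched as follows. This is a hedged skeleton, not the authors' implementation: `Transition`, `select_action`, and `store_and_sample` are assumed names, and the dueling network's forward pass is abstracted as an `r_values` callable.

```python
# Skeleton of the epsilon-greedy / replay-buffer machinery in the DDR loop;
# all names are illustrative placeholders, not the paper's code.
import random
from collections import deque, namedtuple

Transition = namedtuple("Transition", "s a r s_next")
buffer = deque(maxlen=10_000)  # experience replay buffer

def select_action(state, actions, r_values, epsilon, rng):
    """Epsilon-greedy over differential action-values R(s, a)."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    return max(actions, key=lambda a: r_values(state, a))

def store_and_sample(transition, batch_size, rng):
    """Append a transition; return a minibatch once the buffer is populated."""
    buffer.append(transition)
    if len(buffer) < batch_size:
        return None
    return rng.sample(list(buffer), batch_size)
```

Each sampled minibatch would then feed the TD-error computation and the average-reward update from Section 2.2.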
2.5 Addressing High Dimensionality and Unknown Dynamics
DDR addresses exponential state-action complexity via:
- Deep neural generalization
- Experience replay buffer
- Target network stabilization
- Dueling value-advantage decomposition
- Model-free R-learning
2.6 Empirical Performance on IoT Status Updates
On the status update optimization task (8 sensors, up to 48 users), DDR outperforms DR-DSU, dueling DQNs, vanilla DQNs, and heuristic/random policies. It achieves higher mean average reward and faster convergence. Sample table (N=24 users) (Xu et al., 2021):
| Algorithm | Mean Avg. Reward | Std. Dev. |
|---|---|---|
| DDR-DSU | -36.59 | 0.19 |
| DR-DSU | -36.67 | 0.17 |
| DDQ-DSU | -38.38 | 0.34 |
| DQ-DSU | -38.35 | 0.30 |
DDR-based policies are robust to state-action space explosion and unknown environment dynamics.
3. Comparative Summary and Nomenclature
Despite similar names, DEEP R for sparse neural network training (Bellec et al., 2017) and deep R-learning for RL (Xu et al., 2021) are unrelated algorithmic paradigms:
- DEEP R (Bellec et al., 2017): Enforces strict network sparsity during supervised training via continual stochastic rewiring and Bayesian posterior sampling.
- Dueling Deep R-Network (DDR): Solves average-reward reinforcement learning via differential TD updates, deep function approximation, and dueling architecture.
Both algorithms are strongly supported by theoretical and empirical analysis in their respective domains, but they should not be conflated: beyond stochastic optimization and the letter "R" in their names, they share no methodology. Each addresses a different class of modern machine-learning optimization problems.