DEEP R Algorithm (Sparse Training & RL)

Updated 28 January 2026
  • The name "DEEP R algorithm" refers to two unrelated frameworks: sparse neural network training through continual rewiring (Deep Rewiring), and average-reward reinforcement learning via dueling deep R-networks.
  • In sparse training, the algorithm maintains a fixed number of active connections using Bayesian methods and stochastic updates to reinforce task-relevant links.
  • The reinforcement learning variant employs differential TD updates and a dueling architecture to robustly approximate average rewards in high-dimensional environments.

The term "DEEP R algorithm" encompasses two unrelated frameworks in machine learning and optimization: (1) Deep Rewiring (DEEP R) for sparse neural network training, and (2) Deep R-learning (particularly in the “dueling deep R-network”) for average-reward reinforcement learning. Each represents a distinct methodology, application domain, and theoretical foundation.

1. Deep Rewiring (DEEP R) for Sparse Neural Network Training

DEEP R, introduced by Bellec, Salaj, Subramoney, et al., addresses the challenge of training deep networks under strict connectivity constraints. The algorithm enforces a fixed, exact bound on the number of active connections, maintaining high performance at extreme sparsity without requiring a dense model at any point during training (Bellec et al., 2017).

1.1 Model Definition and Parametrization

A neural network is parameterized by $\theta = (\theta_1, \dots, \theta_M)$, one parameter per potential connection, each with a fixed sign $s_k \in \{-1, +1\}$. The actual connection weight is defined as:

$$w_k = \begin{cases} s_k\,\theta_k, & \theta_k \geq 0 \quad \text{(active)} \\ 0, & \theta_k < 0 \quad \text{(dormant)} \end{cases}$$

A strict budget $K$ is enforced: exactly $K$ connections are active (nonzero) throughout training.
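To make the parametrization concrete, here is a minimal pure-Python sketch of the sign-times-magnitude weight mapping and the active-connection count; the function names are ours, not from the paper's code:

```python
# DEEP R parametrization (illustrative sketch, not the authors' code):
# each potential connection k has a parameter theta_k and a fixed sign s_k;
# the effective weight is s_k * theta_k when theta_k >= 0 (active), else 0.

def effective_weights(theta, sign):
    """Map parameters to connection weights; dormant connections carry 0."""
    return [s * t if t >= 0 else 0.0 for t, s in zip(theta, sign)]

def num_active(theta):
    """Count active (theta_k >= 0) connections -- this must equal K."""
    return sum(1 for t in theta if t >= 0)

theta = [0.3, -0.1, 0.0, -0.7]
sign = [+1, -1, +1, +1]
print(effective_weights(theta, sign))  # [0.3, 0.0, 0.0, 0.0]
print(num_active(theta))               # 2
```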

1.2 Training Procedure

At each iteration:

  1. Gradient and Noise Update on Active Connections: For every active $k$, perform the update:

$$\theta_k \leftarrow \theta_k - \eta \frac{\partial L(\theta)}{\partial \theta_k} - \eta \alpha + \sqrt{2 \eta T}\, \nu_k$$

where:

  • $L(\theta)$: regularized loss (e.g., cross-entropy plus the penalty $\alpha\|\theta\|_1$)
  • $\eta$: learning rate
  • $\alpha$: $\ell_1$-regularization coefficient
  • $T$: temperature controlling the noise magnitude
  • $\nu_k \sim \mathcal{N}(0, 1)$, drawn independently

  2. Deactivate Dormant Connections: If $\theta_k < 0$ after the update, set $w_k = 0$; the connection becomes dormant.
  3. Re-activate to Maintain Exact Sparsity: If the active count drops below $K$, uniformly sample dormant indices $k'$ and set $\theta_{k'} \gets 0$ until exactly $K$ connections are active.

This delete-then-regrow operation instantly adapts the network topology, targeting task-relevant connectivity.
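The three steps above can be sketched as a single pure-Python update. This is a simplified illustration under our own naming; a real implementation operates on tensors, but the delete-then-regrow logic is the same:

```python
import math
import random

def deep_r_step(theta, grad, K, eta=0.05, alpha=1e-4, T=1e-3, rng=random):
    """One DEEP R iteration (illustrative sketch, not the authors' code):
    (1) noisy gradient step on active parameters, (2) implicit deactivation
    of parameters that crossed zero, (3) random regrowth back to exactly K."""
    noise_scale = math.sqrt(2 * eta * T)
    for k in range(len(theta)):
        if theta[k] >= 0:  # only active connections receive updates
            theta[k] += -eta * grad[k] - eta * alpha + noise_scale * rng.gauss(0, 1)
    # Parameters now below zero are dormant: weight 0, no further gradients.
    dormant = [k for k in range(len(theta)) if theta[k] < 0]
    n_active = len(theta) - len(dormant)
    rng.shuffle(dormant)  # uniform sampling of reactivation candidates
    while n_active < K and dormant:
        theta[dormant.pop()] = 0.0  # reactivated connections start at zero
        n_active += 1
    return theta

rng = random.Random(0)
theta = [0.1, 0.2] + [-0.5] * 8          # 2 active out of 10 potential
theta = deep_r_step(theta, [0.0] * 10, K=4, rng=rng)
print(sum(1 for t in theta if t >= 0))   # 4 -- the budget is restored
```

Note that updates can only deactivate connections; the regrowth loop is the sole mechanism that restores the count to $K$.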

1.3 Bayesian Formulation and Theoretical Guarantees

DEEP R is grounded in a Bayesian framework:

$$p^*(\theta, c \mid D) \propto p(Y^* \mid X, \theta)\, p(\theta)\, \mathcal{C}(\theta, c)\, p_\mathcal{C}(c)$$

enforcing:

  • an $\ell_1$-prior on $\theta$,
  • a uniform prior on all binary masks $c$ such that $\sum_k c_k = K$,
  • a constraint $\mathcal{C}(\theta, c)$ that $c_k = 0$ implies $\theta_k < 0$.

Two Markov-chain results are proven:

  • Soft-DEEP R samples from a tempered posterior without a hard budget (Theorem 1).
  • DEEP R (with the hard $K$-budget): the joint Markov chain over $(\theta, c)$ is shown to have a unique invariant distribution exactly matching the constrained Bayesian posterior (Theorem 2).

1.4 Algorithmic Summary (Pseudocode)

Initialize $\theta_k \sim \mathcal{U}(\theta_{\min}, 0)$ for all $k$; activate $K$ connections (set their $\theta_k \geq 0$).

Repeat:

  • For each $k$ with $\theta_k \geq 0$, apply the update above
  • If $\theta_k < 0$, deactivate the connection
  • While the number of active connections is below $K$, select a dormant $k'$, set $\theta_{k'} \gets 0$, and activate it

1.5 Enforcing and Preserving Exact Sparsity

Sparsity is preserved by an implicit mask: $c_k = 1 \iff \theta_k \geq 0$. Whenever $\theta_k$ crosses zero, the connection immediately becomes dormant and another dormant connection is randomly reactivated, strictly maintaining $\sum_k c_k = K$.

1.6 Hyperparameterization

  • $\eta$ (learning rate): e.g., 0.05 for MNIST; Adam with $10^{-2}$ for the TIMIT LSTM
  • $\alpha$ ($\ell_1$-regularization coefficient)
  • $T$ (temperature): $T = 0$ is almost deterministic; $T > 0$ maintains Bayesian exploration
  • $K$: enforced sparsity budget

Tuning $\alpha$ and $K$ for the desired sparsity and hardware constraints is standard. $T$ is robust and can be annealed or held constant.

1.7 Empirical Results and Comparison

Extensive experiments show DEEP R matches or outperforms post-hoc pruning, $\ell_1$-shrinkage, and fixed-mask approaches at strict sparsity budgets, with particularly strong performance in the highly sparse regime and for recurrent networks. Key results (Bellec et al., 2017; table below):

| Method | MNIST (1%) | MNIST (10%) | CIFAR-10 (5%) | CIFAR-10 (20%) | TIMIT LSTM (10%) | TIMIT LSTM (20%) |
|---|---|---|---|---|---|---|
| Fully connected | 98.2% | 98.2% | 86.5% | 86.5% | 28.3% | 28.3% |
| Post-hoc pruning | 96.1% | 97.5% | 84.0% | 86.0% | 29.0% | 28.5% |
| $\ell_1$-shrinkage | 95.8% | 97.2% | 83.8% | 85.8% | 29.3% | 28.9% |
| Fixed random mask | 90.2% | 96.0% | 80.1% | 85.5% | 30.1% | 28.9% |
| DEEP R | 96.3% | 97.8% | 84.1% | 86.3% | 27.9% | 28.4% |

(MNIST and CIFAR-10 columns report accuracy; TIMIT LSTM columns report error rate, so lower is better.)

Significant findings:

  • Only DEEP R and soft-DEEP R maintain performance as $K$ decreases to extreme sparsity; pruning and shrinkage fail
  • For LSTM recurrent networks, DEEP R avoids large error jumps observed in pruning
  • Continual rewiring supports online re-tasking and feature transfer across tasks (Bellec et al., 2017)

2. Deep R-Learning and Dueling Deep R-Networks in Reinforcement Learning

A separate line of research refers to “Deep R-network” or “dueling deep R-network” (DDR), combining R-learning with deep neural function approximators in average-reward reinforcement learning (Xu et al., 2021).

2.1 MDP and Average-Reward Problem Structure

The framework targets continuing, undiscounted MDPs $(\mathbb{S}, \mathbb{A}, \mathsf{Pr}, U)$:

  • State: summary of the Age-of-Information (AoI) at sensors and users
  • Action: sensor activation, subject to a constraint that at most $M$ sensors update simultaneously
  • Reward: negative of the weighted AoI and energy cost:

$$U(\mathcal{S}, \mathbf{A}) = -(\beta_1 C_\Delta + \beta_2 C_E)$$

  • Objective: maximize $\rho_\pi = \lim_{T \to \infty} \frac{1}{T}\, \mathbb{E}\!\left[\sum_{t=1}^{T} U(\mathcal{S}(t), \pi(\mathcal{S}(t)))\right]$

2.2 R-Learning Objective and Update Equations

For the undiscounted average-reward setting, R-learning uses the differential action-value:

$$R_\pi(\mathcal{S}, \mathbf{A}) = \mathbb{E}_\pi\!\left[\sum_{l=0}^{\infty} \big(U(\mathcal{S}_{t+l}, \mathbf{A}_{t+l}) - \rho_\pi\big) \,\Big|\, \mathcal{S}_t = \mathcal{S},\, \mathbf{A}_t = \mathbf{A}\right]$$

The temporal-difference update is:

$$\delta = U - \rho + \max_{\mathbf{A}'} R(\mathcal{S}', \mathbf{A}') - R(\mathcal{S}, \mathbf{A})$$

$$R(\mathcal{S}, \mathbf{A}) \gets R(\mathcal{S}, \mathbf{A}) + \alpha \delta$$

$\rho$ is a running estimate of the average reward, updated via minibatch-accumulated TD errors.
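A tabular version of these updates can be sketched as follows. This is a minimal illustration with our own names and step sizes, not the paper's deep-network implementation, and it updates $\rho$ per transition rather than per minibatch:

```python
def r_learning_update(R, rho, s, a, u, s_next, actions, alpha=0.1, beta=0.01):
    """One tabular R-learning step (illustrative sketch): differential TD
    update of the action-value table R and the average-reward estimate rho."""
    best_next = max(R.get((s_next, a2), 0.0) for a2 in actions)
    delta = u - rho + best_next - R.get((s, a), 0.0)  # differential TD error
    R[(s, a)] = R.get((s, a), 0.0) + alpha * delta    # value update
    rho = rho + beta * delta                          # average-reward update
    return R, rho

R, rho = {}, 0.0
R, rho = r_learning_update(R, rho, s=0, a=0, u=1.0, s_next=1, actions=[0, 1])
print(R[(0, 0)], rho)  # 0.1 0.01
```

In DDR the same TD error drives both a gradient step on the network weights and the accumulated update of $\rho$.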

2.3 Dueling Deep R-Network Structure

Function approximation is handled as in dueling DQNs:

  • $R(\mathcal{S}, \mathbf{A}; \theta_1, \theta_2) = V(\mathcal{S}; \theta_1) + \left[G(\mathcal{S}, \mathbf{A}; \theta_2) - \frac{1}{|\mathbb{A}|}\sum_{\mathbf{A}'} G(\mathcal{S}, \mathbf{A}'; \theta_2)\right]$
  • Two separate streams learn value and advantage, stabilizing and accelerating learning

A target network and experience replay buffer are maintained for stable deep RL training.
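The mean-subtracted aggregation can be isolated in a few lines (a sketch with our own names; in practice the value and advantage streams are neural networks sharing earlier layers):

```python
def dueling_r_values(value, advantages):
    """Combine a scalar state value V(s) with per-action advantages G(s, a),
    subtracting the mean advantage so the decomposition is identifiable
    (illustrative sketch of the dueling aggregation)."""
    mean_adv = sum(advantages) / len(advantages)
    return [value + g - mean_adv for g in advantages]

print(dueling_r_values(1.0, [2.0, 0.0]))  # [2.0, 0.0]
```

Subtracting the mean forces the advantage stream to be zero-mean, removing the ambiguity of shifting a constant between the value and advantage streams.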

2.4 Pseudocode Outline

Key steps:

  • Initialize the experience replay buffer, the network parameters ($\theta_1$, $\theta_2$), the target network, and the average-reward estimate $\bar{U}$
  • For each step $t$:

    1. With probability $\epsilon$, choose a random action; otherwise choose the action maximizing $R$
    2. Execute the action, observe the next state and cost, and compute $U_t = -C_t$
    3. Store the transition in the replay buffer
    4. Once the buffer is populated, sample a minibatch, compute TD errors for the sampled transitions, update the average-reward estimate, and perform gradient descent on the MSE loss
    5. Periodically update the target network
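The action-selection and replay-buffer pieces of the loop above can be sketched as follows (names are ours; the epsilon value and buffer size are illustrative assumptions):

```python
import random
from collections import deque

def select_action(r_values, epsilon, rng=random):
    """Epsilon-greedy choice over the R-network's action values (sketch)."""
    if rng.random() < epsilon:
        return rng.randrange(len(r_values))
    return max(range(len(r_values)), key=lambda a: r_values[a])

replay = deque(maxlen=10_000)             # stores (s, a, u, s_next) transitions
replay.append(((0, 3), 1, -2.0, (1, 1)))  # step 3: store a transition
print(select_action([-38.0, -36.5, -37.2], epsilon=0.0))  # 1 (greedy argmax)
```

A bounded `deque` gives the standard fixed-capacity replay behavior: once full, the oldest transitions are discarded as new ones arrive.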

2.5 Addressing High Dimensionality and Unknown Dynamics

DDR addresses exponential state-action complexity via:

  • Deep neural generalization
  • Experience replay
  • Target-network stabilization
  • Dueling value-advantage decomposition
  • Model-free R-learning

2.6 Empirical Performance on IoT Status Updates

On the status update optimization task (8 sensors, up to 48 users), DDR outperforms DR-DSU, dueling DQNs, vanilla DQNs, and heuristic/random policies, achieving a higher mean average reward and faster convergence. Sample results ($N = 24$ users) (Xu et al., 2021):

| Algorithm | Mean Avg. Reward | Std. Dev. |
|---|---|---|
| DDR-DSU | -36.59 | 0.19 |
| DR-DSU | -36.67 | 0.17 |
| DDQ-DSU | -38.38 | 0.34 |
| DQ-DSU | -38.35 | 0.30 |

DDR-based policies are robust to state-action space explosion and unknown environment dynamics.

3. Comparative Summary and Nomenclature

Despite similar names, DEEP R for sparse neural network training (Bellec et al., 2017) and deep R-learning for RL (Xu et al., 2021) are unrelated algorithmic paradigms:

  • DEEP R (Bellec et al., 2017): Enforces strict network sparsity during supervised training via continual stochastic rewiring and Bayesian posterior sampling.
  • Dueling Deep R-Network (DDR): Solves average-reward reinforcement learning via differential TD updates, deep function approximation, and dueling architecture.

Both algorithms are well supported by theoretical and empirical analysis in their respective domains, but they should not be conflated: beyond the use of stochastic optimization and the letter "R" in their names, they share no methodological overlap. Each addresses a different class of modern machine learning optimization problems.
