NoisyNet-Dueling: Adaptive Exploration in Dueling DQNs
- The paper demonstrates that replacing ε-greedy with learned parametric noise in dueling DQN architecture significantly enhances exploration and overall agent performance.
- It employs noisy linear layers in both the value and advantage streams, using either independent or factorized Gaussian noise, to enable gradient-based adaptation of the exploration.
- Empirical results on Atari games show median human-normalized scores rising from 132% to 172% and substantial gains in specific game benchmarks.
NoisyNet-Dueling is a variant of the dueling deep Q-network architecture (dueling DQN) in which learned, parametric noise is incorporated into the weights and biases of the final, fully connected layers of the value and advantage streams. This approach, introduced by Fortunato et al. (Fortunato et al., 2017), replaces conventional stochastic exploration mechanisms such as ε-greedy action selection with network-internal stochasticity, guided by parameters adapted through gradient descent. Empirical results on the Arcade Learning Environment demonstrate substantial improvements in exploration efficiency and overall agent performance compared to deterministic dueling DQN baselines.
1. Dueling DQN Architecture Overview
The standard dueling DQN architecture is characterized by a convolutional feature extractor producing an embedding $\phi(s)$ of the input state $s$. The Q-value computation is decomposed into two streams:
- The value stream computes a scalar $V(\phi(s))$.
- The advantage stream computes a vector $A(\phi(s), \cdot) \in \mathbb{R}^{|\mathcal{A}|}$, where $|\mathcal{A}|$ is the number of discrete actions.
These are combined to yield Q-values:
$$Q(s, a) = V(\phi(s)) + A(\phi(s), a) - \frac{1}{|\mathcal{A}|} \sum_{b} A(\phi(s), b).$$
This separation allows the model to estimate the state value and action-specific advantages independently, improving learning stability and representation capacity.
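As a quick illustration of this aggregation, the sketch below combines the two streams; the function name and tensor shapes are illustrative assumptions rather than reference code from the paper.

```python
import torch

def dueling_q_values(value: torch.Tensor, advantage: torch.Tensor) -> torch.Tensor:
    """Combine a (batch, 1) value stream and a (batch, num_actions) advantage
    stream into Q-values with the mean-subtracted dueling aggregation."""
    # Q(s, a) = V(s) + A(s, a) - (1/|A|) * sum_b A(s, b)
    return value + advantage - advantage.mean(dim=1, keepdim=True)
```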
2. Noisy Linear Layers: Parametric Noise Injection
NoisyNet-Dueling replaces the conventional final linear layers in both streams with noisy linear layers. Each such layer parameterizes its weights and biases as
$$w = \mu^{w} + \sigma^{w} \odot \varepsilon^{w}, \qquad b = \mu^{b} + \sigma^{b} \odot \varepsilon^{b},$$
so that the layer output is
$$y = \big(\mu^{w} + \sigma^{w} \odot \varepsilon^{w}\big)\, x + \mu^{b} + \sigma^{b} \odot \varepsilon^{b},$$
where $\mu^{w}, \sigma^{w}, \mu^{b}$, and $\sigma^{b}$ are trainable parameters. The noise variables $\varepsilon^{w}, \varepsilon^{b}$ are zero-mean random variables sampled afresh at each forward pass. Two noise families are considered (a code sketch of both variants follows the list):
- Independent Gaussian noise: each entry $\varepsilon^{w}_{i,j}, \varepsilon^{b}_{j} \sim \mathcal{N}(0, 1)$, i.i.d.
- Factorized Gaussian noise: for computational efficiency, draw $\varepsilon_{i} \sim \mathcal{N}(0, 1)$ for each of the $p$ inputs and $\varepsilon_{j} \sim \mathcal{N}(0, 1)$ for each of the $q$ outputs, and set $\varepsilon^{w}_{i,j} = f(\varepsilon_{i})\, f(\varepsilon_{j})$ and $\varepsilon^{b}_{j} = f(\varepsilon_{j})$, with $f(x) = \operatorname{sgn}(x)\sqrt{|x|}$.
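A minimal PyTorch sketch of such a noisy linear layer, supporting both noise families, is shown below. The class and method names (`NoisyLinear`, `reset_noise`) and the factorized-style initialisation are illustrative assumptions, not reference code from the paper.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class NoisyLinear(nn.Module):
    """Sketch of a linear layer with learned parametric noise:
    w = mu_w + sigma_w * eps_w and b = mu_b + sigma_b * eps_b."""

    def __init__(self, in_features, out_features, sigma_0=0.5, factorized=True):
        super().__init__()
        self.in_features, self.out_features = in_features, out_features
        self.factorized = factorized
        self.weight_mu = nn.Parameter(torch.empty(out_features, in_features))
        self.weight_sigma = nn.Parameter(torch.empty(out_features, in_features))
        self.bias_mu = nn.Parameter(torch.empty(out_features))
        self.bias_sigma = nn.Parameter(torch.empty(out_features))
        # Noise lives in buffers so it is not treated as a learnable parameter.
        self.register_buffer("weight_eps", torch.zeros(out_features, in_features))
        self.register_buffer("bias_eps", torch.zeros(out_features))
        # Initialisation following the factorized-noise scheme in Section 5:
        # mu ~ U[-1/sqrt(p), 1/sqrt(p)], sigma = sigma_0 / sqrt(p), p = in_features.
        bound = 1.0 / math.sqrt(in_features)
        for mu in (self.weight_mu, self.bias_mu):
            nn.init.uniform_(mu, -bound, bound)
        for sigma in (self.weight_sigma, self.bias_sigma):
            nn.init.constant_(sigma, sigma_0 / math.sqrt(in_features))
        self.reset_noise()

    @staticmethod
    def _scale(x):
        # f(x) = sgn(x) * sqrt(|x|); used only by the factorized variant.
        return x.sign() * x.abs().sqrt()

    def reset_noise(self):
        """Resample eps; called once per forward pass / network usage."""
        if self.factorized:
            eps_in = self._scale(torch.randn(self.in_features))
            eps_out = self._scale(torch.randn(self.out_features))
            self.weight_eps.copy_(torch.outer(eps_out, eps_in))
            self.bias_eps.copy_(eps_out)
        else:
            # Independent Gaussian noise: one i.i.d. N(0, 1) sample per weight and bias.
            self.weight_eps.normal_()
            self.bias_eps.normal_()

    def forward(self, x):
        weight = self.weight_mu + self.weight_sigma * self.weight_eps
        bias = self.bias_mu + self.bias_sigma * self.bias_eps
        return F.linear(x, weight, bias)
```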
3. Gradient-Based Learning of Noise Parameters
The noisy parameters are written $\theta = \mu + \Sigma \odot \varepsilon$, and the full set of learnable parameters is denoted $\zeta = (\mu, \Sigma)$. The reinforcement learning objective is the expected loss under the noise distribution, $\bar{L}(\zeta) = \mathbb{E}_{\varepsilon}\big[L(\mu + \Sigma \odot \varepsilon)\big]$. Gradients propagate through both $\mu$ and $\Sigma$ (by the chain rule):
$$\nabla_{\zeta}\,\bar{L}(\zeta) = \mathbb{E}_{\varepsilon}\big[\nabla_{\mu,\Sigma}\, L(\mu + \Sigma \odot \varepsilon)\big].$$
In practice, a single Monte Carlo sample of $\varepsilon$ is used per gradient step.
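As a sketch of this single-sample estimator, assuming the `NoisyLinear` layer above and a placeholder regression loss in place of the actual TD objective:

```python
import torch

layer = NoisyLinear(in_features=8, out_features=4)   # from the sketch above
x, target = torch.randn(32, 8), torch.randn(32, 4)   # placeholder batch

layer.reset_noise()                        # one Monte Carlo noise sample epsilon
loss = (layer(x) - target).pow(2).mean()   # stands in for the RL loss L(mu + sigma*eps)
loss.backward()

# Gradients reach both mu and sigma through theta = mu + sigma * eps,
# so the scale of the exploration noise is itself learned.
print(layer.weight_mu.grad.norm(), layer.weight_sigma.grad.norm())
```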
4. NoisyNet-Dueling Training Mechanics
Both the value and advantage streams terminate in noisy linear layers. At each time step, independent noise samples are drawn for the online network (used for Q-value computation and action selection) and for the target network. The forward computation is as follows (see the sketch after this list):
- Compute the embedding $\phi(s)$ via the convolutional trunk.
- For the value head:
  - Draw $\varepsilon^{V}$, compute $V\big(\phi(s);\, \mu^{V} + \sigma^{V} \odot \varepsilon^{V}\big)$.
- For the advantage head:
  - Draw $\varepsilon^{A}$, compute $A\big(\phi(s), \cdot\,;\, \mu^{A} + \sigma^{A} \odot \varepsilon^{A}\big)$.
- Combine to obtain $Q(s, a)$ as in the standard dueling formulation.
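A minimal sketch of the resulting dueling head, reusing the `NoisyLinear` sketch above; the hidden width and class name are assumptions.

```python
import torch
import torch.nn as nn

class NoisyDuelingHead(nn.Module):
    """Dueling head whose fully connected layers are noisy; exploration comes
    from resampling the layer noise rather than from an epsilon-greedy policy."""

    def __init__(self, embed_dim, num_actions, hidden=512):
        super().__init__()
        self.value = nn.Sequential(
            NoisyLinear(embed_dim, hidden), nn.ReLU(), NoisyLinear(hidden, 1))
        self.advantage = nn.Sequential(
            NoisyLinear(embed_dim, hidden), nn.ReLU(), NoisyLinear(hidden, num_actions))

    def reset_noise(self):
        for m in self.modules():
            if isinstance(m, NoisyLinear):
                m.reset_noise()

    def forward(self, phi):
        v, a = self.value(phi), self.advantage(phi)
        return v + a - a.mean(dim=1, keepdim=True)

# Greedy action selection under freshly sampled noise (no epsilon-greedy):
# head.reset_noise(); action = head(phi).argmax(dim=1)
```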
There is no ε-greedy policy; instead, exploration arises from the stochasticity in the weights: the agent acts greedily with respect to the noisy Q-values, $a = \arg\max_{b} Q(s, b; \zeta, \varepsilon)$. The loss for training incorporates the noisy target:
$$\bar{L}(\zeta) = \mathbb{E}\Big[\big(y - Q(s, a; \zeta, \varepsilon)\big)^{2}\Big], \qquad y = r + \gamma\, Q\big(s', b^{*}; \zeta^{-}, \varepsilon'\big),$$
where $b^{*} = \arg\max_{b} Q(s', b; \zeta, \varepsilon'')$, $\zeta^{-}$ denotes the target-network parameters, and $\varepsilon, \varepsilon', \varepsilon''$ are independent noise samples for each network usage.
Other elements (experience replay, double-DQN updates, and target refresh) remain as in Wang et al. (2016).
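Under these conventions, one loss evaluation might look like the following sketch. The online and target networks are assumed to expose a `reset_noise()` method (e.g. the head above attached to a convolutional trunk), and the replay-buffer batch layout is a placeholder.

```python
import torch
import torch.nn.functional as F

def noisy_double_dqn_loss(online, target, batch, gamma=0.99):
    """One double-DQN loss evaluation with independent noise per network usage."""
    s, a, r, s_next, done = batch   # placeholder replay-buffer tensors

    online.reset_noise()            # noise sample for the online Q-estimate
    q = online(s).gather(1, a.unsqueeze(1)).squeeze(1)

    with torch.no_grad():
        online.reset_noise()        # independent noise for the argmax action selection
        best_next = online(s_next).argmax(dim=1, keepdim=True)
        target.reset_noise()        # independent noise for the target evaluation
        q_next = target(s_next).gather(1, best_next).squeeze(1)
        y = r + gamma * (1.0 - done) * q_next

    # Squared TD error as in the objective above (a Huber loss is also common).
    return F.mse_loss(q, y)
```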
5. Empirical Performance and Hyper-Parameters
On the suite of 57 Atari games, NoisyNet-Dueling demonstrates substantial empirical gains. Hyper-parameters for the noisy layers are as follows:
- $\mu$ initialized uniformly: $\mu_{i,j} \sim \mathcal{U}\big[-\tfrac{1}{\sqrt{p}}, \tfrac{1}{\sqrt{p}}\big]$, where $p$ is the layer's input dimension.
- $\sigma$ initialized to a constant: $\sigma_{i,j} = \tfrac{\sigma_{0}}{\sqrt{p}}$, with $\sigma_{0} = 0.5$ for factorized noise.
Other RL settings (optimizer, learning rate) are unchanged from the original dueling DQN. Results:
- Median human-normalized score increased from 132% to 172% (a relative improvement of roughly 30%).
- Mean score increased from 524% to 633%.
- NoisyNet-Dueling outperformed the vanilla dueling DQN in approximately 40 of the 57 games, with improvements of 200–1000% in individual games such as Beam Rider and Asterix.
- An ablation comparing independent and factorized noise showed no loss in performance for the computationally cheaper factorized variant.
6. Advantages and Limitations
Notable advantages of NoisyNet-Dueling include:
- Automatic adaptation of exploration: the learned $\sigma$ parameters tune the scale of the exploratory noise, eliminating hand-crafted ε schedules or entropy regularization.
- Contextual exploration: State-dependent noise yields richer exploratory behavior than uniform randomization.
- Low computational overhead: The approach only doubles the parameter count in noisy layers, with negligible sampling cost when using factorized noise.
- General applicability: the methodology is compatible with any architecture trained by stochastic gradient descent, including DQN, dueling DQN, A3C, DDPG, and C51.
Limitations are as follows:
- No formal relationship to Bayesian posterior estimation: the noise adapts to minimize the RL loss rather than to model epistemic uncertainty.
- Noise scale parameters can collapse to zero, making the layer deterministic—an undesirable local minimum in principle, though not empirically universal.
- Increased gradient variance from noise sampling can slow convergence somewhat, but this is offset by the substantial benefits to exploration and final performance on Atari benchmarks.
7. Related Research and Context
NoisyNet-Dueling is built upon and evaluated in the context of multiple foundational works, including the dueling network architecture by Wang et al. (2016) and the canonical DQN by Mnih et al. (2015). The innovation of learned parametric noise applies broadly, extending directly to a range of deep RL algorithms, as demonstrated in the reference study. This methodology reflects a broader trend toward adaptive, data-driven mechanisms for exploration, supplanting heuristic procedures such as ε-greedy action selection and entropy regularization (Fortunato et al., 2017).