
ITDQN: Imitation-Based Triple Deep Q-Network

Updated 28 December 2025
  • The paper introduces a tri-network framework that combines online, target, and mediator Q-networks with an elite imitation mechanism to enhance learning.
  • ITDQN improves sample efficiency and reduces overestimation bias, achieving higher weed recognition and data collection rates in simulated smart agriculture.
  • The approach enables rapid policy convergence and robust performance under partial observability and UAV battery constraints using value-ensemble strategies.

The Imitation-Based Triple Deep Q-Network (ITDQN) is a multi-agent reinforcement learning (MARL) algorithm that extends Double Deep Q-Networks (DDQN) by introducing a third, mediator Q-network and an elite imitation mechanism. ITDQN addresses trajectory planning challenges for unmanned aerial vehicles (UAVs) in smart agriculture, particularly under partial observability, stochastic environments, and battery limitations. Its architecture integrates value-ensemble techniques with parameter-level policy mimicry, improving sample efficiency, exploration, and policy stability compared to conventional DQN and DDQN approaches (Mao et al., 21 Dec 2025).

1. Formal Problem Specification

The ITDQN framework formulates the UAV trajectory problem as a Markov decision process (MDP) on an $N \times N$ agricultural grid. Multiple UAVs ($n_{\rm UAV}$ agents) operate in discrete episodes of length $T_{\max}$.

  • State Space $\mathcal S$: Each UAV $i$ at time $t$ receives a partial state,

$$s_{i,t} = \big(x_{i,t},\, y_{i,t},\, \text{weed-FoV}_{i,t},\, \rho_{i,t},\, d^{\rm sensor}_{i,t},\, \theta^{\rm sensor}_{i,t},\, d^{\rm UAV}_{i,t}\big)$$

where $(x_{i,t}, y_{i,t})$ is the spatial location; $\text{weed-FoV}_{i,t}$ is a $3 \times 3$ grid of weed data; $\rho_{i,t}$ is the weed density; $d^{\rm sensor}_{i,t}$ and $\theta^{\rm sensor}_{i,t}$ give the distance and direction to the nearest sensor; and $d^{\rm UAV}_{i,t}$ gives the inter-UAV distance. Partial observability emerges from the limited camera field-of-view and randomized weed/sensor placement.

  • Action Space $\mathcal A$: Eight discrete headings: $\{\text{N, NE, E, SE, S, SW, W, NW}\}$.
  • Reward Function:

$$r_{i,t} = -P_{\rm out} - P_{\rm bat} - P_{\rm clo} + I_{\rm weed} + I_{\rm data} + I_{\rm exploit} + I_{\rm explore} + b$$

with penalties for boundary violations ($P_{\rm out}$), battery exhaustion ($P_{\rm bat}$), and unsafe inter-UAV proximity ($P_{\rm clo}$); incentives for successful weed detection ($I_{\rm weed}$) and sensor data collection ($I_{\rm data}$); exploitation/exploration bonuses ($I_{\rm exploit}$, $I_{\rm explore}$); and a constant bias term $b$ (a code sketch of these components follows below).
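
A minimal sketch of how the action set and this reward could be assembled in code. The penalty and incentive magnitudes, the bias value, and the trigger flags (`out_of_bounds`, `in_dense_region`, and so on) are illustrative assumptions, not the paper's settings.

```python
# Illustrative sketch of the per-UAV reward terms described above.
# All constants and flag semantics are assumptions for demonstration only.
def step_reward(out_of_bounds: bool, battery_empty: bool, too_close: bool,
                weeds_detected: int, data_collected: bool,
                in_dense_region: bool, visited_new_cell: bool,
                bias: float = 0.1) -> float:
    r = bias                                 # constant bias term b
    r -= 1.0 if out_of_bounds else 0.0       # P_out: boundary violation penalty
    r -= 1.0 if battery_empty else 0.0       # P_bat: battery exhaustion penalty
    r -= 0.5 if too_close else 0.0           # P_clo: unsafe inter-UAV proximity
    r += 0.2 * weeds_detected                # I_weed: weed detection incentive
    r += 0.5 if data_collected else 0.0      # I_data: sensor data collection incentive
    r += 0.1 if in_dense_region else 0.0     # I_exploit: exploitation bonus (assumed trigger)
    r += 0.1 if visited_new_cell else 0.0    # I_explore: exploration bonus (assumed trigger)
    return r

# The eight discrete headings, mapped here to (dx, dy) grid moves
# under an assumed coordinate convention.
ACTIONS = {"N": (0, 1), "NE": (1, 1), "E": (1, 0), "SE": (1, -1),
           "S": (0, -1), "SW": (-1, -1), "W": (-1, 0), "NW": (-1, 1)}
```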

2. ITDQN Architecture and Mechanisms

ITDQN maintains three Q-value networks per agent, each estimating values over the discrete action set:

  • $Q_1$ (Primary/Online network): Parameterized by $\theta$.
  • $Q_m$ (Mediator network): Parameterized by $\theta_m$.
  • $Q_2$ (Target network): Parameterized by $\theta'$.

Action Selection: The agent forms a Gaussian-distributed Q-value,

$$Q_{\rm online}(s,a) \sim \mathcal N\!\left(\tfrac{1}{2}\big(Q_1(s,a;\theta) + Q_m(s,a;\theta_m)\big),\; \sigma^2\right)$$

and acts greedily: $a^* = \arg\max_a Q_{\rm online}(s,a)$.
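
A minimal sketch of this value-ensemble action selection, assuming `q1` and `qm` are callables that map a state to a length-8 vector of Q-values, and using illustrative values for `sigma2` and the $\varepsilon$-greedy rate; the network definitions themselves are omitted.

```python
import numpy as np

def select_action(state, q1, qm, sigma2: float = 0.01,
                  epsilon: float = 0.05, rng=np.random.default_rng()):
    """Epsilon-greedy action selection over the Gaussian value ensemble.

    q1, qm: assumed callables returning per-action Q-value vectors for a state.
    """
    n_actions = 8
    if rng.random() < epsilon:                       # exploration branch
        return int(rng.integers(n_actions))
    mean = 0.5 * (q1(state) + qm(state))             # ensemble mean of Q1 and Qm
    q_online = rng.normal(mean, np.sqrt(sigma2))     # sample Q_online ~ N(mean, sigma^2)
    return int(np.argmax(q_online))                  # greedy action on the sampled values
```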

Target Value Computation: For a transition $(s_k, a_k, r_k, s'_k, d_k)$,

$$Q_{\rm target}(s,a) \sim \mathcal N\!\left(\tfrac{1}{2}\big(Q_2(s,a;\theta') + Q_m(s,a;\theta_m)\big),\; \sigma^2\right)$$

$$y_k = r_k + (1 - d_k)\,\gamma\, Q_{\rm target}\big(s'_k,\, \arg\max_a Q_{\rm online}(s'_k, a)\big)$$

Training Objective: Minimize the mean-squared TD error:

$$L_{\rm TD}(\theta) = \frac{1}{B}\sum_{k=1}^B \big(y_k - Q_1(s_k, a_k; \theta)\big)^2$$
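
The target construction and TD loss can be sketched as follows, again treating `q1`, `q2`, and `qm` as callables returning per-action Q-value vectors; the gradient step on $\theta$ is left to the chosen deep-learning framework.

```python
import numpy as np

def td_targets_and_loss(batch, q1, q2, qm, gamma: float = 0.99,
                        sigma2: float = 0.01, rng=np.random.default_rng()):
    """Compute y_k and the mean-squared TD error for a minibatch.

    batch: iterable of transitions (s, a, r, s_next, done).
    q1, q2, qm: assumed callables returning per-action Q-value vectors.
    """
    losses = []
    for s, a, r, s_next, done in batch:
        mean_online = 0.5 * (q1(s_next) + qm(s_next))
        a_star = int(np.argmax(rng.normal(mean_online, np.sqrt(sigma2))))  # argmax_a Q_online(s', a)
        mean_target = 0.5 * (q2(s_next) + qm(s_next))
        q_target = rng.normal(mean_target, np.sqrt(sigma2))[a_star]        # Q_target(s', a*)
        y = r + (1.0 - float(done)) * gamma * q_target                     # bootstrapped target y_k
        losses.append((y - q1(s)[a]) ** 2)                                 # squared TD error
    return float(np.mean(losses))
```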

Parameter Updates: Auxiliary networks are softly updated:

$$\theta' \leftarrow \tau\,\theta + (1-\tau)\,\theta', \qquad \theta_m \leftarrow \tau\,\theta + (1-\tau)\,\theta_m$$

with $\tau \ll 1$.
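
A minimal sketch of this soft (Polyak) update, representing each network's parameters as a name-to-array dictionary; in practice these would be the weight tensors of the three networks.

```python
def soft_update(online: dict, target: dict, mediator: dict, tau: float = 0.005):
    """Polyak-average the online parameters into the target and mediator copies.

    online, target, mediator: name -> array parameter dicts (assumed representation).
    """
    for name, w in online.items():
        target[name] = tau * w + (1.0 - tau) * target[name]       # theta'  <- tau*theta + (1-tau)*theta'
        mediator[name] = tau * w + (1.0 - tau) * mediator[name]   # theta_m <- tau*theta + (1-tau)*theta_m
```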

Stability: Experience replay, soft target updates, $\varepsilon$-greedy exploration, and the mediator Q-network reduce training bias and variance compared to standard DQN/DDQN approaches.

3. Imitation-Based Elite Mechanism

To mitigate inefficient exploration, ITDQN employs an elite imitation protocol at intervals of $\delta$ episodes:

  • Every agent $i$ executes its current policy for one trajectory, recording $(s_{i,t}, a_{i,t}, r_{i,t})$ for $t = 0$ to $K$.
  • The mean $\mu_i$ and variance $\sigma^2_i$ of the cumulative rewards determine the elite score:

$$E\mathcal R_i = \beta_1 \mu_i + \beta_2 \sigma^2_i$$

  • The elite agent $j$ with the highest $E\mathcal R_j$ is identified.
  • Each agent $i \ne j$ updates its parameters via

$$\theta_i \leftarrow (1 - \vartheta)\,\theta_i + \vartheta\,\theta_j$$

where $\vartheta$ is a decaying imitation strength and $\delta$ is an increasing patience factor.

This procedure performs parameter-level imitation without a demonstration buffer or explicit auxiliary losses, effectively minimizing $\|\theta_i - \theta_j\|^2$ during imitation episodes. The mechanism accelerates early-stage learning and supports rapid transfer of successful behaviors across agents.
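
A sketch of one elite-imitation round under these definitions; the values of $\beta_1$, $\beta_2$, and $\vartheta$ used here are illustrative defaults, not the paper's settings.

```python
import numpy as np

def elite_imitation(agent_params: list, episode_returns: list,
                    beta1: float = 1.0, beta2: float = 0.5,
                    vartheta: float = 0.3) -> int:
    """One elite-imitation round over all agents.

    agent_params: per-agent name -> array parameter dicts (theta_i).
    episode_returns: per-agent arrays of recent cumulative rewards.
    Returns the index of the elite agent.
    """
    # Elite score ER_i = beta1 * mean + beta2 * variance of recent returns.
    scores = [beta1 * np.mean(r) + beta2 * np.var(r) for r in episode_returns]
    elite = int(np.argmax(scores))
    for i, params in enumerate(agent_params):
        if i == elite:
            continue                                   # elite keeps its own parameters
        for name, w in params.items():
            # theta_i <- (1 - vartheta) * theta_i + vartheta * theta_elite
            params[name] = (1.0 - vartheta) * w + vartheta * agent_params[elite][name]
    return elite
```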

4. Training Regimen and Stability Considerations

The network initialization, episodic training loop, and stability enhancements are as follows:

  • All three networks ($\theta$, $\theta'$, $\theta_m$) and the experience replay buffer $\mathcal D$ are initialized.
  • For each episode $\lambda$, if $\lambda \bmod \delta = 0$, the elite imitation mechanism is triggered.
  • At each step $t$, agents select actions via an $\varepsilon$-greedy policy over $Q_{\rm online}$, execute them, log the transitions, and update the online network on sampled minibatches.
  • Online learning is interleaved with soft updates to $Q_2$ and $Q_m$.
  • $\varepsilon$ (exploration rate) and $\vartheta$ (imitation strength) are progressively annealed; a condensed loop sketch follows this list.
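
The condensed loop sketch referenced above reuses the helper sketches from earlier sections; the `env`, `agents`, and `replay` objects and their methods are hypothetical stand-ins for the paper's simulator and agent implementation, and the annealing schedules are illustrative.

```python
import numpy as np

def train(env, agents, replay, episodes: int = 1000, delta: int = 20,
          batch_size: int = 64):
    """Structural outline of the ITDQN training loop (not the paper's code).

    Assumed interfaces: env.reset()/env.step(actions); agent objects exposing
    q1/qm callables, an update(minibatch) TD step on Q1, and theta /
    theta_target / theta_mediator parameter dicts; replay with add/sample.
    select_action, soft_update, elite_imitation are the sketches given earlier.
    """
    epsilon, vartheta = 1.0, 0.5
    recent_returns = [[] for _ in agents]
    for episode in range(episodes):
        # Elite imitation every delta episodes (delta may itself grow over time).
        if episode > 0 and episode % delta == 0:
            elite_imitation([ag.theta for ag in agents],
                            [np.asarray(r) for r in recent_returns],
                            vartheta=vartheta)
        states, done, returns = env.reset(), False, [0.0] * len(agents)
        while not done:
            actions = [select_action(s, ag.q1, ag.qm, epsilon=epsilon)
                       for s, ag in zip(states, agents)]
            next_states, rewards, done = env.step(actions)
            replay.add(states, actions, rewards, next_states, done)
            for i, ag in enumerate(agents):
                ag.update(replay.sample(batch_size))            # minibatch TD update of Q1
                soft_update(ag.theta, ag.theta_target, ag.theta_mediator)
                returns[i] += rewards[i]
            states = next_states
        for i, g in enumerate(returns):
            recent_returns[i].append(g)
        epsilon = max(0.05, epsilon * 0.995)                    # anneal exploration
        vartheta = max(0.05, vartheta * 0.99)                   # anneal imitation strength
```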

Stability and convergence are further reinforced by the mediator network, which mitigates overestimation bias typical of single- or double-network Q-learning. Ablation results indicate that dropping either the mediator or imitation components degrades convergence speed and final reward.

5. Empirical Performance and Evaluation Metrics

Evaluations in simulated and real-world smart farming environments demonstrate the quantitative and qualitative benefits of ITDQN over DDQN and single-network DQN. Relevant results include:

| Metric | ITDQN | DDQN | Gain |
|---|---|---|---|
| Weed recognition rate | 79.43% | 75.00% | +4.43 pp |
| Data collection rate | 98.05% | 91.11% | +6.94 pp |
| Inference overhead (ms per action) | 6.5 | 6.2 | +0.3 ms |

  • Convergence: ITDQN exhibits faster convergence and reduced episodic reward fluctuations.
  • Energy & Time: Increases are negligible and remain within operational constraints.
  • Ablations: Both mediator and imitation modules are essential for peak performance.

These results suggest that mediator-augmented value estimation and elite imitation act synergistically to enhance MARL under partial observability (Mao et al., 21 Dec 2025).

6. Scope, Generalizability, and Limitations

The ITDQN paradigm (triple Q-networks with elite parameter-level imitation) is broadly applicable to multi-agent, partially observable domains such as robotic coverage and collaborative autonomous driving. Documented limitations include:

  • Additional computation and memory overhead introduced by the third Q-network and the imitation cycle.
  • Parameter-level imitation may lack granularity for nuanced behavior alignment.
  • The use of a fixed Gaussian variance $\sigma^2$ may not generalize to all domains.

Potential extensions include incorporating explicit imitation losses (such as $L_{\rm im} = \mathbb{E}_{(s,a)\sim D_e}\,\|Q(s,a;\theta) - Q(s,a;\theta^*)\|^2$), integration with distributional RL or policy-gradient frameworks, and extensive real-world validation with dynamic and heterogeneous UAV swarms. A sketch of such a loss appears below.
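
As an illustration of the first extension, such an auxiliary loss could be computed roughly as below. This is a hypothetical sketch of the proposed direction, not part of ITDQN as published; the batch format and callables are assumptions.

```python
import numpy as np

def imitation_loss(elite_batch, q_online, q_elite):
    """Hypothetical auxiliary imitation loss L_im over elite transitions D_e.

    elite_batch: state-action pairs drawn from elite-agent trajectories.
    q_online, q_elite: assumed callables returning per-action Q-value vectors
    for the learner (theta) and the frozen elite network (theta*), respectively.
    """
    errors = [(q_online(s)[a] - q_elite(s)[a]) ** 2 for s, a in elite_batch]
    return float(np.mean(errors))  # would be weighted and added to L_TD
```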

7. Summary and Novel Contributions

ITDQN introduces two principal innovations:

  1. An elite imitation mechanism for parameter-level one-to-many policy transfer within MARL, bypassing explicit demonstration buffers and auxiliary losses.
  2. A mediator Q-network that integrates with both online and target networks, lowering overestimation bias and stabilizing value propagation.

Experimental evidence, drawn from both simulation and live indoor/outdoor UAV tests, confirms superior policy convergence, stability, and task effectiveness compared to conventional DDQN approaches (Mao et al., 21 Dec 2025).

References

Mao et al., 21 Dec 2025.
