
ITDQN: Imitation-Based Triple Deep Q-Network

Updated 28 December 2025
  • The paper introduces a tri-network framework that combines online, target, and mediator Q-networks with an elite imitation mechanism to enhance learning.
  • ITDQN improves sample efficiency and reduces overestimation bias, achieving higher weed recognition and data collection rates in simulated smart agriculture.
  • The approach enables rapid policy convergence and robust performance under partial observability and UAV battery constraints using value-ensemble strategies.

The Imitation-Based Triple Deep Q-Network (ITDQN) is a multi-agent reinforcement learning (MARL) algorithm that extends Double Deep Q-Networks (DDQN) by introducing a third, mediator Q-network and an elite imitation mechanism. ITDQN addresses trajectory planning challenges for unmanned aerial vehicles (UAVs) in smart agriculture, particularly under partial observability, stochastic environments, and battery limitations. Its architecture integrates value-ensemble techniques with parameter-level policy mimicry, improving sample efficiency, exploration, and policy stability compared to conventional DQN and DDQN approaches (Mao et al., 21 Dec 2025).

1. Formal Problem Specification

The ITDQN framework formulates the UAV trajectory problem as a Markov decision process (MDP) on an $N \times N$ agricultural grid. Multiple UAVs ($n_{\rm UAV}$ agents) operate in discrete episodes of length $T_{\max}$.

  • State Space $\mathcal S$: Each UAV $i$ at time $t$ receives a partial state,

$$s_{i,t} = \big(x_{i,t},\, y_{i,t},\, \text{weed-FoV}_{i,t},\, \rho_{i,t},\, d^{\rm sensor}_{i,t},\, \theta^{\rm sensor}_{i,t},\, d^{\rm UAV}_{i,t}\big)$$

where $(x_{i,t}, y_{i,t})$ is the spatial location; $\text{weed-FoV}_{i,t}$ is a $3 \times 3$ grid of weed data; $\rho_{i,t}$ is the weed density; $d^{\rm sensor}_{i,t}$ and $\theta^{\rm sensor}_{i,t}$ give the distance and direction to the nearest sensor; and $d^{\rm UAV}_{i,t}$ gives the inter-UAV distance. Partial observability emerges from the limited camera field-of-view and randomized weed/sensor placement.

  • Action Space $\mathcal A$: Eight discrete headings: $\{\text{N, NE, E, SE, S, SW, W, NW}\}$.
  • Reward Function:

$$r_{i,t} = -P_{\rm out} - P_{\rm bat} - P_{\rm clo} + I_{\rm weed} + I_{\rm data} + I_{\rm exploit} + I_{\rm explore} + b$$

with penalties for boundary violations ($P_{\rm out}$), battery exhaustion ($P_{\rm bat}$), and unsafe inter-UAV proximity ($P_{\rm clo}$); incentives for successful weed detection ($I_{\rm weed}$) and sensor data collection ($I_{\rm data}$); exploitation/exploration bonuses ($I_{\rm exploit}$, $I_{\rm explore}$); and a constant bias term $b$ (a code sketch of these components follows below).
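
A minimal sketch of how the action set and this reward could be assembled in code. The penalty and incentive magnitudes, the bias value, and the trigger flags (`out_of_bounds`, `in_dense_region`, and so on) are illustrative assumptions, not the paper's settings.

```python
# Illustrative sketch of the per-UAV reward terms described above.
# All constants and flag semantics are assumptions for demonstration only.
def step_reward(out_of_bounds: bool, battery_empty: bool, too_close: bool,
                weeds_detected: int, data_collected: bool,
                in_dense_region: bool, visited_new_cell: bool,
                bias: float = 0.1) -> float:
    r = bias                                 # constant bias term b
    r -= 1.0 if out_of_bounds else 0.0       # P_out: boundary violation penalty
    r -= 1.0 if battery_empty else 0.0       # P_bat: battery exhaustion penalty
    r -= 0.5 if too_close else 0.0           # P_clo: unsafe inter-UAV proximity
    r += 0.2 * weeds_detected                # I_weed: weed detection incentive
    r += 0.5 if data_collected else 0.0      # I_data: sensor data collection incentive
    r += 0.1 if in_dense_region else 0.0     # I_exploit: exploitation bonus (assumed trigger)
    r += 0.1 if visited_new_cell else 0.0    # I_explore: exploration bonus (assumed trigger)
    return r

# The eight discrete headings, mapped here to (dx, dy) grid moves
# under an assumed coordinate convention.
ACTIONS = {"N": (0, 1), "NE": (1, 1), "E": (1, 0), "SE": (1, -1),
           "S": (0, -1), "SW": (-1, -1), "W": (-1, 0), "NW": (-1, 1)}
```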

2. ITDQN Architecture and Mechanisms

ITDQN maintains three Q-value networks per agent, each estimating values over the discrete action set:

  • $Q_1$ (Primary/Online network): Parameterized by $\theta$.
  • $Q_m$ (Mediator network): Parameterized by $\theta_m$.
  • $Q_2$ (Target network): Parameterized by $\theta'$.

Action Selection: The agent forms a Gaussian-distributed Q-value,

$$Q_{\rm online}(s,a) \sim \mathcal N\!\left(\tfrac{1}{2}\big(Q_1(s,a;\theta) + Q_m(s,a;\theta_m)\big),\; \sigma^2\right)$$

and acts greedily: $a^* = \arg\max_a Q_{\rm online}(s,a)$.
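
A minimal sketch of this value-ensemble action selection, assuming `q1` and `qm` are callables that map a state to a length-8 vector of Q-values, and using illustrative values for `sigma2` and the $\varepsilon$-greedy rate; the network definitions themselves are omitted.

```python
import numpy as np

def select_action(state, q1, qm, sigma2: float = 0.01,
                  epsilon: float = 0.05, rng=np.random.default_rng()):
    """Epsilon-greedy action selection over the Gaussian value ensemble.

    q1, qm: assumed callables returning per-action Q-value vectors for a state.
    """
    n_actions = 8
    if rng.random() < epsilon:                       # exploration branch
        return int(rng.integers(n_actions))
    mean = 0.5 * (q1(state) + qm(state))             # ensemble mean of Q1 and Qm
    q_online = rng.normal(mean, np.sqrt(sigma2))     # sample Q_online ~ N(mean, sigma^2)
    return int(np.argmax(q_online))                  # greedy action on the sampled values
```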

Target Value Computation: For a transition $(s_k, a_k, r_k, s'_k, d_k)$,

$$Q_{\rm target}(s,a) \sim \mathcal N\!\left(\tfrac{1}{2}\big(Q_2(s,a;\theta') + Q_m(s,a;\theta_m)\big),\; \sigma^2\right)$$

$$y_k = r_k + (1 - d_k)\,\gamma\, Q_{\rm target}\big(s'_k,\, \arg\max_a Q_{\rm online}(s'_k, a)\big)$$

Training Objective: Minimize the mean-squared TD error:

$$L_{\rm TD}(\theta) = \frac{1}{B}\sum_{k=1}^B \big(y_k - Q_1(s_k, a_k; \theta)\big)^2$$
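
The target construction and TD loss can be sketched as follows, again treating `q1`, `q2`, and `qm` as callables returning per-action Q-value vectors; the gradient step on $\theta$ is left to the chosen deep-learning framework.

```python
import numpy as np

def td_targets_and_loss(batch, q1, q2, qm, gamma: float = 0.99,
                        sigma2: float = 0.01, rng=np.random.default_rng()):
    """Compute y_k and the mean-squared TD error for a minibatch.

    batch: iterable of transitions (s, a, r, s_next, done).
    q1, q2, qm: assumed callables returning per-action Q-value vectors.
    """
    losses = []
    for s, a, r, s_next, done in batch:
        mean_online = 0.5 * (q1(s_next) + qm(s_next))
        a_star = int(np.argmax(rng.normal(mean_online, np.sqrt(sigma2))))  # argmax_a Q_online(s', a)
        mean_target = 0.5 * (q2(s_next) + qm(s_next))
        q_target = rng.normal(mean_target, np.sqrt(sigma2))[a_star]        # Q_target(s', a*)
        y = r + (1.0 - float(done)) * gamma * q_target                     # bootstrapped target y_k
        losses.append((y - q1(s)[a]) ** 2)                                 # squared TD error
    return float(np.mean(losses))
```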

Parameter Updates: Auxiliary networks are softly updated:

$$\theta' \leftarrow \tau\,\theta + (1-\tau)\,\theta', \qquad \theta_m \leftarrow \tau\,\theta + (1-\tau)\,\theta_m$$

with $\tau \ll 1$.
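
A minimal sketch of this soft (Polyak) update, representing each network's parameters as a name-to-array dictionary; in practice these would be the weight tensors of the three networks.

```python
def soft_update(online: dict, target: dict, mediator: dict, tau: float = 0.005):
    """Polyak-average the online parameters into the target and mediator copies.

    online, target, mediator: name -> array parameter dicts (assumed representation).
    """
    for name, w in online.items():
        target[name] = tau * w + (1.0 - tau) * target[name]       # theta'  <- tau*theta + (1-tau)*theta'
        mediator[name] = tau * w + (1.0 - tau) * mediator[name]   # theta_m <- tau*theta + (1-tau)*theta_m
```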

Stability: Experience replay, soft target updates, $\varepsilon$-greedy exploration, and the mediator Q-network reduce training bias and variance compared to standard DQN/DDQN approaches.

3. Imitation-Based Elite Mechanism

To mitigate inefficient exploration, ITDQN employs an elite imitation protocol at intervals of $\delta$ episodes:

  • Every agent $i$ executes its current policy for one trajectory, recording $(s_{i,t}, a_{i,t}, r_{i,t})$ for $t = 0$ to $K$.
  • The mean $\mu_i$ and variance $\sigma^2_i$ of the cumulative rewards determine the elite score:

$$E\mathcal R_i = \beta_1 \mu_i + \beta_2 \sigma^2_i$$

  • The elite agent $j$ with the highest $E\mathcal R_j$ is identified.
  • Each agent $i \ne j$ updates its parameters via

$$\theta_i \leftarrow (1 - \vartheta)\,\theta_i + \vartheta\,\theta_j$$

where $\vartheta$ is a decaying imitation strength and $\delta$ is an increasing patience factor.

This procedure performs parameter-level imitation without a demonstration buffer or explicit auxiliary losses, effectively minimizing $\|\theta_i - \theta_j\|^2$ during imitation episodes. The mechanism accelerates early-stage learning and supports rapid transfer of successful behaviors across agents.
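
A sketch of one elite-imitation round under these definitions; the values of $\beta_1$, $\beta_2$, and $\vartheta$ used here are illustrative defaults, not the paper's settings.

```python
import numpy as np

def elite_imitation(agent_params: list, episode_returns: list,
                    beta1: float = 1.0, beta2: float = 0.5,
                    vartheta: float = 0.3) -> int:
    """One elite-imitation round over all agents.

    agent_params: per-agent name -> array parameter dicts (theta_i).
    episode_returns: per-agent arrays of recent cumulative rewards.
    Returns the index of the elite agent.
    """
    # Elite score ER_i = beta1 * mean + beta2 * variance of recent returns.
    scores = [beta1 * np.mean(r) + beta2 * np.var(r) for r in episode_returns]
    elite = int(np.argmax(scores))
    for i, params in enumerate(agent_params):
        if i == elite:
            continue                                   # elite keeps its own parameters
        for name, w in params.items():
            # theta_i <- (1 - vartheta) * theta_i + vartheta * theta_elite
            params[name] = (1.0 - vartheta) * w + vartheta * agent_params[elite][name]
    return elite
```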

4. Training Regimen and Stability Considerations

The network initialization, episodic training loop, and stability enhancements are as follows:

  • All three networks ($\theta$, $\theta'$, $\theta_m$) and the experience replay buffer $\mathcal D$ are initialized.
  • For each episode $\lambda$, if $\lambda \bmod \delta = 0$, the elite imitation mechanism is triggered.
  • At each step $t$, agents select actions via an $\varepsilon$-greedy policy over $Q_{\rm online}$, execute them, log the transitions, and update the online network on sampled minibatches.
  • Online learning is interleaved with soft updates to $Q_2$ and $Q_m$.
  • $\varepsilon$ (exploration rate) and $\vartheta$ (imitation strength) are progressively annealed; a condensed loop sketch follows this list.
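
The condensed loop sketch referenced above reuses the helper sketches from earlier sections; the `env`, `agents`, and `replay` objects and their methods are hypothetical stand-ins for the paper's simulator and agent implementation, and the annealing schedules are illustrative.

```python
import numpy as np

def train(env, agents, replay, episodes: int = 1000, delta: int = 20,
          batch_size: int = 64):
    """Structural outline of the ITDQN training loop (not the paper's code).

    Assumed interfaces: env.reset()/env.step(actions); agent objects exposing
    q1/qm callables, an update(minibatch) TD step on Q1, and theta /
    theta_target / theta_mediator parameter dicts; replay with add/sample.
    select_action, soft_update, elite_imitation are the sketches given earlier.
    """
    epsilon, vartheta = 1.0, 0.5
    recent_returns = [[] for _ in agents]
    for episode in range(episodes):
        # Elite imitation every delta episodes (delta may itself grow over time).
        if episode > 0 and episode % delta == 0:
            elite_imitation([ag.theta for ag in agents],
                            [np.asarray(r) for r in recent_returns],
                            vartheta=vartheta)
        states, done, returns = env.reset(), False, [0.0] * len(agents)
        while not done:
            actions = [select_action(s, ag.q1, ag.qm, epsilon=epsilon)
                       for s, ag in zip(states, agents)]
            next_states, rewards, done = env.step(actions)
            replay.add(states, actions, rewards, next_states, done)
            for i, ag in enumerate(agents):
                ag.update(replay.sample(batch_size))            # minibatch TD update of Q1
                soft_update(ag.theta, ag.theta_target, ag.theta_mediator)
                returns[i] += rewards[i]
            states = next_states
        for i, g in enumerate(returns):
            recent_returns[i].append(g)
        epsilon = max(0.05, epsilon * 0.995)                    # anneal exploration
        vartheta = max(0.05, vartheta * 0.99)                   # anneal imitation strength
```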

Stability and convergence are further reinforced by the mediator network, which mitigates overestimation bias typical of single- or double-network Q-learning. Ablation results indicate that dropping either the mediator or imitation components degrades convergence speed and final reward.

5. Empirical Performance and Evaluation Metrics

Evaluations in simulated and real-world smart farming environments demonstrate the quantitative and qualitative benefits of ITDQN over DDQN and single-network DQN. Relevant results include:

| Metric | ITDQN | DDQN | Gain |
|---|---|---|---|
| Weed recognition rate | 79.43% | 75.00% | +4.43 pp |
| Data collection rate | 98.05% | 91.11% | +6.94 pp |
| Inference overhead (ms per action) | 6.5 | 6.2 | +0.3 ms |

  • Convergence: ITDQN exhibits faster convergence and reduced episodic reward fluctuations.
  • Energy & Time: Increases are negligible and remain within operational constraints.
  • Ablations: Both mediator and imitation modules are essential for peak performance.

These results suggest that mediator-augmented value estimation and elite imitation act synergistically to enhance MARL under partial observability (Mao et al., 21 Dec 2025).

6. Scope, Generalizability, and Limitations

The ITDQN paradigm (triple Q-networks with elite parameter-level imitation) is broadly applicable to multi-agent, partially observable domains such as robotic coverage and collaborative autonomous driving. Documented limitations include:

  • Additional computation and memory overhead introduced by the third Q-network and the imitation cycle.
  • Parameter-level imitation may lack granularity for nuanced behavior alignment.
  • The use of a fixed Gaussian variance $\sigma^2$ may not generalize to all domains.

Potential extensions include incorporating explicit imitation losses (such as $L_{\rm im} = \mathbb{E}_{(s,a)\sim D_e}\,\|Q(s,a;\theta) - Q(s,a;\theta^*)\|^2$), integration with distributional RL or policy-gradient frameworks, and extensive real-world validation with dynamic and heterogeneous UAV swarms. A sketch of such a loss appears below.
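
As an illustration of the first extension, such an auxiliary loss could be computed roughly as below. This is a hypothetical sketch of the proposed direction, not part of ITDQN as published; the batch format and callables are assumptions.

```python
import numpy as np

def imitation_loss(elite_batch, q_online, q_elite):
    """Hypothetical auxiliary imitation loss L_im over elite transitions D_e.

    elite_batch: state-action pairs drawn from elite-agent trajectories.
    q_online, q_elite: assumed callables returning per-action Q-value vectors
    for the learner (theta) and the frozen elite network (theta*), respectively.
    """
    errors = [(q_online(s)[a] - q_elite(s)[a]) ** 2 for s, a in elite_batch]
    return float(np.mean(errors))  # would be weighted and added to L_TD
```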

7. Summary and Novel Contributions

ITDQN introduces two principal innovations:

  1. An elite imitation mechanism for parameter-level one-to-many policy transfer within MARL, bypassing explicit demonstration buffers and auxiliary losses.
  2. A mediator Q-network that integrates with both online and target networks, lowering overestimation bias and stabilizing value propagation.

Experimental evidence, drawn from both simulation and live indoor/outdoor UAV tests, confirms superior policy convergence, stability, and task effectiveness compared to conventional DDQN approaches (Mao et al., 21 Dec 2025).

References

Mao et al., 21 Dec 2025.
