ITDQN: Imitation-Based Triple Deep Q-Network
- The paper introduces a tri-network framework that combines online, target, and mediator Q-networks with an elite imitation mechanism to enhance learning.
- ITDQN improves sample efficiency and reduces overestimation bias, achieving higher weed recognition and data collection rates in simulated smart agriculture.
- The approach enables rapid policy convergence and robust performance under partial observability and UAV battery constraints using value-ensemble strategies.
The Imitation-Based Triple Deep Q-Network (ITDQN) is a multi-agent reinforcement learning (MARL) algorithm that extends Double Deep Q-Networks (DDQN) by introducing a third, mediator Q-network and an elite imitation mechanism. ITDQN addresses trajectory planning challenges for unmanned aerial vehicles (UAVs) in smart agriculture, particularly under partial observability, stochastic environments, and battery limitations. Its architecture integrates value-ensemble techniques with parameter-level policy mimicry, improving sample efficiency, exploration, and policy stability compared to conventional DQN and DDQN approaches (Mao et al., 21 Dec 2025).
1. Formal Problem Specification
The ITDQN framework formulates the UAV trajectory problem as a Markov decision process (MDP) on an agricultural grid. Multiple UAVs ($N$ agents) operate in discrete episodes of length $T$.
- State Space: Each UAV at time $t$ receives a partial state comprising its spatial location, a local grid of weed data, the weed density, the distance and direction to the nearest sensor, and the inter-UAV distance. Partial observability emerges from the limited camera field-of-view and randomized weed/sensor placement.
- Action Space: Eight discrete movement headings.
- Reward Function: A composite per-step reward with penalties for boundary violations, battery exhaustion, and unsafe inter-UAV proximity; incentives for successful weed detection and sensor data collection; exploration/exploitation bonuses; and a constant bias term (a reward-assembly sketch follows this list).
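The following is a minimal sketch of how such a composite reward could be assembled per step; the component names, magnitudes, and the `RewardConfig`/`composite_reward` helpers are illustrative assumptions, not values or code from the paper.

```python
from dataclasses import dataclass

@dataclass
class RewardConfig:
    """Illustrative reward components; all magnitudes are assumptions."""
    boundary_penalty: float = -1.0    # leaving the agricultural grid
    battery_penalty: float = -1.0     # battery exhausted mid-episode
    proximity_penalty: float = -0.5   # unsafe inter-UAV distance
    weed_bonus: float = 1.0           # successful weed detection
    sensor_bonus: float = 1.0         # sensor data collected
    explore_bonus: float = 0.1        # visiting a previously unseen cell
    exploit_bonus: float = 0.1        # revisiting a known high-density region
    bias: float = 0.01                # constant shaping bias

def composite_reward(events: dict, cfg: RewardConfig | None = None) -> float:
    """Sum the active reward components for one UAV at one time step."""
    cfg = cfg or RewardConfig()
    r = cfg.bias
    if events.get("out_of_bounds"):   r += cfg.boundary_penalty
    if events.get("battery_empty"):   r += cfg.battery_penalty
    if events.get("too_close"):       r += cfg.proximity_penalty
    if events.get("weed_detected"):   r += cfg.weed_bonus
    if events.get("data_collected"):  r += cfg.sensor_bonus
    if events.get("new_cell"):        r += cfg.explore_bonus
    if events.get("revisit_hotspot"): r += cfg.exploit_bonus
    return r
```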
2. ITDQN Architecture and Mechanisms
ITDQN maintains three distinct Q-value networks per agent:
- Primary/online network: parameterized by $\theta$.
- Mediator network: parameterized by $\theta^{m}$.
- Target network: parameterized by $\theta^{-}$.
Action Selection: The agent forms a Gaussian-distributed Q-value estimate $\tilde{Q}(s_t, a)$ around the networks' ensemble value with a fixed variance, and acts greedily: $a_t = \arg\max_{a \in \mathcal{A}} \tilde{Q}(s_t, a)$.
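A minimal sketch of this selection rule follows, assuming the three networks' outputs are averaged into the ensemble estimate and perturbed with fixed-variance Gaussian noise; both the averaging rule and the `sigma` value are assumptions, not the paper's exact sampling scheme.

```python
import numpy as np

def select_action(q_online, q_mediator, q_target, state, sigma=0.1, rng=None):
    """Greedy action over a Gaussian-perturbed ensemble Q-estimate.

    q_online/q_mediator/q_target are callables mapping a state to a vector
    of Q-values (one per discrete heading). Averaging the three outputs and
    the fixed noise scale `sigma` are assumptions of this sketch.
    """
    rng = rng or np.random.default_rng()
    q_ens = (q_online(state) + q_mediator(state) + q_target(state)) / 3.0
    q_sample = rng.normal(loc=q_ens, scale=sigma)  # fixed-variance Gaussian draw
    return int(np.argmax(q_sample))                # greedy over sampled values
```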
Target Value Computation: For a transition $(s_t, a_t, r_t, s_{t+1})$, the bootstrapped target $y_t$ is computed by combining value estimates from the three networks for the next state $s_{t+1}$.
Training Objective: Minimize the mean-squared TD error of the online network, $\mathcal{L}(\theta) = \mathbb{E}\big[(y_t - Q(s_t, a_t; \theta))^2\big]$.
Parameter Updates: The auxiliary networks are softly (Polyak) updated toward the online parameters, $\theta^{m} \leftarrow \tau\,\theta + (1-\tau)\,\theta^{m}$ and $\theta^{-} \leftarrow \tau\,\theta + (1-\tau)\,\theta^{-}$, with small $\tau$.
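The sketch below (PyTorch) illustrates one plausible realization of the target computation, TD loss, and soft updates, assuming a DDQN-style split in which the online network selects the bootstrap action and the mediator and target networks jointly evaluate it; the paper's exact combination rule and the value of `tau` may differ.

```python
import torch
import torch.nn.functional as F

def td_loss(q_online, q_mediator, q_target, batch, gamma=0.99):
    """TD loss for one minibatch (states, actions, rewards, next_states, dones).

    Assumption of this sketch: the online network selects the bootstrap action
    and the mediator/target networks' estimates of it are averaged.
    """
    s, a, r, s2, done = batch
    q_sa = q_online(s).gather(1, a.unsqueeze(1)).squeeze(1)        # Q(s_t, a_t; theta)
    with torch.no_grad():
        a_star = q_online(s2).argmax(dim=1, keepdim=True)          # action selection
        q_eval = 0.5 * (q_mediator(s2).gather(1, a_star)
                        + q_target(s2).gather(1, a_star)).squeeze(1)
        y = r + gamma * (1.0 - done) * q_eval                      # bootstrapped target y_t
    return F.mse_loss(q_sa, y)

@torch.no_grad()
def soft_update(src, dst, tau=0.005):
    """Polyak update: dst <- tau * src + (1 - tau) * dst."""
    for p_src, p_dst in zip(src.parameters(), dst.parameters()):
        p_dst.mul_(1.0 - tau).add_(tau * p_src)
```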
Stability: Experience replay, soft target updates, $\epsilon$-greedy exploration, and the mediator Q-network reduce training bias and variance compared to standard DQN/DDQN approaches.
3. Imitation-Based Elite Mechanism
To mitigate inefficient exploration, ITDQN employs an elite imitation protocol at fixed episode intervals:
- Every agent executes its current policy for one trajectory, recording its rewards $r_t$ for $t = 1$ to $T$.
- An elite score is computed for each agent from the mean and variance of its cumulative rewards.
- The agent with the highest elite score is identified as the elite.
- Each agent updates its parameters by interpolating toward the elite agent's parameters, with the step size governed by a decaying imitation strength and an increasing patience factor.
This procedure performs parameter-level imitation without a demonstration buffer or explicit auxiliary losses, effectively minimizing the parameter distance to the elite agent during imitation episodes. The mechanism accelerates early-stage learning and supports rapid transfer of successful behaviors across agents.
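A compact sketch of the imitation step follows; the particular elite score (mean return minus its standard deviation) and the linear interpolation with strength `beta` are assumptions standing in for the paper's exact elite-score and update formulas.

```python
import numpy as np
import torch

@torch.no_grad()
def elite_imitation(q_networks, returns_per_agent, beta=0.5):
    """Parameter-level imitation of the elite agent.

    q_networks: list of online Q-networks (torch.nn.Module), one per UAV.
    returns_per_agent: list of lists of recent cumulative episode rewards.
    The elite score and interpolation strength `beta` are sketch assumptions.
    """
    scores = [np.mean(R) - np.std(R) for R in returns_per_agent]   # favor high, stable returns
    elite = q_networks[int(np.argmax(scores))]
    for net in q_networks:
        if net is elite:
            continue
        for p, p_elite in zip(net.parameters(), elite.parameters()):
            p.mul_(1.0 - beta).add_(beta * p_elite)                # pull toward elite weights
```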
4. Training Regimen and Stability Considerations
The network initialization, episodic training loop, and stability enhancements are as follows:
- All three networks (online $\theta$, mediator $\theta^{m}$, target $\theta^{-}$) and the experience replay buffer are initialized.
- At the start of each episode, the elite imitation mechanism is triggered if the episode index falls on the fixed imitation interval.
- At each step $t$, agents select actions via an $\epsilon$-greedy policy over the ensemble Q-values, execute them, log the transitions, and update the online network on sampled minibatches.
- Online learning is interleaved with soft updates to the mediator and target networks ($\theta^{m}$, $\theta^{-}$).
- The exploration rate $\epsilon$ and the imitation strength are progressively annealed (a training-loop skeleton follows this list).
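The skeleton below ties these steps together in an episodic loop, reusing the helper sketches from earlier sections; `env`, `agents`, and `buffer` are placeholder objects (reset/step, per-agent networks and optimizers, add/sample), and every hyperparameter value is illustrative rather than taken from the paper.

```python
def train(env, agents, buffer, episodes=1000, horizon=200, imitation_interval=50,
          batch_size=64, eps=1.0, eps_decay=0.995, beta=0.5, beta_decay=0.99):
    """Episodic ITDQN-style training skeleton (per-agent replay details elided)."""
    returns_history = [[] for _ in agents]
    for ep in range(episodes):
        # Elite imitation episode at fixed intervals (Section 3).
        if ep > 0 and ep % imitation_interval == 0 and all(returns_history):
            elite_imitation([ag.q_online for ag in agents], returns_history, beta)
            returns_history = [[] for _ in agents]

        states = env.reset()
        ep_return = [0.0 for _ in agents]
        for t in range(horizon):
            # eps-greedy over each agent's (Gaussian-perturbed) ensemble Q-values.
            actions = [ag.act(s, eps) for ag, s in zip(agents, states)]
            next_states, rewards, done = env.step(actions)
            buffer.add(states, actions, rewards, next_states, done)
            states = next_states
            ep_return = [g + r for g, r in zip(ep_return, rewards)]

            if len(buffer) >= batch_size:
                for ag in agents:
                    loss = td_loss(ag.q_online, ag.q_mediator, ag.q_target,
                                   buffer.sample(batch_size))
                    ag.optimizer.zero_grad(); loss.backward(); ag.optimizer.step()
                    soft_update(ag.q_online, ag.q_mediator)   # soft sync of auxiliaries
                    soft_update(ag.q_online, ag.q_target)
            if done:
                break

        for i, g in enumerate(ep_return):
            returns_history[i].append(g)
        eps *= eps_decay      # anneal exploration
        beta *= beta_decay    # anneal imitation strength
```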
Stability and convergence are further reinforced by the mediator network, which mitigates overestimation bias typical of single- or double-network Q-learning. Ablation results indicate that dropping either the mediator or imitation components degrades convergence speed and final reward.
5. Empirical Performance and Evaluation Metrics
Evaluations in simulated and real-world smart farming environments demonstrate the quantitative and qualitative benefits of ITDQN over DDQN and single-network DQN. Relevant results include:
| Metric | ITDQN | DDQN | Δ (ITDQN − DDQN) |
|---|---|---|---|
| Weed Recognition Rate | 79.43% | 75.00% | +4.43 pp |
| Data Collection Rate | 98.05% | 91.11% | +6.94 pp |
| Inference Overhead (per action) | 6.5 ms | 6.2 ms | +0.3 ms |
- Convergence: ITDQN exhibits faster convergence and reduced episodic reward fluctuations.
- Energy & Time: Increases are negligible and remain within operational constraints.
- Ablations: Both mediator and imitation modules are essential for peak performance.
This suggests mediator-augmented value estimation and elite imitation synergistically enhance MARL under partial observability (Mao et al., 21 Dec 2025).
6. Scope, Generalizability, and Limitations
The ITDQN paradigm—triple Q-networks with elite parameter-level imitation—possesses general applicability to multi-agent, partially observable domains such as robot coverage and collaborative autonomous driving. Documented limitations include:
- Additional computation and memory overhead incurred by the third Q-network and the imitation cycle.
- Parameter-level imitation may lack granularity for nuanced behavior alignment.
- The fixed Gaussian variance used during action selection may not generalize to all domains.
Potential extensions include incorporating explicit imitation losses, integration with distributional RL or policy-gradient frameworks, and extensive real-world validation with dynamic and heterogeneous UAV swarms.
7. Summary and Novel Contributions
ITDQN introduces two principal innovations:
- An elite imitation mechanism for parameter-level one-to-many policy transfer within MARL, bypassing explicit demonstration buffers and auxiliary losses.
- A mediator Q-network that integrates with both online and target networks, lowering overestimation bias and stabilizing value propagation.
Experimental evidence, drawn from both simulation and live indoor/outdoor UAV tests, confirms superior policy convergence, stability, and task effectiveness compared to conventional DDQN approaches (Mao et al., 21 Dec 2025).