Intention-aware Hierarchical Diffusion Model

Updated 28 September 2025

The paper introduces IHiD, a model that integrates hierarchical intention analysis with diffusion-based synthesis to generate detailed trajectories.
It employs inverse Q-learning for strategic intent evaluation and a denoising diffusion process to reconstruct coherent, low-level behaviors.
Empirical results demonstrate a notable improvement in anomaly detection and motion planning, with up to a 30.2% increase in F1 score over baselines.

An Intention-aware Hierarchical Diffusion Model (IHiD) is a generative or predictive architecture that integrates hierarchical modeling of agent intentions and corresponding detailed data generation via diffusion processes. IHiD structures trajectory, motion, or generative signal modeling into discrete levels, with high-level intent evaluation (such as goal or subgoal reasoning) guiding lower-level trajectory or behavior synthesis, frequently using denoising diffusion probabilistic models (DDPMs). The aim is to capture both the diversity of agent strategies (“what” agents plan to do) and the rich variation in execution (“how” agents realize these plans), providing enhanced interpretability and improved anomaly detection, forecasting, control, or synthesis performance.

1. Hierarchical Architecture of IHiD

IHiD adopts a two-level hierarchical structure for long-term data modeling, prominently used in trajectory anomaly detection (Wang et al., 21 Sep 2025), motion planning (Chen et al., 5 Jan 2024), generative modeling (Zhang et al., 8 Dec 2024), and expressive robotic motion (Bao et al., 2 Jun 2025). At the top level, agent intentions are discretized as sequences of subgoals or intentions $\phi = \{g_1, g_2, \dots, g_k\}$, representing strategic elements of the trajectory or task. The lower level focuses on the stochastic yet temporally coherent generation of subtrajectories or detailed behaviors $\{\hat{\tau}_1, \dots, \hat{\tau}_k\}$ conditioned on these high-level plans.

This architecture formalizes the overall data as a composition:

$\tau = \bigcup_{i=1}^{k} (\text{subgoal}_i, \text{subtrajectory}_i)$

Hierarchical decomposition is fundamental for dealing with complex, long-horizon behaviors, where both strategic intent and local execution must be modeled and evaluated.
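As a minimal sketch of this two-level composition, the following Python snippet pairs each subgoal with its subtrajectory. The names `Segment`, `compose`, and `subgoal_sequence`, and the 2-D point format, are illustrative assumptions rather than anything specified by the cited papers.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Assumed state format: a 2-D position. The source does not fix one.
Point = Tuple[float, float]


@dataclass
class Segment:
    """One (subgoal_i, subtrajectory_i) pair of the composition."""
    subgoal: Point               # g_i, the high-level intention
    subtrajectory: List[Point]   # tau_i, the low-level execution toward g_i


def compose(segments: List[Segment]) -> List[Point]:
    """Concatenate the subtrajectories in order to recover the full trajectory."""
    full: List[Point] = []
    for seg in segments:
        full.extend(seg.subtrajectory)
    return full


def subgoal_sequence(segments: List[Segment]) -> List[Point]:
    """Extract the high-level plan phi = {g_1, ..., g_k}."""
    return [seg.subgoal for seg in segments]


# Toy data for illustration only.
segments = [
    Segment(subgoal=(1.0, 0.0), subtrajectory=[(0.0, 0.0), (0.5, 0.0), (1.0, 0.0)]),
    Segment(subgoal=(1.0, 1.0), subtrajectory=[(1.0, 0.5), (1.0, 1.0)]),
]
plan = subgoal_sequence(segments)
trajectory = compose(segments)
```

Keeping the plan and the execution in separate fields mirrors the split between strategic and tactical evaluation: the high-level module only needs `plan`, while the low-level module works on one `subtrajectory` at a time.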

2. Modeling High-Level Intention

High-level intention modeling relies on strategic inference frameworks such as inverse Q-learning (IQL) (Wang et al., 21 Sep 2025), which assesses the validity and alignment of subgoal selections:

State–action representation: Subgoals as states ($s$), transitions between subgoals as actions ($a$).
Q-value evaluation: $Q(s, a)$ quantifies the expected value of choosing subgoal $a$ from state $s$, and is trained using the inverse soft Bellman operator:

$\mathcal{T}[Q](s, a) = Q(s, a) - \gamma\, \mathbb{E}_{s' \sim P(s'|s,a)} [V^Q(s')]$

where $V^Q(s) = \log \sum_{a} \exp(Q(s,a))$ is the soft value function and $\gamma$ is the discount factor. Q-values that fall below a chosen threshold indicate anomalous or atypical intent.
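The two formulas above can be sketched for a small tabular case as follows. This is an illustration only: the table layout, the transition dictionary `P`, and the toy numbers are assumptions, and the cited work may use a learned function approximator instead of a table.

```python
import math
from typing import Dict, List, Tuple


def soft_value(q_row: List[float]) -> float:
    """V^Q(s) = log sum_a exp(Q(s, a)), using the max-shift trick for stability."""
    m = max(q_row)
    return m + math.log(sum(math.exp(q - m) for q in q_row))


def inverse_bellman(
    Q: List[List[float]],
    P: Dict[Tuple[int, int], Dict[int, float]],
    s: int,
    a: int,
    gamma: float,
) -> float:
    """T[Q](s, a) = Q(s, a) - gamma * E_{s' ~ P(.|s, a)}[V^Q(s')]."""
    expected_v = sum(p * soft_value(Q[s_next]) for s_next, p in P[(s, a)].items())
    return Q[s][a] - gamma * expected_v


# Toy example (assumed values): two subgoal states, two actions each.
Q = [[0.0, 0.0], [1.0, 1.0]]
# Taking action 1 in state 0 leads to state 1 with probability 1.
P = {(0, 1): {1: 1.0}}

v0 = soft_value(Q[0])
t01 = inverse_bellman(Q, P, s=0, a=1, gamma=0.9)
```

Here `v0` equals `log 2` because both actions in state 0 have a Q-value of zero, and `t01` equals `-0.9 * (1 + log 2)`.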

Other applications, such as expressive robotic motion synthesis, leverage in-context learning with large vision-language models to interpret human intention from multi-modal sensory data and dialogue history (Bao et al., 2 Jun 2025). This module returns structured predictions consisting of intent labels, motion primitives, social context, and confidence scores, which are passed downstream for detailed behavior synthesis.

3. Conditional Diffusion-Based Generation

The low-level synthesis component utilizes diffusion models, frequently of the DDPM class, to generate detailed subtrajectories, action sequences, or sensory data, conditioned on the high-level intentions or subgoal transitions. The process consists of:

Forward process: Gradual noise injection,

$\hat{\tau}^t \sim \mathcal{N}\bigl(\sqrt{\bar{\alpha}^t}\, \hat{\tau}^0,\, (1-\bar{\alpha}^t)I\bigr)$

for each timestep $t$, where $\bar{\alpha}^t$ is the cumulative product of the noise-schedule coefficients up to step $t$.
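A minimal sketch of sampling from this forward process is given below. The linear noise schedule and its endpoint values are assumptions borrowed from common DDPM practice, not settings reported for IHiD, and the subtrajectory is flattened to a list of floats for simplicity.

```python
import math
import random
from typing import List, Tuple


def linear_alpha_bar(num_steps: int, start: float = 1e-4, end: float = 0.02) -> List[float]:
    """Cumulative products alpha_bar^t for an assumed linear noise schedule.

    The per-step noise rates here are schedule coefficients only; they are
    unrelated to any detection threshold used elsewhere in the article.
    """
    alpha_bars: List[float] = []
    prod = 1.0
    for t in range(num_steps):
        rate_t = start + (end - start) * t / max(num_steps - 1, 1)
        prod *= 1.0 - rate_t
        alpha_bars.append(prod)
    return alpha_bars


def q_sample(tau0: List[float], alpha_bar_t: float, rng: random.Random) -> Tuple[List[float], List[float]]:
    """Draw tau^t ~ N(sqrt(alpha_bar_t) * tau^0, (1 - alpha_bar_t) I).

    Returns the noisy sample together with the noise that produced it.
    """
    noise = [rng.gauss(0.0, 1.0) for _ in tau0]
    scale = math.sqrt(alpha_bar_t)
    sigma = math.sqrt(1.0 - alpha_bar_t)
    noisy = [scale * x + sigma * e for x, e in zip(tau0, noise)]
    return noisy, noise


alpha_bars = linear_alpha_bar(10)
rng = random.Random(0)  # fixed seed so the sketch is reproducible
noisy, noise = q_sample([1.0, 2.0], alpha_bars[-1], rng)
```

Returning the noise alongside the noisy sample is convenient because noise-prediction training needs it as the regression target.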

Reverse process: Learned denoising, often using a UNet with transformer blocks and cross-attention on subgoal context,

$c_i = \rho \cdot \text{CrossAttention}(g_i, g_{i+1}, \hat{\tau}^t) + (1-\rho)\cdot \hat{\tau}^t$

where $\rho$ balances context injection.
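The blending rule for $c_i$ can be illustrated with a stripped-down attention layer. This sketch assumes single-head scaled dot-product attention with no learned projections, with the noisy trajectory points as queries and the two subgoals as keys and values. The actual network is described as a UNet with transformer blocks, so this is a simplification for intuition only.

```python
import math
from typing import List

Vector = List[float]


def softmax(xs: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries: List[Vector], context: List[Vector]) -> List[Vector]:
    """Single-head scaled dot-product attention without learned projections (assumed)."""
    d = len(context[0])
    out: List[Vector] = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in context]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, context)) for j in range(d)])
    return out


def inject_context(g_i: Vector, g_next: Vector, tau_t: List[Vector], rho: float) -> List[Vector]:
    """c_i = rho * CrossAttention(g_i, g_{i+1}, tau^t) + (1 - rho) * tau^t."""
    attended = cross_attention(tau_t, [g_i, g_next])
    return [
        [rho * a + (1.0 - rho) * x for a, x in zip(att_row, row)]
        for att_row, row in zip(attended, tau_t)
    ]


tau_t = [[0.0, 0.0], [1.0, 1.0]]
no_context = inject_context([2.0, 3.0], [5.0, 1.0], tau_t, rho=0.0)
full_context = inject_context([2.0, 3.0], [2.0, 3.0], tau_t, rho=1.0)
```

With `rho=0.0` the noisy input passes through unchanged, and with `rho=1.0` the output is driven entirely by the subgoal context, which matches the role of $\rho$ as a balance between the two.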

Noise estimation: The network output $\epsilon_\theta(c_i, t)$ is trained with standard diffusion objectives to predict the noise injected by the forward process, from which the clean data is recovered.
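For reference, a common form of the standard diffusion objective is the simplified DDPM noise-prediction loss. Assuming that form applies here (the source does not spell it out), and writing $\epsilon$ for the Gaussian noise drawn in the forward process, it reads:

$\mathcal{L}(\theta) = \mathbb{E}_{t,\, \hat{\tau}^0,\, \epsilon \sim \mathcal{N}(0, I)} \bigl[\, \|\epsilon - \epsilon_\theta(c_i, t)\|^2 \,\bigr]$

with $c_i$ built from the noisy sample $\hat{\tau}^t = \sqrt{\bar{\alpha}^t}\, \hat{\tau}^0 + \sqrt{1-\bar{\alpha}^t}\, \epsilon$.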

Subgoal transitions with anomalous Q-values bypass the low-level synthesis; otherwise, the observed subtrajectory is reconstructed and compared to assess normality (Wang et al., 21 Sep 2025).

4. Integrated Anomaly Detection Protocol

For long-term trajectory anomaly detection, the IHiD detection protocol is as follows (Wang et al., 21 Sep 2025):

Subgoal segmentation: Segment the trajectory into discrete subgoals.
High-level validation: Apply IQL to each subgoal transition. If $Q(s, a) < \gamma$, where $\gamma$ here denotes a chosen decision threshold rather than the discount factor of Section 2, mark the transition as anomalous.
Low-level synthesis: For valid transitions, reconstruct subtrajectories with the diffusion model using subgoal pair conditioning.
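Reconstruction error: Calculate $E_\Delta = \frac{1}{N} \|\hat{\tau}_i - \hat{\tau}'_i\|^2$ between the observed subtrajectory $\hat{\tau}_i$ and its reconstruction $\hat{\tau}'_i$; if $E_\Delta > \beta$ for a chosen threshold $\beta$, label the subtrajectory as anomalous.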

This joint protocol allows simultaneous detection of strategic (intent-level) and tactical (execution-level) anomalies.
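The control flow of this two-stage protocol can be sketched as below. The Q-function and the reconstruction model are passed in as plain callables; the lookup table and the straight-line "reconstructor" used in the toy run are hypothetical stand-ins for the trained IQL and diffusion modules, and $N$ is assumed to be the number of points in the subtrajectory.

```python
from typing import Callable, List, Tuple

Point = Tuple[float, float]
Transition = Tuple[Point, Point, List[Point]]  # (subgoal s, next subgoal a, observed subtrajectory)


def reconstruction_error(observed: List[Point], reconstructed: List[Point]) -> float:
    """E = (1/N) * ||observed - reconstructed||^2, with N assumed to be the point count."""
    sq = sum((o - r) ** 2 for po, pr in zip(observed, reconstructed) for o, r in zip(po, pr))
    return sq / len(observed)


def detect(
    transitions: List[Transition],
    q_value: Callable[[Point, Point], float],
    reconstruct: Callable[[Point, Point, int], List[Point]],
    q_threshold: float,
    err_threshold: float,
) -> List[str]:
    """Label each transition as 'strategic', 'tactical', or 'normal'."""
    labels: List[str] = []
    for s, a, observed in transitions:
        if q_value(s, a) < q_threshold:
            labels.append("strategic")  # intent-level anomaly: skip low-level synthesis
            continue
        recon = reconstruct(s, a, len(observed))
        err = reconstruction_error(observed, recon)
        labels.append("tactical" if err > err_threshold else "normal")
    return labels


# Hypothetical stand-ins for the trained modules.
q_table = {
    ((0.0, 0.0), (2.0, 0.0)): 1.0,
    ((2.0, 0.0), (2.0, 2.0)): 1.0,
    ((2.0, 2.0), (9.0, 9.0)): -5.0,
}


def straight_line(s: Point, a: Point, n: int) -> List[Point]:
    """Toy reconstructor: evenly spaced points from s to a (requires n >= 2)."""
    return [(s[0] + (a[0] - s[0]) * j / (n - 1), s[1] + (a[1] - s[1]) * j / (n - 1)) for j in range(n)]


transitions = [
    ((0.0, 0.0), (2.0, 0.0), [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]),  # follows the expected path
    ((2.0, 0.0), (2.0, 2.0), [(2.0, 0.0), (4.0, 1.0), (2.0, 2.0)]),  # detour in the middle
    ((2.0, 2.0), (9.0, 9.0), [(2.0, 2.0), (9.0, 9.0)]),              # unusual destination
]
labels = detect(transitions, lambda s, a: q_table[(s, a)], straight_line, q_threshold=0.0, err_threshold=0.5)
```

In this toy run the first transition is labeled normal, the second tactical (the detour yields a reconstruction error of 4/3), and the third strategic (its Q-value falls below the threshold, so no reconstruction is attempted).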

5. Evaluation and Empirical Performance

On real-world datasets—Chengdu taxi and AIS vessel trajectories—IHiD demonstrated robust anomaly detection performance, including up to 30.2% improvement in $F_1$ score relative to state-of-the-art baselines (Wang et al., 21 Sep 2025). The architecture correctly diagnoses:

Strategic anomalies (route switching, abnormal destination selection via low Q-value)
Tactical anomalies (detour, navigation deviations via high reconstruction error)

Ablation studies revealed the necessity of both levels: high-level modules excel at detecting intent shifts; low-level diffusion models capture nuanced behavioral variation. Integration yields average $F_1$ scores up to ~90.7%, confirming comprehensive coverage.
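For readers less familiar with the metric, $F_1$ is the harmonic mean of precision and recall over binary anomaly labels. The snippet below is a generic sketch of that calculation on made-up labels; it does not reproduce any numbers from the cited experiments.

```python
from typing import List


def f1_score(y_true: List[int], y_pred: List[int]) -> float:
    """F1 = 2 * P * R / (P + R) for binary labels, where 1 marks an anomaly."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2.0 * precision * recall / (precision + recall)


# Made-up labels for illustration: 2 true positives, 1 false positive, 1 false negative.
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 1]
score = f1_score(y_true, y_pred)
```

With these labels both precision and recall are 2/3, so `score` is also 2/3.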

6. Broader Applications and Implications

The hierarchical intention-aware paradigm extends beyond anomaly detection:

Motion planning: IHiD architectures are utilized for compositional generalization and jumpy planning in RL and imitation learning (Chen et al., 5 Jan 2024).
Trajectory prediction: Decoupling intention from action via dual diffusion processes allows fast and interpretable forecasting in autonomous driving contexts (Liu et al., 14 Mar 2024, Liu et al., 6 Aug 2025, Liu et al., 10 Aug 2025).
Generative modeling: Hierarchical latent priors and nested diffusion processes enhance the quality, controllability, and efficiency of image and signal synthesis (Zhang et al., 8 Dec 2024).
Human–robot interaction: Structured prompting, fallback behaviors, and iterative intention refinement through vision-language models and diffusion-based motion synthesis facilitate socially appropriate, real-time expressive gestures (Bao et al., 2 Jun 2025).

These IHiD architectures are generalizable to other domains requiring stratified intent reasoning and rich data generation, including anomaly surveillance, multimodal communication, and general-purpose policy synthesis.

7. Significance and Future Trajectories

IHiD demonstrates the utility of hierarchical modeling for combining interpretable intent modeling with powerful generative techniques. The reported empirical results support its effectiveness for anomaly detection, and related hierarchical diffusion work suggests similar benefits for prediction, planning, and synthesis in spatiotemporal and generative settings. A plausible implication is that further advances may integrate alternative strategic reasoning models (beyond IQL), richer context injection mechanisms, and adaptive hierarchy depths for broader applicability. Hierarchical models that capture both intention and execution are well placed to become useful tools in fields that require explainability, data diversity, and efficient inference.