Heatmap Pooling Network (HP-Net)

Updated 7 December 2025

HP-Net is a modular network architecture that integrates spatial scene encoding and dynamic, interaction-aware context for label-efficient trajectory prediction.
It employs three key modules—environment encoder, global dynamics encoder, and trajectory predictor—each optimized with distinct loss functions to capture scene structure and motion.
Reported results show competitive accuracy on ETH/UCY and unsupervised domain adaptation between crowded-scene and robotics settings.

A Heatmap Pooling Network (HP-Net) is a modular network architecture for trajectory prediction that jointly learns spatial scene encodings and dynamic, interaction-aware context, enabling label-efficient transfer to novel environments without supervised trajectory labels. HP-Net, as described in "Learning Structured Representations of Spatial and Interactive Dynamics for Trajectory Prediction in Crowded Scenes" (Davchev et al., 2019), is built upon three submodules—an environment (spatial) encoder, a global dynamics encoder, and a local trajectory predictor—each optimized under complementary objectives. Their structured interaction enables robust prediction across domains, including unsupervised adaptation to new visual contexts using only unlabelled image data.

1. Architectural Components and Data Flow

HP-Net comprises three primary modules:

Environment Encoder (R):
- Input: raw scene frame $I_t \in \mathbb{R}^{H \times W \times 3}$.
- Output: low-dimensional latent representation $l_t \in \mathbb{R}^d$, encoding spatial structure such as static obstacles, walkways, and scene layouts.

Global Dynamics Encoder (D):
- Input: latent code $l_t$ and the set of all current agent positions $S_t = \{s_t^1, \dots, s_t^N\}$, where $s_t^a \in \mathbb{R}^2$.
- Architecture: LSTM followed by a Mixture Density Network (MDN).
- Output: a stochastic prediction of the next scene code $l_{t+1}$ and a summary vector $h_t \in \mathbb{R}^h$ capturing aggregate scene dynamics, such as dominant flows or typical movement patterns.

Trajectory Predictor (B):
- Input: for each agent $a$, the tuple $(s_t^a, l_t, h_t)$.
- Architecture: agent-specific LSTM.
- Output: parameters $(\mu, \Sigma, \rho)$ of a bivariate Gaussian distribution predicting the agent's next position.

The overall system can be represented as:

$I_t \xrightarrow{R} l_t$
$S_t, l_t \xrightarrow{D} h_t$
Per-agent: $s_t^a, l_t, h_t \xrightarrow{B} p(s_{t+1}^a)$

No explicit message-passing or graph aggregation among agents occurs; instead, all agent interactions are summarized globally via $D$, and this context is broadcast to every agent through $h_t$.
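The data flow above can be sketched at the level of tensor shapes. This is a minimal illustration, not the published implementation: the three modules are replaced by untrained random linear maps, the sizes are hypothetical, and mean-pooling the agent positions inside `D` is an assumption (the description above does not say how the set $S_t$ is fed to the LSTM).

```python
import numpy as np

rng = np.random.default_rng(0)
H, W, d, h_dim, N = 32, 32, 8, 16, 5  # hypothetical sizes

# Untrained random weights: stand-ins for the real encoder and LSTMs.
W_R = rng.normal(scale=0.01, size=(H * W * 3, d))
W_D = rng.normal(scale=0.1, size=(d + 2 + h_dim, h_dim))
W_B = rng.normal(scale=0.1, size=(2 + d + h_dim, 5))

def R(frame):
    """Environment encoder stand-in: frame (H, W, 3) -> l_t (d,)."""
    return frame.reshape(-1) @ W_R

def D(l_t, S_t, h_prev):
    """Global dynamics stand-in: (l_t, S_t, h_{t-1}) -> h_t (h_dim,).
    Mean-pooling the N positions is an assumption of this sketch."""
    pooled = S_t.mean(axis=0)
    return np.tanh(np.concatenate([l_t, pooled, h_prev]) @ W_D)

def B(s_a, l_t, h_t):
    """Trajectory predictor stand-in: returns 5 raw Gaussian parameters
    (two means, two scales, one correlation) for a single agent."""
    return np.concatenate([s_a, l_t, h_t]) @ W_B

frame = rng.random((H, W, 3))      # I_t
S_t = rng.random((N, 2))           # positions of all N agents
l_t = R(frame)                     # one scene code shared by all agents
h_t = D(l_t, S_t, np.zeros(h_dim)) # one dynamics summary shared by all agents
params = np.stack([B(s, l_t, h_t) for s in S_t])  # (N, 5), one row per agent
```

Note that `l_t` and `h_t` are computed once per time step and reused for every agent, which is the "broadcast" behaviour described above.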

2. Objective Functions and Formalization

The HP-Net framework is trained with four distinct objectives:

Environment (Spatial) Reconstruction Loss, $L_\mathrm{env}$: The environment encoder $R$ is regularized using an InfoVAE setup, which combines a Maximum Mean Discrepancy (MMD) prior-matching term with an image reconstruction term:

$L_\mathrm{env} = \mathrm{MMD}(q_\phi(l) \Vert p(l)) + \mathbb{E}_{I \sim p_\mathrm{data}} \mathbb{E}_{l \sim q_\phi(\cdot|I)} [-\log p_\theta(I|l)]$
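As a rough illustration of the two terms in $L_\mathrm{env}$, the sketch below estimates the MMD from samples with an RBF kernel and uses a squared-error reconstruction term. The kernel choice, the bandwidth, the biased estimator, and the squared-error stand-in for $-\log p_\theta(I|l)$ (a fixed-variance Gaussian decoder) are all assumptions of this example; the function names are hypothetical.

```python
import numpy as np

def rbf_kernel(x, y, bandwidth):
    """Pairwise RBF kernel between rows of x (n, d) and y (m, d)."""
    sq = np.sum(x**2, 1)[:, None] + np.sum(y**2, 1)[None, :] - 2 * x @ y.T
    return np.exp(-np.maximum(sq, 0.0) / (2 * bandwidth**2))

def mmd2(z_q, z_p, bandwidth=1.0):
    """Biased sample estimate of the squared MMD between two sample sets."""
    return (rbf_kernel(z_q, z_q, bandwidth).mean()
            + rbf_kernel(z_p, z_p, bandwidth).mean()
            - 2 * rbf_kernel(z_q, z_p, bandwidth).mean())

def env_loss(z_q, z_p, recon, images):
    """MMD term plus a per-image squared-error reconstruction term.
    z_q: encoder samples, z_p: prior samples, recon/images: (B, H, W, 3)."""
    pixel_axes = tuple(range(1, images.ndim))
    recon_term = np.mean(np.sum((recon - images) ** 2, axis=pixel_axes))
    return mmd2(z_q, z_p) + recon_term
```

In use, `z_p` would be drawn from the prior $p(l)$ (for example a standard normal) with the same dimensionality as the encoder outputs `z_q`.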

Global Dynamics Prediction Loss, $L_\mathrm{dyn}$: The LSTM-MDN in $D$ is trained to maximize the likelihood of the next latent code $l_{t+1}$ under a $K$-component mixture of Gaussians:

$L_\mathrm{dyn} = -\sum_{t=1}^T \log \sum_{k=1}^K \pi_k(h_t)\, \mathcal{N}(l_{t+1} | \mu_k(h_t), \Sigma_k(h_t))$
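A numerically stable way to evaluate this mixture likelihood is to work in log space. The sketch below assumes diagonal covariances $\Sigma_k$ (parameterized by per-dimension log standard deviations) and already-normalized log mixture weights; both are simplifying assumptions, and the function names are hypothetical.

```python
import numpy as np

def logsumexp(a, axis):
    """Stable log(sum(exp(a))) along one axis."""
    m = np.max(a, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.sum(np.exp(a - m), axis=axis))

def mdn_nll(target, log_pi, mu, log_sigma):
    """Negative log-likelihood of targets under a diagonal Gaussian mixture.

    target:    (T, d)     next latent codes l_{t+1}
    log_pi:    (T, K)     log mixture weights (each row sums to 1 after exp)
    mu:        (T, K, d)  component means
    log_sigma: (T, K, d)  component log standard deviations
    """
    diff = (target[:, None, :] - mu) / np.exp(log_sigma)
    # Log density of each component, summed over the d latent dimensions.
    log_comp = -0.5 * np.sum(diff**2 + 2 * log_sigma + np.log(2 * np.pi), axis=-1)
    # Sum over time of -log sum_k pi_k N(target | mu_k, sigma_k).
    return -np.sum(logsumexp(log_pi + log_comp, axis=1))
```

With a single standard-normal component in one dimension and a target at the mean, the loss reduces to $\tfrac{1}{2}\log 2\pi$, which is a convenient sanity check.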

Local Trajectory (Motion) Prediction Loss, $L_\mathrm{pred}$: Each agent's trajectory is modeled by minimizing the negative log-likelihood under the predicted Gaussian at each step of the prediction horizon:

$L_\mathrm{pred} = -\sum_{t=\mathrm{obs}}^{\mathrm{obs}+\mathrm{pred}-1} \sum_{a=1}^N \log \mathcal{N}(s_{t+1}^a | \mu_t^a, \Sigma_t^a)$
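For a bivariate Gaussian the log-density has a closed form in terms of two means, two standard deviations, and a correlation coefficient. The sketch below assumes that $\Sigma_t^a$ is parameterized this way, i.e. by $(\sigma_x, \sigma_y, \rho)$; the function name and array layout are hypothetical.

```python
import numpy as np

def bivariate_nll(target, mu, sigma, rho):
    """Summed negative log-likelihood under bivariate Gaussians.

    target, mu, sigma: arrays of shape (..., 2) holding (x, y) components
    rho:               array of shape (...), correlation in (-1, 1)
    """
    dx = (target[..., 0] - mu[..., 0]) / sigma[..., 0]
    dy = (target[..., 1] - mu[..., 1]) / sigma[..., 1]
    one_minus_rho2 = 1.0 - rho**2
    # Quadratic form of the standardized residuals.
    z = dx**2 + dy**2 - 2.0 * rho * dx * dy
    # Log of the normalizing constant 2*pi*sigma_x*sigma_y*sqrt(1 - rho^2).
    log_norm = np.log(2.0 * np.pi * sigma[..., 0] * sigma[..., 1]
                      * np.sqrt(one_minus_rho2))
    return np.sum(z / (2.0 * one_minus_rho2) + log_norm)
```

Passing arrays with leading shape `(T, N)` sums over both prediction steps and agents, matching the double sum in $L_\mathrm{pred}$.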

Adaptation Loss, $L_\mathrm{adapt}$: For unsupervised adaptation to a new domain (no trajectory labels), $R$ (and optionally $D$) is re-trained using only $L_\mathrm{env}$ (and optionally $L_\mathrm{dyn}$), keeping $B$ frozen:

$L_\mathrm{adapt} = L_\mathrm{env}(\text{new images}) + \lambda_\mathrm{dyn} L_\mathrm{dyn}(\text{new images})$

3. Training Protocol and Unsupervised Domain Adaptation

HP-Net is trained in a modular, stage-wise procedure:

Stage 1: Train spatial encoder $R$ on raw images with $L_\mathrm{env}$.
Stage 2: With $R$ frozen, train $D$ on sequential pairs $(l_t, S_t)$ using $L_\mathrm{dyn}$.
Stage 3: With $R$ and $D$ frozen, train $B$ on agent trajectories using $L_\mathrm{pred}$.

For unsupervised adaptation, the model collects unlabeled frames from the target environment, re-trains $R$ (and optionally $D$) using only $L_\mathrm{env}$ (and optionally $L_\mathrm{dyn}$), and deploys the fixed $B$. Weakly supervised or few-shot adaptation is possible by fine-tuning $D$ and $B$ with limited labels.
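The stage-wise schedule and the adaptation step can be summarized as a small table of which modules are updated and which losses drive them. The following sketch only encodes that bookkeeping; the `Stage` class and the function names are hypothetical and not part of any published code.

```python
from dataclasses import dataclass

MODULES = ("R", "D", "B")

@dataclass(frozen=True)
class Stage:
    name: str
    trainable: tuple  # modules whose weights are updated in this stage
    losses: tuple     # objectives minimized in this stage

    @property
    def frozen(self):
        """Modules that are left untouched in this stage."""
        return tuple(m for m in MODULES if m not in self.trainable)

# The three pre-training stages described above.
PRETRAINING = (
    Stage("stage 1", ("R",), ("L_env",)),
    Stage("stage 2", ("D",), ("L_dyn",)),
    Stage("stage 3", ("B",), ("L_pred",)),
)

def adaptation_stage(include_dynamics=False):
    """Unsupervised adaptation: B always stays frozen."""
    if include_dynamics:
        return Stage("adapt", ("R", "D"), ("L_env", "L_dyn"))
    return Stage("adapt", ("R",), ("L_env",))

for stage in PRETRAINING + (adaptation_stage(),):
    print(f"{stage.name}: train {stage.trainable}, "
          f"keep fixed {stage.frozen}, losses {stage.losses}")
```

In a deep-learning framework, the `frozen` tuple would translate into disabling gradient updates for those modules during the corresponding stage.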

Optimization details include Adam or RMSProp optimizers, a learning rate between $10^{-4}$ and $10^{-3}$, scene-wise mini-batching, and control of uncertainty in dynamic predictions via an output temperature.
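The text does not specify how the output temperature enters the model. One common convention for MDN-based recurrent models, assumed here purely for illustration, divides the mixture logits by a temperature $\tau$ and scales each component's standard deviation by $\sqrt{\tau}$, so that small $\tau$ yields near-deterministic predictions. The function name and argument layout are hypothetical.

```python
import numpy as np

def sample_mdn(logits, mu, sigma, tau, rng):
    """Draw one sample from a diagonal Gaussian mixture with temperature tau.

    logits: (K,)    unnormalized mixture log-weights
    mu:     (K, d)  component means
    sigma:  (K, d)  component standard deviations
    tau:    float > 0; smaller values give less random samples
    """
    scaled = logits / tau
    scaled = scaled - scaled.max()              # stabilize the softmax
    pi = np.exp(scaled) / np.exp(scaled).sum()  # tempered mixture weights
    k = rng.choice(len(pi), p=pi)               # pick a component
    noise = rng.standard_normal(mu.shape[1])
    return mu[k] + sigma[k] * np.sqrt(tau) * noise
```

With $\tau = 1$ this reduces to ordinary sampling from the mixture; as $\tau \to 0$ the sample collapses onto the mean of the most probable component.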

4. Integration of Spatial and Motion Cues in Trajectory Prediction

In HP-Net, the environment code ($l_t$) and global dynamics ($h_t$) are concatenated into the per-agent input for the trajectory LSTM, ensuring that both static spatial layout and interaction-driven context are provided at every step:

$x_t^a = [s_t^a, l_t, h_t]$
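Because $l_t$ and $h_t$ are shared by all agents, the inputs for every agent at one time step can be built in a single broadcasted concatenation. The helper below is a hypothetical sketch with assumed array shapes.

```python
import numpy as np

def build_agent_inputs(S_t, l_t, h_t):
    """Stack x_t^a = [s_t^a, l_t, h_t] for all agents.

    S_t: (N, 2) agent positions, l_t: (d,) scene code, h_t: (h,) dynamics summary
    Returns an array of shape (N, 2 + d + h).
    """
    context = np.concatenate([l_t, h_t])                       # shared context
    tiled = np.broadcast_to(context, (S_t.shape[0], context.size))
    return np.concatenate([S_t, tiled], axis=1)
```

Only the first two columns differ between rows; the remaining columns are identical for every agent, which reflects the absence of agent-to-agent message passing.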

There is no explicit graph-based interaction; the model differentiates itself from graph-based or purely local techniques by using a global latent scene code and dynamics summary for all agents, which facilitates rapid adaptation and transfer.

5. Empirical Performance and Domain Transfer

Table: Prediction errors on ETH/UCY (pixel-normalized), reported as average / final displacement error

Method	3.2 s horizon (avg / final)	4.8 s horizon (avg / final)
Social-LSTM	0.080 / 0.160	0.130 / 0.261
SNS-LSTM	0.035 / 0.140	0.040 / 0.228
HP-Net (RDB)	0.046 / 0.088	0.070 / 0.137

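On the ETH/UCY leave-one-scene-out benchmark, HP-Net attains the lowest final displacement error in the table at both horizons (0.088 and 0.137), while SNS-LSTM retains the lowest average error (0.035 and 0.040). HP-Net's average error is lower than that of Social-LSTM at both horizons.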
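The "avg" and "final" columns correspond to the average and final displacement errors (ADE and FDE). A minimal sketch of how these metrics are usually computed is given below; the array layout is an assumption, and any pixel normalization would be applied to the coordinates beforehand.

```python
import numpy as np

def ade_fde(pred, truth):
    """Average and final displacement error.

    pred, truth: (N, T, 2) predicted and ground-truth positions
                 for N agents over T prediction steps.
    """
    dist = np.linalg.norm(pred - truth, axis=-1)  # (N, T) Euclidean errors
    ade = dist.mean()          # mean over all agents and steps
    fde = dist[:, -1].mean()   # mean over agents at the last step only
    return ade, fde
```

ADE rewards accuracy along the whole path, whereas FDE only scores the endpoint, so a method can lead on one metric and trail on the other.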

Cross-domain transfer:

Unsupervised adaptation (crowd $\to$ robot): ADE $\approx$ 0.27 (vs. 0.10 fully supervised)
Unsupervised adaptation (robot $\to$ crowd): ADE $\approx$ 0.16 (vs. 0.04 fully supervised)

These results indicate that unsupervised adaptation is feasible but comes at a cost in accuracy: ADE is roughly 2.7 times the fully supervised value for crowd $\to$ robot (0.27 vs. 0.10) and 4 times for robot $\to$ crowd (0.16 vs. 0.04). The isolation of spatial, dynamic, and trajectory modules enables "plug-and-play" transfer: only the scene representations need re-training, while the trajectory predictor is reused unchanged across tasks and visual domains.

6. Implications and Significance

The HP-Net decoupling of local, spatial, and dynamic representations enables:

Label-efficient transfer: Unsupervised adaptation with raw video only, without the need for new trajectory labels in the novel environment, which is particularly significant in robotics or crowded scenes where annotations are expensive.
Informed dynamic prediction: Global context from $R$ and $D$ allows the trajectory predictor $B$ to leverage both static and dynamic scene factors, yielding improved generalization and stability across visual domains.
Structured modularity: Explicit separation of modules allows extensibility (e.g., swapping $R$ for a more powerful spatial encoder or replacing $B$ with a higher-order agent motion predictor).

A plausible implication is that such modular, co-learned heatmap pooling representations can serve as a backbone for broader multi-agent prediction systems that require adaptability and cross-domain robustness.

7. Limitations and Future Directions

The HP-Net strategy, while highly effective for label-efficient transfer and robust prediction, does not incorporate explicit message-passing or agent-centric graph reasoning; all agent interactions are summarized globally. Extending the architecture with higher-order interaction mechanisms or incorporating richer, task-specific adaptation strategies may yield improved performance in domains with more complex or hierarchical social dynamics. Further, the InfoVAE-based spatial representation is tailored for environments where scene layouts are critical; generalization to domains with ambiguous or dynamic background elements remains an open research question.

For an in-depth technical specification and empirical validation, see "Learning Structured Representations of Spatial and Interactive Dynamics for Trajectory Prediction in Crowded Scenes" (Davchev et al., 2019).


References (1)

Learning Structured Representations of Spatial and Interactive Dynamics for Trajectory Prediction in Crowded Scenes (2019)
