
Heatmap Pooling Network (HP-Net)

Updated 7 December 2025
  • HP-Net is a modular network architecture that integrates spatial scene encoding and dynamic, interaction-aware context for label-efficient trajectory prediction.
  • It employs three key modules—environment encoder, global dynamics encoder, and trajectory predictor—each optimized with distinct loss functions to capture scene structure and motion.
  • Empirical results demonstrate its robust performance and unsupervised domain adaptation across diverse settings such as crowded scenes and robotics.

A Heatmap Pooling Network (HP-Net) is a modular network architecture for trajectory prediction that jointly learns spatial scene encodings and dynamic, interaction-aware context, enabling label-efficient transfer to novel environments without supervised trajectory labels. HP-Net, as described in "Learning Structured Representations of Spatial and Interactive Dynamics for Trajectory Prediction in Crowded Scenes" (Davchev et al., 2019), is built upon three submodules—an environment (spatial) encoder, a global dynamics encoder, and a local trajectory predictor—each optimized under complementary objectives. Their structured interaction enables robust prediction across domains, including unsupervised adaptation to new visual contexts using only unlabelled image data.

1. Architectural Components and Data Flow

HP-Net comprises three primary modules:

  1. Environment Encoder (R):
    • Input: raw scene frame $I_t \in \mathbb{R}^{H \times W \times 3}$
    • Output: low-dimensional latent representation $l_t \in \mathbb{R}^d$, encoding spatial structure such as static obstacles, walkways, and scene layouts.
  2. Global Dynamics Encoder (D):
    • Input: latent code $l_t$ and the set of all current agent positions $S_t = \{s_t^1, \ldots, s_t^N\}$, where $s_t^a \in \mathbb{R}^2$.
    • Architecture: LSTM followed by a Mixture Density Network (MDN).
    • Output: a stochastic prediction of the next scene code $l_{t+1}$ and a summary vector $h_t \in \mathbb{R}^h$ capturing aggregate scene dynamics, such as dominant flows or typical movement patterns.
  3. Trajectory Predictor (B):
    • Input: for each agent $a$, the tuple $(s_t^a, l_t, h_t)$.
    • Architecture: agent-specific LSTM.
    • Output: parameters $(\mu, \Sigma, \rho)$ of a bivariate Gaussian distribution predicting the agent’s next position.
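
To make the trajectory predictor's output concrete, the following minimal sketch evaluates the log-density of a bivariate Gaussian from five numbers per agent. It assumes that $\Sigma$ stands for the per-axis standard deviations $(\sigma_x, \sigma_y)$ and $\rho$ for their correlation, as in Social-LSTM-style predictors; the source does not spell this out. The function names and the exp/tanh squashing of raw network outputs are illustrative assumptions, not details from the paper.

```python
import math

def to_valid_params(raw):
    """Map five unconstrained outputs to valid Gaussian parameters.

    Assumed convention: exp keeps the standard deviations positive and
    tanh keeps the correlation strictly inside (-1, 1).
    """
    mu_x, mu_y, raw_sx, raw_sy, raw_rho = raw
    return mu_x, mu_y, math.exp(raw_sx), math.exp(raw_sy), math.tanh(raw_rho)

def bivariate_gaussian_logpdf(x, y, mu_x, mu_y, sigma_x, sigma_y, rho):
    """Log-density of a 2-D Gaussian given means, std devs and correlation."""
    zx = (x - mu_x) / sigma_x
    zy = (y - mu_y) / sigma_y
    one_minus_rho2 = 1.0 - rho * rho
    quad = (zx * zx - 2.0 * rho * zx * zy + zy * zy) / one_minus_rho2
    log_norm = math.log(2.0 * math.pi * sigma_x * sigma_y) + 0.5 * math.log(one_minus_rho2)
    return -0.5 * quad - log_norm
```

Summing the negated log-density over agents and prediction steps gives a negative log-likelihood of the kind used to train the predictor.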

The overall system can be represented as:

  • $I_t \xrightarrow{R} l_t$
  • $S_t, l_t \xrightarrow{D} h_t$
  • Per-agent: $s_t^a, l_t, h_t \xrightarrow{B} p(s_{t+1}^a)$

No explicit message-passing or graph aggregation among agents occurs; instead, all agent interactions are summarized globally via $D$, and this context is broadcast to every agent through $h_t$.
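
The broadcast pattern described above can be illustrated with a small, dependency-free sketch. The two encoder functions are placeholders that only reproduce the input and output shapes of $R$ and $D$; they are not learned models, and their names, the dimensions `D_LATENT` and `D_HIDDEN`, and the averaging they perform are all assumptions made for illustration.

```python
D_LATENT = 4   # assumed size of the scene code l_t
D_HIDDEN = 3   # assumed size of the dynamics summary h_t

def encode_environment(frame):
    """Placeholder for R: an H x W x 3 frame becomes a D_LATENT-dim code."""
    flat = [v for row in frame for pixel in row for v in pixel]
    mean = sum(flat) / len(flat)
    return [mean] * D_LATENT

def encode_dynamics(l_t, positions):
    """Placeholder for D: (l_t, S_t) becomes a D_HIDDEN-dim summary h_t."""
    n = len(positions)
    cx = sum(p[0] for p in positions) / n
    cy = sum(p[1] for p in positions) / n
    return [cx, cy, sum(l_t) / len(l_t)]

def predictor_inputs(frame, positions):
    """Build one input per agent; l_t and h_t are shared by all agents."""
    l_t = encode_environment(frame)
    h_t = encode_dynamics(l_t, positions)
    return [list(s) + l_t + h_t for s in positions]
```

Each agent's input differs only in its own position; the remaining entries are identical across agents, which is the sense in which interaction context is global rather than pairwise.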

2. Objective Functions and Formalization

HP-Net is trained with four objectives:

  1. Environment (Spatial) Reconstruction Loss ($L_\mathrm{env}$): The environment encoder $R$ is regularized using an InfoVAE setup, which combines a Maximum Mean Discrepancy (MMD) prior-matching term with an image reconstruction term:

$$L_\mathrm{env} = \mathrm{MMD}(q_\phi(l) \,\Vert\, p(l)) + \mathbb{E}_{I \sim p_\mathrm{data}} \mathbb{E}_{l \sim q_\phi(\cdot \mid I)} [-\log p_\theta(I \mid l)]$$

  2. Global Dynamics Prediction Loss ($L_\mathrm{dyn}$): The LSTM-MDN in $D$ is trained to maximize the likelihood of the next latent code $l_{t+1}$ under a $K$-component mixture of Gaussians:

$$L_\mathrm{dyn} = -\sum_{t=1}^{T} \log \sum_{k=1}^{K} \pi_k(h_t)\, \mathcal{N}(l_{t+1} \mid \mu_k(h_t), \Sigma_k(h_t))$$

  3. Local Trajectory (Motion) Prediction Loss ($L_\mathrm{pred}$): Each agent’s trajectory is modeled by minimizing the negative log-likelihood under the predicted Gaussian at each step:

$$L_\mathrm{pred} = -\sum_{t=\mathrm{obs}}^{\mathrm{obs}+\mathrm{pred}-1} \sum_{a=1}^{N} \log \mathcal{N}(s_{t+1}^a \mid \mu_t^a, \Sigma_t^a)$$

  4. Adaptation Loss ($L_\mathrm{adapt}$): For unsupervised adaptation to a new domain (no trajectory labels), $R$ (and optionally $D$) is re-trained using only $L_\mathrm{env}$ (and optionally $L_\mathrm{dyn}$), keeping $B$ frozen:

$$L_\mathrm{adapt} = L_\mathrm{env}(\text{new images}) + \lambda_\mathrm{dyn} L_\mathrm{dyn}(\text{new images})$$
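
As a rough illustration of the first two objectives, the sketch below computes a sample-based MMD estimate and a single-step mixture-density negative log-likelihood in plain Python. Several choices are assumptions not stated in the source: the RBF kernel and its bandwidth, the biased (V-statistic) MMD estimator, and diagonal covariances for the mixture components. The function names are hypothetical.

```python
import math

def rbf_kernel(a, b, bandwidth=1.0):
    """Gaussian (RBF) kernel between two equal-length vectors."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))

def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased estimate of squared MMD between two sample sets."""
    def mean_kernel(us, vs):
        total = sum(rbf_kernel(u, v, bandwidth) for u in us for v in vs)
        return total / (len(us) * len(vs))
    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)

def mdn_nll(target, weights, means, stds):
    """Negative log-likelihood of `target` under a diagonal Gaussian mixture.

    weights: K mixture weights summing to 1
    means, stds: K vectors with the same length as `target`
    """
    log_terms = []
    for w, mu, sd in zip(weights, means, stds):
        log_normal = sum(
            -0.5 * ((t - m) / s) ** 2 - math.log(s) - 0.5 * math.log(2.0 * math.pi)
            for t, m, s in zip(target, mu, sd)
        )
        log_terms.append(math.log(w) + log_normal)
    # log-sum-exp keeps the mixture sum numerically stable
    top = max(log_terms)
    return -(top + math.log(sum(math.exp(v - top) for v in log_terms)))
```

In an InfoVAE-style setup, `xs` would be encoded latent codes and `ys` samples from the prior; `mdn_nll` would be summed over time steps to obtain a sequence-level loss.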

3. Training Protocol and Unsupervised Domain Adaptation

HP-Net is trained in a modular, stage-wise procedure:

  • Stage 1: Train spatial encoder $R$ on raw images with $L_\mathrm{env}$.
  • Stage 2: With $R$ frozen, train $D$ on sequential pairs $(l_t, S_t)$ using $L_\mathrm{dyn}$.
  • Stage 3: With $R$ and $D$ frozen, train $B$ on agent trajectories using $L_\mathrm{pred}$.

For unsupervised adaptation, unlabeled frames are collected from the target environment, $R$ (and optionally $D$) is re-trained with image-based losses only ($L_\mathrm{env}$ and $L_\mathrm{dyn}$), and the fixed $B$ is deployed unchanged. Weakly supervised or few-shot adaptation is possible by fine-tuning $D$ and $B$ with limited labels.
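
The stage-wise schedule and the adaptation step can be summarized as a small configuration table. This is only a bookkeeping sketch: the dictionary layout, the stage keys, and the helper names are hypothetical, and the default `lambda_dyn=1.0` is an arbitrary placeholder rather than a value from the source.

```python
MODULES = ("R", "D", "B")

# Which modules receive gradient updates, and under which loss, per phase.
SCHEDULE = {
    "stage_1": {"trainable": {"R"}, "losses": ["L_env"]},
    "stage_2": {"trainable": {"D"}, "losses": ["L_dyn"]},
    "stage_3": {"trainable": {"B"}, "losses": ["L_pred"]},
    # Unsupervised adaptation: R, and optionally D, on unlabeled target frames.
    "adapt": {"trainable": {"R", "D"}, "losses": ["L_env", "L_dyn"]},
}

def modules_not_updated(phase):
    """Modules whose weights stay fixed during the given phase."""
    return [m for m in MODULES if m not in SCHEDULE[phase]["trainable"]]

def adaptation_loss(l_env, l_dyn, lambda_dyn=1.0):
    """Weighted sum matching L_adapt; lambda_dyn = 0 adapts R alone."""
    return l_env + lambda_dyn * l_dyn
```

For example, `modules_not_updated("adapt")` lists only `B`, reflecting that the trajectory predictor is reused without new trajectory labels.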

Optimization details include Adam or RMSProp optimizers, learning rates of $10^{-4}$ to $10^{-3}$, scene-wise mini-batching, and control of uncertainty in dynamic predictions via output temperature.
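
The source does not say how the output temperature is applied. One common convention for MDN sampling, used in sketch-rnn-style models, divides the mixture logits by a temperature $\tau$ and scales each standard deviation by $\sqrt{\tau}$; the sketch below follows that convention purely as an assumption, and `apply_temperature` is a hypothetical helper name.

```python
import math

def apply_temperature(logits, stds, tau):
    """Return temperature-adjusted mixture weights and standard deviations.

    Assumed convention: tau < 1 sharpens the mixture and shrinks the
    spread, while tau > 1 flattens the mixture and widens the spread.
    """
    scaled = [z / tau for z in logits]
    top = max(scaled)                      # subtract max for a stable softmax
    exps = [math.exp(z - top) for z in scaled]
    total = sum(exps)
    weights = [e / total for e in exps]
    scaled_stds = [s * math.sqrt(tau) for s in stds]
    return weights, scaled_stds
```

With `tau=1.0` the function reduces to an ordinary softmax over the logits and leaves the standard deviations unchanged.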

4. Integration of Spatial and Motion Cues in Trajectory Prediction

In HP-Net, the environment code ($l_t$) and global dynamics ($h_t$) are concatenated into the per-agent input for the trajectory LSTM, ensuring that both static spatial layout and interaction-driven context are provided at every step:

  • $x_t^a = [s_t^a, l_t, h_t]$

There is no explicit graph-based interaction; the model differentiates itself from graph-based or purely local techniques by using a global latent scene code and dynamics summary for all agents, which facilitates rapid adaptation and transfer.

5. Empirical Performance and Domain Transfer

Table: Prediction Errors on ETH/UCY (pixel-normalized)

| Method | 3.2 s (avg / final) | 4.8 s (avg / final) |
| --- | --- | --- |
| Social-LSTM | 0.080 / 0.160 | 0.130 / 0.261 |
| SNS-LSTM | 0.035 / 0.140 | 0.040 / 0.228 |
| HP-Net (RDB) | 0.046 / 0.088 | 0.070 / 0.137 |

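On the ETH/UCY benchmark for leave-one-scene-out trajectory prediction, HP-Net attains the lowest final error at both horizons (0.088 and 0.137), while SNS-LSTM attains the lowest average error (0.035 and 0.040); both improve on Social-LSTM in every column.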
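
The "avg / final" columns are read here as the usual average and final displacement errors (ADE and FDE), which is an assumption about the table rather than something the source states. Under that reading, both metrics for a single trajectory can be computed as in the sketch below; `displacement_errors` is a hypothetical helper name.

```python
import math

def displacement_errors(predicted, ground_truth):
    """Return (average, final) Euclidean error over one predicted path.

    Both arguments are equal-length sequences of (x, y) positions.
    """
    distances = [math.dist(p, g) for p, g in zip(predicted, ground_truth)]
    average_error = sum(distances) / len(distances)
    final_error = distances[-1]
    return average_error, final_error
```

Benchmark figures are typically obtained by averaging these per-trajectory values over all agents and test sequences.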

Cross-domain transfer:

  • Unsupervised adaptation (crowd $\to$ robot): ADE $\approx$ 0.27 (vs. 0.10 fully supervised)
  • Unsupervised adaptation (robot $\to$ crowd): ADE $\approx$ 0.16 (vs. 0.04 fully supervised)

These results indicate that unsupervised domain adaptation is feasible without target trajectory labels, although the adapted models incur roughly 2.7 to 4 times the ADE of their fully supervised counterparts. The isolation of spatial, dynamic, and trajectory modules enables “plug-and-play” transfer: only the scene representations need re-training, while the trajectory predictor is reused unchanged across tasks and visual domains.

6. Implications and Significance

The HP-Net decoupling of local, spatial, and dynamic representations enables:

  • Label-efficient transfer: Unsupervised adaptation with raw video only, without the need for new trajectory labels in the novel environment, which is particularly significant in robotics or crowded scenes where annotations are expensive.
  • Informed dynamic prediction: Global context from $R$ and $D$ allows the trajectory predictor $B$ to leverage both static and dynamic scene factors, yielding improved generalization and stability across visual domains.
  • Structured modularity: Explicit separation of modules allows extensibility (e.g., swapping $R$ for a more powerful spatial encoder or replacing $B$ with a higher-order agent motion predictor).

A plausible implication is that such modular, co-learned heatmap pooling representations can serve as a backbone for broader multi-agent prediction systems that require adaptability and cross-domain robustness.

7. Limitations and Future Directions

The HP-Net strategy, while highly effective for label-efficient transfer and robust prediction, does not incorporate explicit message-passing or agent-centric graph reasoning; all agent interactions are summarized globally. Extending the architecture with higher-order interaction mechanisms or incorporating richer, task-specific adaptation strategies may yield improved performance in domains with more complex or hierarchical social dynamics. Further, the InfoVAE-based spatial representation is tailored for environments where scene layouts are critical; generalization to domains with ambiguous or dynamic background elements remains an open research question.


For an in-depth technical specification and empirical validation, see "Learning Structured Representations of Spatial and Interactive Dynamics for Trajectory Prediction in Crowded Scenes" (Davchev et al., 2019).
