
SGNetPose+: Pose-Driven Trajectory Forecasting

Updated 11 December 2025
  • The paper introduces a dual-encoder, stepwise goal attention, and CVAE module to integrate pose and bounding box data for trajectory forecasting.
  • It employs ViTPose to extract 2D keypoints and computes key joint angles to capture nuanced pedestrian movements.
  • Empirical evaluations on JAAD_pose and PIE_pose datasets demonstrate reduced forecasting errors and improved prediction metrics.

SGNetPose+ refers to a set of advanced stepwise goal-driven neural architectures that incorporate explicit pose information for structured prediction tasks—primarily pedestrian trajectory forecasting for autonomous driving, but also, in alternative formulations, for high-dimensional whole-body 3D pose estimation. The essential innovation of SGNetPose+ is to condition predictive modeling not only on bounding-box trajectories, but also on fine-grained pose cues such as joint coordinates or body-segment angles, using a modular design that encodes spatial–temporal dependencies via recurrent attention and semantic graph networks (Ghiya et al., 11 Mar 2025, Wen et al., 2024).

1. Architectural Foundation and Motivation

SGNetPose+ builds upon the original SGNet (Salient Geometric Network) framework, which was developed for geometric tasks such as point cloud registration (Wu et al., 2023). The stepwise goal-driven paradigm is adapted by SGNetPose+ for sequential pedestrian trajectory prediction, addressing the limitations of purely bounding-box-based models that lack contextual cues about pedestrians’ intent or imminent changes in movement.

Motivation arises from the observation that human locomotion is highly non-linear and context-dependent—critical for autonomous driving systems, which must anticipate rare but hazardous behaviors (e.g., abrupt crossing, acceleration, turning). SGNetPose+ augments the trajectory modeling pipeline through explicit encoding of 2D pose skeletons or joint angles, enabling it to leverage kinematic patterns such as knee flexion, stride initiation, or torso lean, which serve as predictive signals for future motion (Ghiya et al., 11 Mar 2025).

2. Pose Extraction and Representation

Pose extraction in SGNetPose+ is performed using a pre-trained ViTPose human-pose transformer to generate 13 two-dimensional keypoints per pedestrian per frame: nose, shoulders, elbows, wrists, hips, knees, and ankles. Each keypoint is a vector $p_j \in \mathbb{R}^2$ for $j = 1, \ldots, 13$.

From these joints, 12 anatomically meaningful angles are computed using the formula
$$\theta_{ijk} = \arccos \left( \frac{(p_i - p_j) \cdot (p_k - p_j)}{\|p_i - p_j\| \, \|p_k - p_j\|} \right),$$
where $\theta_{ijk}$ is the angle at joint $j$ formed by segments $i$–$j$ and $k$–$j$. This parameterization encodes both pose and underlying body kinematics, as in the left knee angle (hip–knee–ankle) or torso orientation (hip–shoulder–contralateral hip) (Ghiya et al., 11 Mar 2025).
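As a concrete illustration, the angle formula can be implemented directly. This is a minimal sketch; `joint_angle` is a hypothetical helper, not taken from the released code:

```python
import numpy as np

def joint_angle(p_i, p_j, p_k):
    """Angle (radians) at joint j formed by segments i-j and k-j."""
    v1 = np.asarray(p_i, dtype=float) - np.asarray(p_j, dtype=float)
    v2 = np.asarray(p_k, dtype=float) - np.asarray(p_j, dtype=float)
    cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    # Clip guards against floating-point values slightly outside [-1, 1]
    return np.arccos(np.clip(cos, -1.0, 1.0))

# Example: a left knee angle from hypothetical hip, knee, ankle keypoints
knee_angle = joint_angle([0.0, 0.0], [0.0, 1.0], [0.5, 1.8])
```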

3. Data Augmentation and Preprocessing

To mitigate sample loss from invalid or undetectable pose frames, SGNetPose+ augments the data by horizontally flipping the video frames. This process doubles the training set, preserves dynamic constraints, and makes the network invariant to left-versus-right crossing maneuvers. For a frame of width $W$, coordinates $(x, y)$ are mapped to $(W - x, y)$ under flipping, with bounding boxes and pose keypoints transformed synchronously (Ghiya et al., 11 Mar 2025).
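The flip transform above can be sketched as follows; the function name, box corner convention, and keypoint layout are assumptions for illustration, not the released code's API:

```python
import numpy as np

def hflip_sample(bbox, keypoints, width):
    """Horizontally flip one frame's annotations: (x, y) -> (W - x, y).

    bbox: (x1, y1, x2, y2) corners; keypoints: (K, 2) array of (x, y).
    """
    x1, y1, x2, y2 = bbox
    flipped_bbox = (width - x2, y1, width - x1, y2)  # re-ordered so x1 <= x2
    kp = np.asarray(keypoints, dtype=float).copy()
    kp[:, 0] = width - kp[:, 0]
    # Note: a full pipeline would also swap left/right joint indices,
    # since e.g. the left knee becomes the right knee after flipping.
    return flipped_bbox, kp
```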

This strategy yields substantial increases in effective training-data size—118/18/102 batches for train/val/test splits on JAAD_pose, with similar expansion observed on the PIE dataset.

4. Network Architecture: Dual-Encoder and Stepwise Goal-Attention

SGNetPose+ consists of parallel input encoders for bounding box and pose features, a stepwise goal-driven attention (SGE) module, a conditional variational autoencoder (CVAE) for multimodal latent trajectory modeling, and a recurrent decoder.

  • Bounding-Box Encoder: Each frame's box $x_t \in \mathbb{R}^4$ is linearly embedded into a latent feature $x_t^e \in \mathbb{R}^d$.
  • Pose Encoder: 2D keypoints $p_t \in \mathbb{R}^{13 \times 2}$ are flattened and mapped via a fully connected layer with dropout to $p_t^e \in \mathbb{R}^d$.
  • Stepwise Goal-Driven Attention (SGE): At each time step $t$, the previous aggregated goal $\tilde{x}_t^e$ is concatenated with the box embedding $x_t^e$ and processed by recurrent units (GRUs) for the encoder ($h_t^e$) and the goal generator ($h_t^g$). Goal aggregation employs trainable attention:
$$w = \operatorname{Softmax}(W_w^T \tanh(h^g) + b_w), \qquad \tilde{x}_{t+i}^e = \sum_{s=t+i}^{t+l_d} w_s h_s^g$$
  • CVAE Goal Modeling: Captures uncertainty by inferring a latent variable $z$ via both a recognition network ($\mathcal{N}(\mu^q, \sigma^q)$) and a prior network ($\mathcal{N}(\mu^p, \sigma^p)$).
  • Decoder: Sequentially regresses future trajectory points $\hat{y}_{t+i}$ with GRU state updates and goal conditioning:
$$\hat{y}_{t+i} = W_o h_{t+i}^d + b_o, \qquad \hat{y} \in \mathbb{R}^2$$
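The goal-aggregation attention can be sketched in PyTorch as below. The module boundary, feature dimension, and horizon length are illustrative assumptions; the goal GRU that produces the hidden states $h_s^g$ is omitted:

```python
import torch
import torch.nn as nn

class GoalAttention(nn.Module):
    """Trainable goal aggregation: w = Softmax(W_w^T tanh(h^g) + b_w),
    followed by a weighted sum of the goal-GRU hidden states h^g."""

    def __init__(self, d):
        super().__init__()
        self.score = nn.Linear(d, 1)  # realizes W_w^T (.) + b_w as a learned scorer

    def forward(self, goal_states):
        # goal_states: (batch, horizon, d) hidden states h_s^g over future steps
        w = torch.softmax(self.score(torch.tanh(goal_states)), dim=1)  # (batch, horizon, 1)
        return (w * goal_states).sum(dim=1)  # (batch, d) aggregated goal feature

# Aggregate a 12-step goal horizon with d = 64 (both values illustrative)
attn = GoalAttention(d=64)
agg = attn(torch.randn(2, 12, 64))  # shape: (2, 64)
```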

5. Training Objectives and Loss Functions

SGNetPose+ is optimized end-to-end using a composite objective $\mathcal{L} = \mathcal{L}_\mathrm{traj} + \beta \mathcal{L}_\mathrm{KL}$, where:

  • $\mathcal{L}_\mathrm{traj} = \sum_{i=1}^{l_d} \| y_{t+i} - \hat{y}_{t+i} \|_2^2$ is the future-trajectory regression loss,
  • $\mathcal{L}_\mathrm{KL}$ is the Kullback–Leibler divergence between the posterior and prior of the CVAE latent variable,
  • $\beta = 1$ in practice (Ghiya et al., 11 Mar 2025).

Optimization is performed using the Adam algorithm (learning rate $10^{-3}$, batch size 16, dropout 0.5), with convergence in 30–40 epochs.
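The composite objective can be sketched as follows, assuming diagonal-Gaussian recognition and prior networks parameterized by means and log-variances; the exact batch and sample reductions in the released code may differ:

```python
import torch

def sgnet_loss(y_true, y_pred, mu_q, logvar_q, mu_p, logvar_p, beta=1.0):
    """L = L_traj + beta * L_KL for a CVAE with diagonal-Gaussian q and p.

    y_true, y_pred: (batch, l_d, 2) future trajectories.
    mu_*, logvar_*: (batch, z_dim) recognition (q) and prior (p) parameters.
    """
    # Squared-error trajectory loss, summed over coordinates and horizon
    l_traj = ((y_true - y_pred) ** 2).sum(dim=(-1, -2)).mean()
    # Closed-form KL( N(mu_q, sigma_q) || N(mu_p, sigma_p) )
    l_kl = 0.5 * (
        logvar_p - logvar_q
        + (logvar_q.exp() + (mu_q - mu_p) ** 2) / logvar_p.exp()
        - 1.0
    ).sum(dim=-1).mean()
    return l_traj + beta * l_kl

# Training setup reported in the paper: Adam with lr = 1e-3
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
```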

6. Experimental Evaluation and Results

SGNetPose+ is evaluated on the "JAAD_pose" and "PIE_pose" pedestrian datasets, both filtered for valid ViTPose extractions and augmented via flipping.

Performance is measured using:

  • $\mathrm{MSE}(k)$: mean squared error $k$ frames ahead,
  • $\mathrm{FMSE}$: MSE at the final frame of the prediction horizon,
  • $\mathrm{CMSE}$ / $\mathrm{CFMSE}$: centroid MSE and final-frame centroid MSE.
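A sketch of how these metrics could be computed from predicted and ground-truth bounding boxes follows; the array layout and box convention are assumptions, not the paper's evaluation code:

```python
import numpy as np

def bbox_mse_metrics(gt, pred):
    """Compute MSE(k), FMSE, CMSE, and CFMSE for predicted boxes.

    gt, pred: (N, horizon, 4) arrays of (x1, y1, x2, y2) boxes in pixels.
    """
    sq = ((gt - pred) ** 2).mean(axis=-1)   # (N, horizon) per-frame box MSE
    mse_k = sq.mean(axis=0)                 # MSE at each step k of the horizon
    fmse = sq[:, -1].mean()                 # MSE at the final predicted frame
    c_gt = (gt[..., :2] + gt[..., 2:]) / 2  # box centroids (x, y)
    c_pred = (pred[..., :2] + pred[..., 2:]) / 2
    csq = ((c_gt - c_pred) ** 2).mean(axis=-1)
    return mse_k, fmse, csq.mean(), csq[:, -1].mean()
```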

Quantitative improvements are evidenced as follows:

| Dataset | Model | MSE(15) | MSE(45/30) | FMSE | Relative Δ |
|---|---|---|---|---|---|
| JAAD_pose | SGNet bb | 68.46 | 384.25 | 1250.61 | – |
| JAAD_pose | SGNetPose+ | 62.62 | 347.07 | 1080.45 | –13.6% (FMSE) |
| PIE_pose | SGNet bb | 17.05 | 41.98 | – | – |
| PIE_pose | SGNetPose+ | 15.81 | 40.08 | – | –7.3% (MSE(15)) |

Empirical results confirm that the addition of pose information reduces both short-term and long-term predictive error for pedestrian trajectories (Ghiya et al., 11 Mar 2025).

7. Discussion, Limitations, and Extensions

SGNetPose+ demonstrates that integrating pose skeletons or body-segment angles provides fine-grained motion understanding and enhances goal consistency in stepwise sequential prediction. Its hierarchical coarse-to-fine structure, coupled with simple flip-based data augmentation, offsets the challenges posed by missing or noisy pose detections.

Notably, limitations arise from the restriction to 2D poses—without depth—and dataset reduction due to filtered frames. Remedying these shortcomings may involve 3D skeleton inference, domain adaptation for noisy video, and integration with transformer-based or real-time onboard systems.

A plausible implication—drawing on insights from SGNet's point cloud registration domain (Wu et al., 2023)—is that semantic-aware graph encoders, intrinsic saliency priors, and high-order geometric consistency can further enhance SGNetPose+ extensions. These additions could support not only trajectory prediction but also the joint regression of full 6D object pose in complex multi-object environments.

8. Implementation Notes

SGNetPose+ is publicly released in a modular PyTorch implementation, relying on ViTPose for pose detection and standard vision libraries. The design separates bounding box and pose encoders, goal attention, CVAE, and recurrent decoder modules, supporting flexible experimentation and downstream application (Ghiya et al., 11 Mar 2025).
