SGNetPose+: Pose-Driven Trajectory Forecasting
- The paper introduces a dual-encoder architecture with stepwise goal attention and a CVAE module to integrate pose and bounding-box data for trajectory forecasting.
- It employs ViTPose to extract 2D keypoints and computes key joint angles to capture nuanced pedestrian movements.
- Empirical evaluations on JAAD_pose and PIE_pose datasets demonstrate reduced forecasting errors and improved prediction metrics.
SGNetPose+ refers to a set of advanced stepwise goal-driven neural architectures that incorporate explicit pose information for structured prediction tasks—primarily pedestrian trajectory forecasting for autonomous driving, but also, in alternative formulations, for high-dimensional whole-body 3D pose estimation. The essential innovation of SGNetPose+ is to condition predictive modeling not only on bounding-box trajectories, but also on fine-grained pose cues such as joint coordinates or body-segment angles, using a modular design that encodes spatial–temporal dependencies via recurrent attention and semantic graph networks (Ghiya et al., 11 Mar 2025, Wen et al., 2024).
1. Architectural Foundation and Motivation
SGNetPose+ builds upon the original SGNet (Salient Geometric Network) framework, which was developed for geometric tasks such as point cloud registration (Wu et al., 2023). The stepwise goal-driven paradigm is adapted by SGNetPose+ for sequential pedestrian trajectory prediction, addressing the limitations of purely bounding-box-based models that lack contextual cues about pedestrians’ intent or imminent changes in movement.
Motivation arises from the observation that human locomotion is highly non-linear and context-dependent—critical for autonomous driving systems, which must anticipate rare but hazardous behaviors (e.g., abrupt crossing, acceleration, turning). SGNetPose+ augments the trajectory modeling pipeline through explicit encoding of 2D pose skeletons or joint angles, enabling it to leverage kinematic patterns such as knee flexion, stride initiation, or torso lean, which serve as predictive signals for future motion (Ghiya et al., 11 Mar 2025).
2. Pose Extraction and Representation
Pose extraction in SGNetPose+ is performed using a pre-trained ViTPose human-pose transformer to generate 13 two-dimensional keypoints per pedestrian per frame: nose, shoulders, elbows, wrists, hips, knees, and ankles. Each keypoint is a vector p_j = (x_j, y_j) ∈ ℝ² for j = 1, …, 13.
From these joints, 12 anatomically meaningful angles are computed using the formula

θ_j = arccos( ((p_i − p_j) · (p_k − p_j)) / (‖p_i − p_j‖ ‖p_k − p_j‖) ),

where θ_j is the angle at joint j formed by segments i–j and j–k. This parameterization encodes both pose and underlying body kinematics, as in the left knee angle (hip–knee–ankle) or torso orientation (hip–shoulder–contralateral hip) (Ghiya et al., 11 Mar 2025).
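The angle computation can be sketched in a few lines of numpy; the function name and the small epsilon guard against zero-length segments are illustrative choices, not from the paper:

```python
import numpy as np

def joint_angle(p_i, p_j, p_k):
    """Angle (radians) at joint j formed by segments i-j and j-k.

    Each argument is a 2D keypoint (x, y), e.g. one of the 13 ViTPose outputs.
    """
    v1 = np.asarray(p_i, dtype=float) - np.asarray(p_j, dtype=float)
    v2 = np.asarray(p_k, dtype=float) - np.asarray(p_j, dtype=float)
    # Cosine of the angle between the two segments; epsilon avoids 0/0.
    cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-8)
    # Clip to the valid arccos domain to absorb floating-point drift.
    return float(np.arccos(np.clip(cos, -1.0, 1.0)))

# Example: a fully extended leg (hip, knee, ankle collinear) gives ~pi,
# while a right-angle bend gives pi/2.
knee_angle = joint_angle((0.0, 0.0), (0.0, 1.0), (0.0, 2.0))
```

Applied to triplets such as (hip, knee, ankle), this yields the 12 angle features per frame described above.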
3. Data Augmentation and Preprocessing
To mitigate sample loss from invalid or undetectable pose frames, SGNetPose+ augments the data by horizontally flipping the video frames. This duplicates the training set, preserves dynamic constraints, and makes the network invariant to left-right crossing maneuvers. For a frame of width W, x-coordinates are mapped to W − x under flipping, with bounding boxes and pose keypoints transformed synchronously (Ghiya et al., 11 Mar 2025).
This strategy yields a substantial increase in effective training-data size: 118/18/102 batches for the train/val/test splits on JAAD_pose, with similar expansion observed on the PIE dataset.
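The flip transform is straightforward to implement; the function below is a minimal sketch under assumed conventions (corner-format boxes, a (13, 2) keypoint array), not the paper's code:

```python
import numpy as np

def hflip_sample(bbox, keypoints, frame_width):
    """Horizontally flip one frame's annotations: x -> W - x.

    bbox: (x1, y1, x2, y2) in corner format; keypoints: (13, 2) array of
    (x, y) coordinates. Note that semantic left/right joint labels (e.g.
    left vs. right shoulder) should also be swapped after flipping.
    """
    x1, y1, x2, y2 = bbox
    # Flipping exchanges the left/right box edges, so reorder to keep x1 < x2.
    flipped_bbox = (frame_width - x2, y1, frame_width - x1, y2)
    kp = np.asarray(keypoints, dtype=float).copy()
    kp[:, 0] = frame_width - kp[:, 0]
    return flipped_bbox, kp
```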
4. Network Architecture: Dual-Encoder and Stepwise Goal-Attention
SGNetPose+ consists of parallel input encoders for bounding box and pose features, a stepwise goal-driven attention (SGE) module, a conditional variational autoencoder (CVAE) for multimodal latent trajectory modeling, and a recurrent decoder.
- Bounding-Box Encoder: Each frame's box b_t = (x_t, y_t, w_t, h_t) is linearly embedded into a latent feature e_t.
- Pose Encoder: The 2D keypoints are flattened and mapped via a fully connected layer with dropout to a latent pose feature p_t.
- Stepwise Goal-Driven Attention (SGE): At each time step t, the previously aggregated goal Ĝ_{t−1} is concatenated with the box embedding e_t and processed by recurrent units (GRUs) for the encoder (h_t^enc) and goal generator (h_t^goal). Goal aggregation employs trainable attention over the stepwise goal embeddings g_{t,i}: Ĝ_t = Σ_i α_{t,i} g_{t,i}, with α_{t,i} = softmax_i(wᵀ g_{t,i}) and w a learned attention vector.
- CVAE Goal Modeling: Captures uncertainty by inferring a latent variable z via both a recognition network q(z | X, Y) and a prior network p(z | X), where X is the observed trajectory and Y the future ground truth.
- Decoder: Sequentially regresses future trajectory points with GRU state updates and goal conditioning: h_{t+1}^dec = GRU([ŷ_t; Ĝ_t; z], h_t^dec), ŷ_{t+1} = f(h_{t+1}^dec).
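The goal-aggregation step can be illustrated with a small numpy sketch. The dot-product scoring function is an assumption for clarity (the actual scorer may be a learned MLP):

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1D score vector."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def aggregate_goals(goals, w):
    """Attention-weighted aggregation of stepwise goal embeddings.

    goals: (n, d) array of goal embeddings g_{t,i}; w: (d,) attention
    vector (trainable in the real model). Returns the aggregated goal
    feature that conditions the next encoder/decoder step.
    """
    scores = goals @ w        # one scalar score per stepwise goal
    alpha = softmax(scores)   # attention weights, summing to 1
    return alpha @ goals      # weighted sum: the aggregated goal
```

With equal scores the aggregation degenerates to a plain mean of the goals, which matches the intuition of attention as a learned, data-dependent averaging.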
5. Training Objectives and Loss Functions
SGNetPose+ is optimized end-to-end using a composite objective

L = L_traj + L_KLD,

where:
- L_traj is the future trajectory regression loss,
- L_KLD is the Kullback-Leibler divergence between the posterior and prior of the CVAE latent variable,
- the relative weighting of the two terms is fixed in practice (Ghiya et al., 11 Mar 2025).
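A minimal sketch of the two loss terms, assuming diagonal-Gaussian recognition and prior distributions (parameterized by mean and log-variance) and equal weighting; both choices are illustrative assumptions:

```python
import numpy as np

def kld_diag_gauss(mu_q, logvar_q, mu_p, logvar_p):
    """KL(q || p) between diagonal Gaussians, in closed form.

    q is the recognition (posterior) network's output, p the prior's.
    """
    var_q, var_p = np.exp(logvar_q), np.exp(logvar_p)
    return 0.5 * np.sum(
        logvar_p - logvar_q + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0
    )

def composite_loss(pred_traj, gt_traj, mu_q, logvar_q, mu_p, logvar_p):
    """L = L_traj + L_KLD; MSE regression and unit weighting assumed."""
    l_traj = np.mean((pred_traj - gt_traj) ** 2)
    return l_traj + kld_diag_gauss(mu_q, logvar_q, mu_p, logvar_p)
```

When the posterior matches the prior exactly, the KL term vanishes and only the regression error remains, which is the behavior the CVAE is trained toward at convergence.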
Optimization is performed with the Adam algorithm (batch size 16, dropout 0.5), with convergence in 30–40 epochs.
6. Experimental Evaluation and Results
SGNetPose+ is evaluated on the "JAAD_pose" and "PIE_pose" pedestrian datasets, both filtered for valid ViTPose extractions and augmented via flipping.
Performance is measured using:
- MSE(t): mean squared error of the predicted bounding box t frames ahead,
- FMSE: MSE at the final frame of the prediction horizon,
- CMSE/CFMSE: MSE of the bounding-box centroid over the horizon and at its final frame, respectively.
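These metrics can be sketched as follows; the corner-format boxes and the exact averaging conventions are assumptions, and the paper's implementation may normalize differently:

```python
import numpy as np

def mse_at(pred, gt, t):
    """MSE over bounding-box coordinates for the first t predicted frames.

    pred, gt: (T, 4) arrays of (x1, y1, x2, y2) per future frame.
    """
    return float(np.mean((pred[:t] - gt[:t]) ** 2))

def fmse(pred, gt):
    """MSE at the final predicted frame only."""
    return float(np.mean((pred[-1] - gt[-1]) ** 2))

def cmse(pred, gt):
    """Centroid MSE: squared distance between box centers, averaged over frames."""
    c_pred = (pred[:, :2] + pred[:, 2:]) / 2.0
    c_gt = (gt[:, :2] + gt[:, 2:]) / 2.0
    return float(np.mean(np.sum((c_pred - c_gt) ** 2, axis=1)))
```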
Quantitative improvements are evidenced as follows:
| Dataset | Model | MSE(15) | MSE(45/30) | FMSE | Relative Δ |
|---|---|---|---|---|---|
| JAAD_pose | SGNet bb | 68.46 | 384.25 | 1250.61 | – |
| JAAD_pose | SGNetPose+ | 62.62 | 347.07 | 1080.45 | –13.6% (FMSE) |
| PIE_pose | SGNet bb | 17.05 | 41.98 | — | — |
| PIE_pose | SGNetPose+ | 15.81 | 40.08 | — | –7.3% (MSE15) |
Empirical results confirm that the addition of pose information reduces both short-term and long-term predictive error for pedestrian trajectories (Ghiya et al., 11 Mar 2025).
7. Discussion, Limitations, and Extensions
SGNetPose+ demonstrates that integrating pose skeletons or body-segment angles provides fine-grained motion understanding and enhances goal consistency in stepwise sequential prediction. Its hierarchical coarse-to-fine structure, coupled with simple temporal data augmentation, offsets the challenges posed by missing or noisy pose detections.
Notably, limitations arise from the restriction to 2D poses—without depth—and dataset reduction due to filtered frames. Remedying these shortcomings may involve 3D skeleton inference, domain adaptation for noisy video, and integration with transformer-based or real-time onboard systems.
A plausible implication—drawing on insights from SGNet's point cloud registration domain (Wu et al., 2023)—is that semantic-aware graph encoders, intrinsic saliency priors, and high-order geometric consistency can further enhance SGNetPose+ extensions. These additions could support not only trajectory prediction but also the joint regression of full 6D object pose in complex multi-object environments.
8. Implementation Notes
SGNetPose+ is publicly released in a modular PyTorch implementation, relying on ViTPose for pose detection and standard vision libraries. The design separates bounding box and pose encoders, goal attention, CVAE, and recurrent decoder modules, supporting flexible experimentation and downstream application (Ghiya et al., 11 Mar 2025).