SGNetPose+: Pose-Driven Trajectory Forecasting
- The paper introduces a dual-encoder architecture with stepwise goal attention and a CVAE module to integrate pose and bounding-box data for trajectory forecasting.
- It employs ViTPose to extract 2D keypoints and computes key joint angles to capture nuanced pedestrian movements.
- Empirical evaluations on JAAD_pose and PIE_pose datasets demonstrate reduced forecasting errors and improved prediction metrics.
SGNetPose+ refers to a set of advanced stepwise goal-driven neural architectures that incorporate explicit pose information for structured prediction tasks—primarily pedestrian trajectory forecasting for autonomous driving, but also, in alternative formulations, for high-dimensional whole-body 3D pose estimation. The essential innovation of SGNetPose+ is to condition predictive modeling not only on bounding-box trajectories, but also on fine-grained pose cues such as joint coordinates or body-segment angles, using a modular design that encodes spatial–temporal dependencies via recurrent attention and semantic graph networks (Ghiya et al., 11 Mar 2025, Wen et al., 2024).
1. Architectural Foundation and Motivation
SGNetPose+ builds upon the original SGNet (Salient Geometric Network) framework, which was developed for geometric tasks such as point cloud registration (Wu et al., 2023). The stepwise goal-driven paradigm is adapted by SGNetPose+ for sequential pedestrian trajectory prediction, addressing the limitations of purely bounding-box-based models that lack contextual cues about pedestrians’ intent or imminent changes in movement.
Motivation arises from the observation that human locomotion is highly non-linear and context-dependent—critical for autonomous driving systems, which must anticipate rare but hazardous behaviors (e.g., abrupt crossing, acceleration, turning). SGNetPose+ augments the trajectory modeling pipeline through explicit encoding of 2D pose skeletons or joint angles, enabling it to leverage kinematic patterns such as knee flexion, stride initiation, or torso lean, which serve as predictive signals for future motion (Ghiya et al., 11 Mar 2025).
2. Pose Extraction and Representation
Pose extraction in SGNetPose+ is performed using a pre-trained ViTPose human-pose transformer to generate 13 two-dimensional keypoints per pedestrian per frame: nose, shoulders, elbows, wrists, hips, knees, and ankles. Each keypoint is a vector p_j = (x_j, y_j) ∈ ℝ² for j = 1, …, 13.
From these joints, 12 anatomically meaningful angles are computed using the formula

θ_j = arccos( ((p_i − p_j) · (p_k − p_j)) / (‖p_i − p_j‖ ‖p_k − p_j‖) ),

where θ_j is the angle at joint j formed by segments i–j and j–k. This parameterization encodes both pose and underlying body kinematics, as in the left knee angle (hip–knee–ankle) or torso orientation (hip–shoulder–contralateral hip) (Ghiya et al., 11 Mar 2025).
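The angle computation can be sketched in a few lines of numpy; the function name and the small epsilon guard against zero-length segments are illustrative choices, not from the paper:

```python
import numpy as np

def joint_angle(p_i, p_j, p_k):
    """Angle (radians) at joint j formed by segments i-j and j-k.

    Each argument is a 2D keypoint (x, y), e.g. one of the 13 ViTPose outputs.
    """
    v1 = np.asarray(p_i, dtype=float) - np.asarray(p_j, dtype=float)
    v2 = np.asarray(p_k, dtype=float) - np.asarray(p_j, dtype=float)
    # Cosine of the angle between the two segments; epsilon avoids 0/0.
    cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-8)
    # Clip to the valid arccos domain to absorb floating-point drift.
    return float(np.arccos(np.clip(cos, -1.0, 1.0)))

# Example: a fully extended leg (hip, knee, ankle collinear) gives ~pi,
# while a right-angle bend gives pi/2.
knee_angle = joint_angle((0.0, 0.0), (0.0, 1.0), (0.0, 2.0))
```

Applied to triplets such as (hip, knee, ankle), this yields the 12 angle features per frame described above.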
3. Data Augmentation and Preprocessing
To mitigate sample loss from invalid or undetectable pose frames, SGNetPose+ augments the data by horizontally flipping the video frames. This duplicates the training set, preserves dynamic constraints, and makes the network invariant to left-right crossing maneuvers. For a frame of width W, x-coordinates are mapped to W − x under flipping, with bounding boxes and pose keypoints transformed synchronously (Ghiya et al., 11 Mar 2025).
This strategy yields a substantial increase in effective training-data size: 118/18/102 batches for the train/val/test splits on JAAD_pose, with similar expansion observed on the PIE dataset.
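The flip transform is straightforward to implement; the function below is a minimal sketch under assumed conventions (corner-format boxes, a (13, 2) keypoint array), not the paper's code:

```python
import numpy as np

def hflip_sample(bbox, keypoints, frame_width):
    """Horizontally flip one frame's annotations: x -> W - x.

    bbox: (x1, y1, x2, y2) in corner format; keypoints: (13, 2) array of
    (x, y) coordinates. Note that semantic left/right joint labels (e.g.
    left vs. right shoulder) should also be swapped after flipping.
    """
    x1, y1, x2, y2 = bbox
    # Flipping exchanges the left/right box edges, so reorder to keep x1 < x2.
    flipped_bbox = (frame_width - x2, y1, frame_width - x1, y2)
    kp = np.asarray(keypoints, dtype=float).copy()
    kp[:, 0] = frame_width - kp[:, 0]
    return flipped_bbox, kp
```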
4. Network Architecture: Dual-Encoder and Stepwise Goal-Attention
SGNetPose+ consists of parallel input encoders for bounding box and pose features, a stepwise goal-driven attention (SGE) module, a conditional variational autoencoder (CVAE) for multimodal latent trajectory modeling, and a recurrent decoder.
- Bounding-Box Encoder: Each frame's box b_t = (x_t, y_t, w_t, h_t) is linearly embedded into a latent feature e_t.
- Pose Encoder: The 2D keypoints are flattened and mapped via a fully connected layer with dropout to a latent pose feature p_t.
- Stepwise Goal-Driven Attention (SGE): At each time step t, the previously aggregated goal Ĝ_{t−1} is concatenated with the box embedding e_t and processed by recurrent units (GRUs) for the encoder (h_t^enc) and goal generator (h_t^goal). Goal aggregation employs trainable attention over the stepwise goal embeddings g_{t,i}: Ĝ_t = Σ_i α_{t,i} g_{t,i}, with α_{t,i} = softmax_i(wᵀ g_{t,i}) and w a learned attention vector.
- CVAE Goal Modeling: Captures uncertainty by inferring a latent variable z via both a recognition network q(z | X, Y) and a prior network p(z | X), where X is the observed trajectory and Y the future ground truth.
- Decoder: Sequentially regresses future trajectory points with GRU state updates and goal conditioning: h_{t+1}^dec = GRU([ŷ_t; Ĝ_t; z], h_t^dec), ŷ_{t+1} = f(h_{t+1}^dec).
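The goal-aggregation step can be illustrated with a small numpy sketch. The dot-product scoring function is an assumption for clarity (the actual scorer may be a learned MLP):

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1D score vector."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def aggregate_goals(goals, w):
    """Attention-weighted aggregation of stepwise goal embeddings.

    goals: (n, d) array of goal embeddings g_{t,i}; w: (d,) attention
    vector (trainable in the real model). Returns the aggregated goal
    feature that conditions the next encoder/decoder step.
    """
    scores = goals @ w        # one scalar score per stepwise goal
    alpha = softmax(scores)   # attention weights, summing to 1
    return alpha @ goals      # weighted sum: the aggregated goal
```

With equal scores the aggregation degenerates to a plain mean of the goals, which matches the intuition of attention as a learned, data-dependent averaging.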
5. Training Objectives and Loss Functions
SGNetPose+ is optimized end-to-end using a composite objective

L = L_traj + L_KLD,

where:
- L_traj is the future trajectory regression loss,
- L_KLD is the Kullback-Leibler divergence between the posterior and prior of the CVAE latent variable,
- the relative weighting of the two terms is fixed in practice (Ghiya et al., 11 Mar 2025).
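A minimal sketch of the two loss terms, assuming diagonal-Gaussian recognition and prior distributions (parameterized by mean and log-variance) and equal weighting; both choices are illustrative assumptions:

```python
import numpy as np

def kld_diag_gauss(mu_q, logvar_q, mu_p, logvar_p):
    """KL(q || p) between diagonal Gaussians, in closed form.

    q is the recognition (posterior) network's output, p the prior's.
    """
    var_q, var_p = np.exp(logvar_q), np.exp(logvar_p)
    return 0.5 * np.sum(
        logvar_p - logvar_q + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0
    )

def composite_loss(pred_traj, gt_traj, mu_q, logvar_q, mu_p, logvar_p):
    """L = L_traj + L_KLD; MSE regression and unit weighting assumed."""
    l_traj = np.mean((pred_traj - gt_traj) ** 2)
    return l_traj + kld_diag_gauss(mu_q, logvar_q, mu_p, logvar_p)
```

When the posterior matches the prior exactly, the KL term vanishes and only the regression error remains, which is the behavior the CVAE is trained toward at convergence.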
Optimization is performed with the Adam algorithm (batch size 16, dropout 0.5), with convergence in 30–40 epochs.
6. Experimental Evaluation and Results
SGNetPose+ is evaluated on the "JAAD_pose" and "PIE_pose" pedestrian datasets, both filtered for valid ViTPose extractions and augmented via flipping.
Performance is measured using:
- MSE(t): mean squared error of the predicted bounding box t frames ahead,
- FMSE: MSE at the final frame of the prediction horizon,
- CMSE/CFMSE: MSE of the bounding-box centroid over the horizon and at its final frame, respectively.
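These metrics can be sketched as follows; the corner-format boxes and the exact averaging conventions are assumptions, and the paper's implementation may normalize differently:

```python
import numpy as np

def mse_at(pred, gt, t):
    """MSE over bounding-box coordinates for the first t predicted frames.

    pred, gt: (T, 4) arrays of (x1, y1, x2, y2) per future frame.
    """
    return float(np.mean((pred[:t] - gt[:t]) ** 2))

def fmse(pred, gt):
    """MSE at the final predicted frame only."""
    return float(np.mean((pred[-1] - gt[-1]) ** 2))

def cmse(pred, gt):
    """Centroid MSE: squared distance between box centers, averaged over frames."""
    c_pred = (pred[:, :2] + pred[:, 2:]) / 2.0
    c_gt = (gt[:, :2] + gt[:, 2:]) / 2.0
    return float(np.mean(np.sum((c_pred - c_gt) ** 2, axis=1)))
```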
Quantitative improvements are evidenced as follows:
| Dataset | Model | MSE(15) | MSE(45/30) | FMSE | Relative Δ |
|---|---|---|---|---|---|
| JAAD_pose | SGNet bb | 68.46 | 384.25 | 1250.61 | – |
| JAAD_pose | SGNetPose+ | 62.62 | 347.07 | 1080.45 | –13.6% (FMSE) |
| PIE_pose | SGNet bb | 17.05 | 41.98 | — | — |
| PIE_pose | SGNetPose+ | 15.81 | 40.08 | — | –7.3% (MSE15) |
Empirical results confirm that the addition of pose information reduces both short-term and long-term predictive error for pedestrian trajectories (Ghiya et al., 11 Mar 2025).
7. Discussion, Limitations, and Extensions
SGNetPose+ demonstrates that integrating pose skeletons or body-segment angles provides fine-grained motion understanding and enhances goal consistency in stepwise sequential prediction. Its hierarchical coarse-to-fine structure, coupled with simple temporal data augmentation, offsets the challenges posed by missing or noisy pose detections.
Notably, limitations arise from the restriction to 2D poses—without depth—and dataset reduction due to filtered frames. Remedying these shortcomings may involve 3D skeleton inference, domain adaptation for noisy video, and integration with transformer-based or real-time onboard systems.
A plausible implication—drawing on insights from SGNet's point cloud registration domain (Wu et al., 2023)—is that semantic-aware graph encoders, intrinsic saliency priors, and high-order geometric consistency can further enhance SGNetPose+ extensions. These additions could support not only trajectory prediction but also the joint regression of full 6D object pose in complex multi-object environments.
8. Implementation Notes
SGNetPose+ is publicly released in a modular PyTorch implementation, relying on ViTPose for pose detection and standard vision libraries. The design separates bounding box and pose encoders, goal attention, CVAE, and recurrent decoder modules, supporting flexible experimentation and downstream application (Ghiya et al., 11 Mar 2025).