3D Social Pose Features in Social AI

Updated 10 November 2025

3D social pose features are compact, physically interpretable descriptors derived from key 3D joint positions, capturing head location and gaze direction to enable social interaction analysis.
They are extracted using robust methods such as keypoint recovery, depth correction, and temporal aggregation, ensuring precise feature construction from full skeletal data.
Integration with transformer-based encoders and LSTM social pooling models significantly improves trajectory forecasting and social scene understanding by reducing prediction errors.

3D social pose features are compact, physically interpretable descriptors of human body posture and orientation in three-dimensional space that robustly support the analysis, prediction, and generation of socially interactive behaviors. Unlike generic 3D skeleton vectors, social pose features emphasize spatial relations and orientation cues that underlie social scene understanding, facilitate accurate trajectory forecasting, and enable effective machine judgment of social interactions. Recent evidence demonstrates both the necessity and sufficiency of these low-dimensional spatial primitives—most notably, head-face 3D position and gaze direction—across tasks in human and machine social cognition.

1. Mathematical Definitions and Formulations

The formalization of 3D social pose features relies on a minimal set of position and orientation vectors derived from full multi-joint skeletons. For an individual $i$ :

Full 3D joint vector:

$J_i = \{p_{i,k} \in \mathbb{R}^3: k=1\ldots 45\}$

where $p_{i,k}$ are the 3D coordinates of anatomical landmarks such as eyes, nose, neck, pelvis, etc.

Head-center position:

$p_i = \frac{p_{i,\mathrm{eyeL}} + p_{i,\mathrm{eyeR}}}{2} \in \mathbb{R}^3$

Gaze (face-orientation) vector:

Compute:

$u^1_i = \frac{p_{i,\mathrm{nose}} - p_i}{\|p_{i,\mathrm{nose}} - p_i\|}$

$u^2_i = \frac{p_{i,\mathrm{nose}} - p_{i,\mathrm{neck}}}{\|p_{i,\mathrm{nose}} - p_{i,\mathrm{neck}}\|}$

Then average and renormalize:

$v_i = \frac{u^1_i + u^2_i}{\|u^1_i + u^2_i\|} \in \mathbb{R}^3$

Dyadic features for a pair $(1,2)$ $(1, 2)$ :
- Inter-agent distance:
$d_{12} = \|p_1 - p_2\|$ - Face-to-face angle:

$\theta_{12} = \arccos(v_1 \cdot v_2)$

and

$\theta_{i \rightarrow j} = \arccos\left(v_i \cdot \frac{p_j - p_i}{\|p_j - p_i\|}\right)$

The full set per dyad consists of $(x, y, z)$ of $p_i$ plus $(dx, dy, dz)$ of $v_i$ for both entities—a 12-dimensional vector.

2. Extraction and Embedding Pipelines

Extraction of 3D social pose features proceeds via several established steps:

Body and Head Keypoint Recovery: Algorithms such as “4D Humans” (ViT-based HMR 2.0) are run on video frames to regress 3D skeletons (typically SMPL-X parameterization with 45 joints per human) (Qin et al., 6 Nov 2025).
Depth Correction: To address adult-biased priors (e.g., for younger or occluded individuals), global $z$ -translation is refined using monocular BEV depth estimation (Qin et al., 6 Nov 2025).
Temporal Aggregation: For each video, keypoints are temporally averaged when appropriate.
Feature Construction: Head centers and gaze vectors are calculated via the above formulas, centered/oriented in a physically meaningful coordinate system.
Dimensionality Reduction: When integrating with high-dimensional DNN vision embeddings, features may be SRP-projected (e.g., down to $D=4\,732$ features for DNNs versus a fixed 12-dimension for social pose descriptors) (Qin et al., 6 Nov 2025).
Model Fusion: For downstream prediction or fusion tasks, social pose features are concatenated or input as an additional data stream alongside learned visual representations.

For trajectory forecasting tasks, pose vectors are typically constructed in a local frame centered at the pelvis: $\mathbf{x}_{\text{pose},i}^t = \left[x_{i,1}^t - x_{i,p}^t,\, y_{i,1}^t - y_{i,p}^t,\, z_{i,1}^t - z_{i,p}^t,\, \dots,\, x_{i,J}^t - x_{i,p}^t,\, y_{i,J}^t - y_{i,p}^t,\, z_{i,J}^t - z_{i,p}^t\right]^{\top} \in \mathbb{R}^{3J}$ where $(x_{i,p}^t, y_{i,p}^t, z_{i,p}^t)$ is the pelvis joint at time $t$ (Gao et al., 30 Jul 2025).

Integration of 3D social pose features into computational models employs both dedicated geometric encoders and multimodal architectures:

Transformer-Based Encoders: The Social-Pose encoder maps per-frame, per-agent pose features $\mathbf{x}_{\text{pose},i}^t$ into hidden states using linear embedding, sinusoidal temporal positional encoding, and stacks of multi-head self-attention blocks (Gao et al., 30 Jul 2025).
Interaction Encoders: After fusing a trajectory encoding $H_{\mathrm{traj},i}$ with a pose encoding $H_{\mathrm{pose},i}$ via concatenation, a set $\{H_i\}_{i=1}^n$ is passed into an interaction encoder:
- Transformer-style spatial self-attention (as in Autobots): updates each agent’s code based on attention to all others.
- LSTM-based social pooling (as in Social-LSTM / Social-GAN): aggregates hidden states of agents in a spatial grid (Gao et al., 30 Jul 2025).
Latent Diffusion Models: For wearable ego-pose estimation, SEE-ME injects social cues not via explicit pairwise metrics (e.g., distance, gaze angle), but by conditioning the latent denoiser directly on the full interactee’s pose embedding, $E_p(P^i)$ , as well as on the 3D scene encoding (Scofano et al., 7 Nov 2024).
Feature Fusion with Vision Backbones: When evaluating human social judgment, the 12-dimensional 3D social pose features are fused with DNN embeddings (e.g., via concatenation prior to ridge regression) (Qin et al., 6 Nov 2025).

4. Key Empirical Findings and Comparative Analyses

Empirical studies consistently show substantial benefits of incorporating explicit 3D social pose features:

Trajectory Forecasting: Addition of a 3D pose encoder reduces average displacement error (ADE) and final displacement error (FDE) by 25% relative to trajectory-only baselines (e.g., from 1.20/2.70m to 0.90/1.91m on JTA), outperforming the use of only 2D joint data (≈15% lower ADE when upgrading from 2D to 3D) (Gao et al., 30 Jul 2025).
Social Scene Understanding: Explicit 3D social pose features alone yield Pearson correlations with human social-judgment ratings on par with or above those of state-of-the-art vision DNNs, particularly for interaction-related attributes (e.g., “agents facing”: $r\approx 0.80$ for 3D pose vs. $r\approx 0.57$ for vision DNNs) (Qin et al., 6 Nov 2025).
Feature Sufficiency: The 12-dimensional head center and facing vector descriptors capture essentially all variance explained by full 3D joint vectors for social ratings (semi-partial $r\approx 0.07$ –0.24 residual) (Qin et al., 6 Nov 2025).
Model Synergy: Augmenting high-dimensional learned embeddings with the 3D social pose features increases test-set prediction accuracy for all social judgment dimensions, with improvements in Pearson $r$ up to $+0.29$ for “agents facing” (Qin et al., 6 Nov 2025).
Pose Conditioning in Egocentric Estimation: Conditioning on the interactee’s pose embedding via SEE-ME leads to a 53% reduction in mean per-joint position error (MPJPE), with the strongest gains for close proximity ( $d_t < 1$ m: MPJPE 119mm vs. 126mm overall; translation and acceleration errors also decrease) and mutual gaze ( $\phi_t < 30^{\circ}$ : MPJPE 117mm vs. 126mm) (Scofano et al., 7 Nov 2024).

Interpretability of 3D social pose features is enhanced through analysis of physically meaningful social cues:

Feature	Mathematical Formulation	Empirical Effect (MPJPE reduction)
Interpersonal Distance	$d_t = \\|r_t^w - r_t^i\\|_2$	$d_t<1$ m: MPJPE 119mm vs. 126mm (6% gain)
Mutual Gaze Direction	$\phi_t = \arccos\left(\frac{\langle g_t^w, g_t^i \rangle}{\\|g_t^w\\|\\|g_t^i\\|}\right)$	$\phi_t<30^{\circ}$ : MPJPE 117mm (7% gain)
Future Interactee Mot.	$P^i_{\text{future}} = \{p_{t+\Delta}^i\}_{t=1}^F$ , $\Delta>0$	Future pose: MPJPE 123mm vs. 126mm (2% gain)

Although these scalar cues (distance, gaze angle, anticipation) inform ablation protocols, most high-performing models consume the full latent pose (e.g., interactee’s SMPL embedding or per-joint vectors), integrating the relevant relational geometry implicitly (Scofano et al., 7 Nov 2024, Gao et al., 30 Jul 2025).

6. Implementation Considerations and Limitations

Several aspects influence the practical deployment and interpretation of 3D social pose feature models:

Input Dimensionality: Raw 3D poses typically have dimensionality $3J$ ( $J=17$ –$45$, depending on dataset or recovery algorithm), while social pose features minimize redundancy (12-dimension for dyads) (Qin et al., 6 Nov 2025, Gao et al., 30 Jul 2025).
Coordinate Frames: Local normalization (e.g., pose relative to pelvis) improves invariance to global motion (Gao et al., 30 Jul 2025).
Robustness to Noise: When pose input is corrupted with Gaussian noise at train-time (std=0.1) and inference (up to std=0.5), pose-based models maintain $\geq$ 15% improvement over trajectory-only baselines (Gao et al., 30 Jul 2025).
Computational Cost: Extraction of 3D pose information, especially via keypoint-overlaid skeleton images, adds a 8–19% time overhead to inference in generative diffusion models but reduces MPJPE by up to 55% (Martin-Ozimek et al., 18 Jan 2025).
Dataset Constraints: Supervision and evaluation require accurate 3D joint localization, which can be challenging under occlusion or in egocentric views (Scofano et al., 7 Nov 2024, Qin et al., 6 Nov 2025).

A plausible implication is that reducing the reliance on high-resolution image-based conditioning in favor of structured 3D features can yield both performance and efficiency gains, provided reliable joint recovery is feasible.

7. Theoretical and Cognitive Context

The empirical findings converge with cognitive theories proposing that social vision—both human and machine—relies fundamentally on explicit geometric reasoning involving position and orientation, rather than high-dimensional or opaque learned embeddings. Minimal 3D pose-based representations are not only sufficient for a broad array of social perception tasks, but also provide explainable signals for engineering, interpretation, and model integration. This suggests future advancements in social AI may depend less on deep feature expansion and more on principled geometric abstraction and reasoning.

In summary, 3D social pose features operationalize spatial primitives critical for social interaction modeling. They demonstrably enhance both model-based forecasting and machine-powered social understanding, align closely with human judgment, and provide a mathematically concise yet interpretable substrate for next-generation vision and interaction systems (Gao et al., 30 Jul 2025, Scofano et al., 7 Nov 2024, Martin-Ozimek et al., 18 Jan 2025, Qin et al., 6 Nov 2025).