Motion-Adaptive Multi-Scale Temporal Modelling with Skeleton-Constrained Spatial Graphs for Efficient 3D Human Pose Estimation

Published 4 Apr 2026 in cs.CV | (2604.03652v1)

Abstract: Accurate 3D human pose estimation from monocular videos requires effective modelling of complex spatial and temporal dependencies. However, existing methods often face challenges in efficiency and adaptability when modelling spatial and temporal dependencies, particularly under dense attention or fixed modelling schemes. In this work, we propose MASC-Pose, a Motion-Adaptive multi-scale temporal modelling framework with Skeleton-Constrained spatial graphs for efficient 3D human pose estimation. Specifically, it introduces an Adaptive Multi-scale Temporal Modelling (AMTM) module to adaptively capture heterogeneous motion dynamics at different temporal scales, together with a Skeleton-constrained Adaptive GCN (SAGCN) for joint-specific spatial interaction modelling. By jointly enabling adaptive temporal reasoning and efficient spatial aggregation, our method achieves strong accuracy with high computational efficiency. Extensive experiments on Human3.6M and MPI-INF-3DHP datasets demonstrate the effectiveness of our approach.

Abstract PDF Upgrade to Chat

Authors (5)

Summary

The paper presents MASC-Pose, which integrates adaptive multi-scale temporal modeling with skeleton-constrained spatial graphs to improve accuracy and reduce computational cost.
It leverages an AMTM module for capturing variable temporal dependencies and a SAGCN for joint-specific spatial message passing, enhancing model adaptivity.
Experimental results on benchmarks like Human3.6M and MPI-INF-3DHP demonstrate superior performance in MPJPE and efficiency compared to prior methods.

Motion-Adaptive Multi-Scale Temporal Modelling with Skeleton-Constrained Spatial Graphs for Efficient 3D Human Pose Estimation

Introduction

This work introduces MASC-Pose, an efficient framework for 3D human pose estimation from monocular 2D pose sequences, combining adaptive multi-scale temporal modeling with skeleton-constrained spatial graphs. The method directly addresses key limitations of prior Transformer-based and graph-based approaches in spatial and temporal dependency modeling—specifically, the inefficiency and lack of adaptability in handling diverse and heterogeneous motion dynamics across both long- and short-term temporal contexts, as well as the limited flexibility in modeling joint-specific spatial interactions. The hybrid architecture leverages an Adaptive Multi-scale Temporal Modelling (AMTM) module to capture variable temporal dependencies and a Skeleton-constrained Adaptive GCN (SAGCN) to enable joint-specific spatial message passing.

Previous methods for 3D human pose estimation, such as PoseFormer, MixSTE, and TCPFormer, have primarily used full-sequence or fixed-scale temporal attention, often employing dense Transformer-based architectures with limited inductive bias and high computational complexity. Meanwhile, graph-based methods utilizing skeletal priors have shown improvements in spatial structure modeling but typically rely on fixed aggregation schemes that do not optimally balance self-information and neighboring features for all joints.

The central insight motivating MASC-Pose is that human motions possess inherently multi-scale and heterogeneous characteristics: actions like walking and running require long-term context for periodicity, whereas hand gestures and transitions demand short, fine-grained modeling. Similarly, different joints demonstrate variegated spatial relationships depending on the action and pose.

Figure 1: Comparison of temporal dependency modeling strategies and efficiency–accuracy trade-offs, with the proposed multi-scale method (c) providing improved accuracy at reduced computational cost.

Methodology

Skeleton-Constrained Adaptive GCN (SAGCN)

SAGCN constitutes the backbone for spatial feature encoding. This component replaces uniform aggregation by decoupling self-feature transformation and neighbor aggregation via skeleton-derived adjacency, enabling adaptivity in the propagation of information for each joint:

Self and Neighbor Balancing: Learnable weights, normalized via softmax, are introduced to control the relative influence of self-features and neighbor features for every joint at each layer.
Layerwise Aggregation: Features are independently propagated for each joint, with neighbor information being filtered through the skeleton topology, preventing over-smoothing.
Optimization: Ablation analyzes confirm that using a single hop ( $K=1$ ) in spatial GCN layers is optimal, avoiding the detrimental effects of extended aggregation.
Figure 2: Schematic of MASC-Pose, highlighting (a) SAGCN for joint-specific spatial modeling and (b) AMTM for multi-scale temporal modeling.

Adaptive Multi-scale Temporal Modelling (AMTM)

AMTM addresses temporal modeling via the following mechanisms:

Multi-scale Temporal Windows: Three temporal branches process short (9 frames), medium (27 frames), and long (81 frames) temporal windows in parallel, capturing both rapid transitions and long-range dependencies.
Soft Routing via Mixture-of-Experts Paradigm: A lightweight global average pool + MLP predicts a softmax-weighted combination of the different temporal branches for each instance, leading to adaptive focus on relevant window sizes.
Sparse Temporal Graph Convolution (STGC): Efficient temporal dependencies are modeled using sparse temporal adjacency matrices computed via top- $k$ similarity, supporting scalable complexity.
Progressive Layer Fusion: A softmax-based fusion module combines multi-scale and original features and aggregates across multiple layers, improving representational granularity and gradient propagation.

Experimental Results and Analysis

Quantitative Performance

Extensive experiments conducted on Human3.6M and MPI-INF-3DHP validate the approach. MASC-Pose attains state-of-the-art performance in terms of MPJPE (Mean Per Joint Position Error), while requiring fewer parameters and lower computational cost (MACs/frame) compared to contemporary Transformer-based and hybrid models, including TCPFormer, KTPFormer, and MotionAGFormer. On Human3.6M, the method achieves an MPJPE of 37.8 mm, outperforming other methods at significantly reduced MACs/frame.

Figure 1: Efficiency–accuracy plot indicating superiority of MASC-Pose in the efficiency regime.

In the MPI-INF-3DHP benchmark, MASC-Pose matches or exceeds best-in-class scores for PCK and AUC, maintaining robust generalization across diverse motion types.

Ablations and Adaptivity

Ablation studies substantiate the importance of both the AMTM and SAGCN modules, with adaptive multi-scale temporal aggregation contributing the largest accuracy gain. Notably, adaptive weight learning within AMTM surpasses fixed-weights or non-adaptive configurations. The optimal temporal scale configuration ( $[9,27,81]$ ) balances the capture of both short- and long-term dynamics. Reducing the spatial propagation range to direct neighbors ( $K=1$ ) within SAGCN avoids performance loss due to over-smoothing.

Qualitative Comparisons

Visual analysis on in-the-wild and Human3.6M sequences reveals increased stability and anatomical plausibility of MASC-Pose output relative to TCPFormer and MotionAGFormer in challenging frames affected by ambiguous or noisy 2D detections.

Figure 3: Output comparison shows superior robustness to input errors and ambiguity for MASC-Pose compared to prior methods in unconstrained videos.

Figure 4: On canonical Human3.6M actions, predicted joint locations by MASC-Pose better match ground-truth especially in difficult poses.

Temporal Scale Adaptivity

Analysis of learned temporal scale weights demonstrates action- and joint-specific specialization. For example, repetitive actions (Walk) afford higher weights on long-term branches for lower body, while transitional actions (SittingDown) emphasize short/medium-term information across all groups.

Figure 5: Learned scale weights (short/medium/long) over joint groups for different actions demonstrate dynamic adaptation at both motion and anatomical granularity.

Practical and Theoretical Implications

MASC-Pose demonstrates that joint adaptivity in both spatial and temporal domains can be realized efficiently with a hybrid network that leverages domain priors (skeleton graphs) and context-aware soft-routing (multi-scale temporal branches). The architecture provides a blueprint for future work aiming to model dynamic structure in vision tasks involving articulated objects or multi-scale dependencies.

Potential practical extensions include application to real-time 3D pose estimation in human–computer interaction and virtual reality, as well as further generalization to out-of-distribution motion and cross-dataset settings. The modular, lightweight design supports deployment in resource-constrained environments.

Conclusion

MASC-Pose yields a compelling architectural innovation for accurate and efficient 3D human pose estimation. By integrating adaptive, context-sensitive multi-scale temporal modeling with skeleton-constrained spatial graph convolutions, the approach outperforms previous methods in both accuracy and resource efficiency. Future research can extend this framework toward more flexible window partitioning, overlapping temporal scales, and real-world, cross-dataset generalization.

Markdown Report Issue