
Geometry-Aware Lifting Module for 3D HMR

Updated 5 February 2026
  • Geometry-Aware Lifting Module is a technique that lifts 2D joints and image features to coherent, kinematically consistent 3D pose sequences using state-space models.
  • It employs a dual-scan design that fuses global temporal context with local kinematic structure to ensure robust spatial-temporal feature alignment.
  • Empirical evaluations demonstrate state-of-the-art improvements in MPJPE, PA-MPJPE, and acceleration errors compared to traditional transformer or RNN-based HMR methods.

HMRMamba is a state-space model (SSM) based architecture for video-based 3D human mesh recovery (HMR), specifically designed to address limitations of prior approaches in temporal coherence, long-range dependency capture, and physically plausible 3D reconstruction. Unlike conventional transformer or RNN-based HMR pipelines, HMRMamba leverages Mamba-parameterized SSMs to enable efficient, long-range temporal modeling, and introduces a geometry-aware dual-scan mechanism for robust pose-anchored mesh reconstruction. The framework is empirically validated as state-of-the-art on major HMR benchmarks (Chen et al., 29 Jan 2026).

1. Architectural Principles and Design

HMRMamba consists of a two-stage pipeline: (1) a Geometry-Aware Lifting Module that lifts per-frame 2D joints and deep image features to a reliable, kinematically consistent 3D pose anchor sequence; (2) a Motion-Guided Reconstruction Network that regresses full 3D body meshes from these pose anchors, explicitly incorporating both short- and long-range motion. Both stages employ continuous-time SSMs with Mamba parameterizations, solved via zero-order-hold (ZOH) discretization:

$$\frac{dh(t)}{dt} = A h(t) + B x(t), \qquad y(t) = C h(t)$$

becoming

$$h_t = \bar{A} h_{t-1} + \bar{B} x_t, \qquad y_t = C h_t$$

where $(A, B, C)$ are the state, input, and output matrices, respectively, and $\bar{A}$, $\bar{B}$ are their ZOH-discretized counterparts. The Mamba parameterization allows this recurrence to be computed efficiently as a global 1D convolution over the sequence.
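
The discretization and recurrence above can be sketched in a few lines of numpy. This is an illustrative toy, not the paper's implementation: $A$ is taken to be diagonal (as in Mamba-style SSMs), so $\exp(\Delta A)$ is elementwise, and all dimensions and function names are invented for the example.

```python
import numpy as np

def discretize_zoh(A_diag, B, dt):
    """ZOH discretization for a diagonal state matrix:
    A_bar = exp(dt*A), B_bar = (exp(dt*A) - 1) / A * B (elementwise in the state dim)."""
    A_bar = np.exp(dt * A_diag)                       # (d_state,)
    B_bar = ((A_bar - 1.0) / A_diag)[:, None] * B     # (d_state, d_in)
    return A_bar, B_bar

def ssm_scan(A_bar, B_bar, C, xs):
    """Run the discrete recurrence h_t = A_bar h_{t-1} + B_bar x_t, y_t = C h_t."""
    h = np.zeros(A_bar.shape[0])
    ys = []
    for x in xs:                                      # xs: (T, d_in)
        h = A_bar * h + B_bar @ x                     # diagonal A => elementwise product
        ys.append(C @ h)
    return np.stack(ys)                               # (T, d_out)

rng = np.random.default_rng(0)
A = -np.abs(rng.standard_normal(8)) - 0.1             # stable (negative) diagonal entries
B = rng.standard_normal((8, 4))
C = rng.standard_normal((2, 8))
A_bar, B_bar = discretize_zoh(A, B, dt=0.1)
y = ssm_scan(A_bar, B_bar, C, rng.standard_normal((16, 4)))
print(y.shape)  # (16, 2)
```

For constant $(\bar{A}, \bar{B}, C)$ this sequential scan is equivalent to a single long 1D convolution; Mamba makes the parameters input-dependent while keeping a comparable cost via a hardware-aware scan.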

A unique dual-scan architecture is employed within the SSM blocks: one scan follows the global linear joint or time index; the other is a local traversal (e.g., bone chain in the skeleton tree) to encode biomechanical priors. The outputs are fused multiplicatively after nonlinear transformation.

2. Geometry-Aware Lifting Module

The lifting module generates a stable 3D joint sequence $P_{3D} \in \mathbb{R}^{T \times J \times 3}$ from 2D input ($P_{2D}$, $F_{img}$) via a Spatial-Temporal Alignment (STA)-Mamba block:

  • Spatial Encoding: The fused per-joint feature vector is processed by a spatial Mamba block, aligning intra-frame structure.
  • Deformable Alignment: Deformable attention samples image features at learned offsets, refining spatial encoding for each joint through a weighted multi-head aggregation.
  • Temporal Modeling: Temporal Mamba block propagates context through the sequence, furnishing long-range dependencies necessary for coherent tracking over time.
  • Dual-Scan Block: Each Mamba layer performs both a global scan (linear in joint or time dimension) and a local scan (kinematic tree order), fusing outputs as

$$O_{\text{fused}} = \sigma(\mathrm{Conv1D}(O_{\text{global}})) \odot O_{\text{local}}, \qquad \sigma = \mathrm{SiLU}$$

yielding pose features informed by both global context and local kinematic structure.
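
The dual-scan fusion can be sketched as follows. Everything here is an illustrative stand-in: `toy_scan` (a causal running mean) substitutes for a real Mamba scan, the random `tree_order` permutation substitutes for the actual kinematic-tree traversal, and the depthwise convolution plays the role of the `Conv1D` in the fusion formula.

```python
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))

def toy_scan(x):
    """Placeholder causal scan (running mean) along the joint axis; x: (B, J, C)."""
    t = np.arange(1, x.shape[1] + 1).reshape(1, -1, 1)
    return np.cumsum(x, axis=1) / t

def depthwise_conv1d(x, kernel):
    """'Same'-padded depthwise 1D convolution over the joint axis; kernel: (K, C)."""
    K = kernel.shape[0]
    pad = K // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (0, 0)))
    out = np.zeros_like(x)
    for i in range(K):
        out += xp[:, i:i + x.shape[1], :] * kernel[i]
    return out

def dual_scan_fuse(feats, tree_order, kernel):
    """O_fused = SiLU(Conv1D(O_global)) ⊙ O_local, feats: (B, J, C)."""
    o_global = toy_scan(feats)                          # scan in linear joint order
    inv = np.argsort(tree_order)                        # inverse permutation
    o_local = toy_scan(feats[:, tree_order])[:, inv]    # scan in tree order, undo permutation
    return silu(depthwise_conv1d(o_global, kernel)) * o_local

rng = np.random.default_rng(0)
feats = rng.standard_normal((2, 17, 32))                # batch of 2, 17 joints, 32 channels
tree_order = rng.permutation(17)                        # stand-in for a bone-chain traversal
kernel = rng.standard_normal((3, 32)) * 0.1
fused = dual_scan_fuse(feats, tree_order, kernel)
print(fused.shape)  # (2, 17, 32)
```

The design choice to fuse multiplicatively (a SiLU-gated Hadamard product rather than addition) lets the globally scanned features gate the locally scanned ones, so biomechanical structure modulates, rather than merely adds to, the global context.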

3. Motion-Guided Reconstruction Network

The reconstruction network synthesizes mesh vertex sequences $V_{\text{mesh}} \in \mathbb{R}^{T \times 6890 \times 3}$ using explicit and implicit motion cues derived from $P_{3D}$:

  • Motion Encoding: Explicit motion is the finite difference $M_{\mathrm{exp}}^{(t)} = P_{3D}^{(t)} - P_{3D}^{(t-1)}$; implicit motion is a learnable embedding added to image features.
  • Motion-Aware Attention: Mesh features are computed via attention, with queries from image features and keys/values from projected explicit and implicit motion: $F_{\mathrm{motion}} = \mathrm{softmax}\!\left(\frac{Q K^T}{\sqrt{d_k}}\right) V$
  • Regression Head: A terminal MLP regresses the full set of mesh vertices from these temporally enhanced representations.
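
The motion-encoding and attention steps above can be sketched in numpy. Assumptions to note: the concatenation of explicit and implicit motion before projection, the projection matrices `Wq`/`Wk`/`Wv`, single-head attention, and all dimensions are illustrative choices, not details given in the source.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def motion_aware_attention(f_img, p3d, m_implicit, Wq, Wk, Wv):
    """f_img: (T, d) image features, p3d: (T, J, 3) pose anchors,
    m_implicit: (T, d) learnable motion embedding. Returns (T, d)."""
    # Explicit motion: frame-to-frame finite difference of the 3D pose anchors
    # (first frame padded so the difference there is zero).
    m_exp = np.diff(p3d, axis=0, prepend=p3d[:1]).reshape(p3d.shape[0], -1)  # (T, J*3)
    motion = np.concatenate([m_exp, m_implicit], axis=-1)                    # (T, J*3 + d)
    Q, K, V = f_img @ Wq, motion @ Wk, motion @ Wv
    attn = softmax(Q @ K.T / np.sqrt(K.shape[-1]))                           # (T, T)
    return attn @ V

T, J, d = 16, 17, 64
rng = np.random.default_rng(0)
out = motion_aware_attention(
    rng.standard_normal((T, d)),          # image features (queries)
    rng.standard_normal((T, J, 3)),       # 3D pose anchors
    rng.standard_normal((T, d)),          # implicit motion embedding
    rng.standard_normal((d, d)),          # Wq
    rng.standard_normal((J * 3 + d, d)),  # Wk
    rng.standard_normal((J * 3 + d, d)))  # Wv
print(out.shape)  # (16, 64)
```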

4. Optimization and Loss Functions

Two complementary objectives are used:

  1. Pose-Lifting Loss:

$$\mathcal{L}_{\mathrm{pose}} = \mathcal{L}_{3D} + \lambda_t \mathcal{L}_t + \lambda_m \mathcal{L}_m + \lambda_{2D} \mathcal{L}_{2D}$$

where $\mathcal{L}_{3D}$ is the standard MPJPE loss, $\mathcal{L}_t$ a temporal-consistency term, $\mathcal{L}_m$ a velocity error (MPJVE), and $\mathcal{L}_{2D}$ the 2D joint reprojection loss. Default values for $(\lambda_t, \lambda_m, \lambda_{2D})$ are $(0.5, 20, 0.5)$.

  2. Mesh Loss:

$$\mathcal{L}_{\mathrm{mesh}} = \lambda_m \mathcal{L}_{\mathrm{meshV}} + \lambda_j \mathcal{L}_{\mathrm{joint3D}} + \lambda_n \mathcal{L}_{\mathrm{normal}} + \lambda_e \mathcal{L}_{\mathrm{edge}}$$

with respective weights $(\lambda_m, \lambda_j, \lambda_n, \lambda_e) = (1, 1, 0.1, 20)$.

No explicit bone-length or surface regularization is mandated; geometric consistency is predominantly enforced through architecture and these losses.
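
As a quick sanity check of how the weights combine, the two objectives reduce to simple weighted sums. The individual loss terms below are mocked as scalars; only the default weights come from the text.

```python
def pose_lifting_loss(l3d, lt, lm, l2d, lam_t=0.5, lam_m=20.0, lam_2d=0.5):
    """L_pose = L_3D + lam_t*L_t + lam_m*L_m + lam_2d*L_2D (defaults from the text)."""
    return l3d + lam_t * lt + lam_m * lm + lam_2d * l2d

def mesh_loss(l_vert, l_joint, l_normal, l_edge,
              lam_m=1.0, lam_j=1.0, lam_n=0.1, lam_e=20.0):
    """L_mesh = lam_m*L_meshV + lam_j*L_joint3D + lam_n*L_normal + lam_e*L_edge."""
    return lam_m * l_vert + lam_j * l_joint + lam_n * l_normal + lam_e * l_edge

# Mocked scalar loss terms; note how the large lam_m = 20 amplifies the
# velocity term and lam_e = 20 the edge term relative to the others.
print(pose_lifting_loss(1.0, 0.2, 0.05, 0.3))  # approx. 2.25
print(mesh_loss(0.5, 0.4, 0.2, 0.01))          # approx. 1.12
```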

5. Experimental Evaluation and Results

HMRMamba was benchmarked on 3DPW, MPI-INF-3DHP, and Human3.6M with standard metrics (MPJPE, PA-MPJPE, MPVPE, and acceleration error):

| Method | 3DPW MPJPE (mm) | 3DPW PA-MPJPE (mm) | 3DPW MPVPE (mm) | MPI-INF-3DHP MPJPE (mm) | H3.6M MPJPE (mm) |
|--------|-----------------|--------------------|-----------------|-------------------------|------------------|
| PMCE   | 69.5            | 46.7               | 84.8            | 79.7                    | 53.5             |
| ARTS   | 67.7            | 46.5               | 81.4            | 71.8                    | 51.6             |
| Ours-S | 66.9            | 46.3               | 81.4            | 70.1                    | 51.2             |
| Ours-L | 64.8            | 45.5               | 79.8            | 68.3                    | 49.3             |

HMRMamba achieves state-of-the-art results on all core metrics, offering lower error with fewer parameters and comparable or reduced FLOPs relative to earlier transformer or SSM approaches.

Several ablation studies demonstrate:

  • Improved performance with both dual-scan alignment and explicit/implicit motion cues.
  • Robustness to 2D pose detector choice.
  • Efficiency in parameter and FLOP count over competing models.

6. Implementation Details

Key parameters and protocol:

  • Image features from ResNet-50 (SPIN-pretrained); 2D estimator: CPN or ViTPose.
  • Sequence length T=16T=16, stride 4 frames.
  • Pose lifting: 3–5 Mamba SSM layers, dimension 256–512.
  • Mesh reconstruction: additional SSM and attention layers.
  • Training: Adam (lr $= 2 \times 10^{-4}$), batch size 64, 100+20 epochs, weight decay 0.01 for the pose stage and 0.0 for the mesh head.
  • Implemented on RTX 4090 GPU.
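
The settings above can be collected into a single configuration sketch. Key names are hypothetical (the source does not give a config schema), and the 100+20 epoch split is read here as pose pretraining followed by mesh fine-tuning, which is an assumption.

```python
# Hypothetical training configuration assembling the hyperparameters listed above.
config = {
    "backbone": "resnet50_spin",     # ResNet-50, SPIN-pretrained
    "pose_2d_detector": "cpn",       # or "vitpose"
    "seq_len": 16,
    "stride": 4,
    "lift_layers": 4,                # 3-5 reported
    "hidden_dim": 256,               # 256-512 reported
    "optimizer": "adam",
    "lr": 2e-4,
    "batch_size": 64,
    "epochs": (100, 20),             # assumed: pose pretrain + mesh fine-tune
    "weight_decay_pose": 0.01,
    "weight_decay_mesh": 0.0,
}
print(config["seq_len"], config["lr"])
```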

7. Limitations and Future Directions

HMRMamba currently relies on supervised 2D/3D correspondence and does not explicitly enforce bone-length or surface-based losses. Potential research avenues include:

  • Incorporation of explicit kinematic/surface regularizers.
  • Extension to unsupervised or weakly supervised and real-time or mobile HMR.
  • Further exploration of SSM compression and task-adaptive temporal modeling (Chen et al., 29 Jan 2026).
