
Geometry-Aware Lifting Module for 3D HMR

Updated 5 February 2026
  • Geometry-Aware Lifting Module is a technique that lifts 2D joints and image features to coherent, kinematically consistent 3D pose sequences using state-space models.
  • It employs a dual-scan design that fuses global temporal context with local kinematic structure to ensure robust spatial-temporal feature alignment.
  • Empirical evaluations demonstrate state-of-the-art improvements in MPJPE, PA-MPJPE, and acceleration errors compared to traditional transformer or RNN-based HMR methods.

HMRMamba is a state-space model (SSM) based architecture for video-based 3D human mesh recovery (HMR), specifically designed to address limitations of prior approaches in temporal coherence, long-range dependency capture, and physically plausible 3D reconstruction. Unlike conventional transformer or RNN-based HMR pipelines, HMRMamba leverages Mamba-parameterized SSMs to enable efficient, long-range temporal modeling, and introduces a geometry-aware dual-scan mechanism for robust pose-anchored mesh reconstruction. The framework is empirically validated as state-of-the-art on major HMR benchmarks (Chen et al., 29 Jan 2026).

1. Architectural Principles and Design

HMRMamba consists of a two-stage pipeline: (1) a Geometry-Aware Lifting Module that lifts per-frame 2D joints and deep image features to a reliable, kinematically consistent 3D pose anchor sequence; (2) a Motion-Guided Reconstruction Network that regresses full 3D body meshes from these pose anchors, explicitly incorporating both short- and long-range motion. Both stages employ continuous-time SSMs with Mamba parameterizations, solved via zero-order-hold (ZOH) discretization:

$$\frac{dh(t)}{dt} = A h(t) + B x(t), \qquad y(t) = C h(t)$$

becoming

$$h_t = \bar{A} h_{t-1} + \bar{B} x_t, \qquad y_t = C h_t$$

where $(A, B, C)$ are the state, input, and output matrices, respectively, and $\bar{A}$, $\bar{B}$ are their ZOH-discretized counterparts. The Mamba parameterization allows this recurrence to be computed efficiently as a global 1D convolution over the sequence.
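
The discretization and recurrence above can be sketched in a few lines of numpy. This is an illustrative toy, not the paper's implementation: $A$ is taken to be diagonal (as in Mamba-style SSMs), so $\exp(\Delta A)$ is elementwise, and all dimensions and function names are invented for the example.

```python
import numpy as np

def discretize_zoh(A_diag, B, dt):
    """ZOH discretization for a diagonal state matrix:
    A_bar = exp(dt*A), B_bar = (exp(dt*A) - 1) / A * B (elementwise in the state dim)."""
    A_bar = np.exp(dt * A_diag)                       # (d_state,)
    B_bar = ((A_bar - 1.0) / A_diag)[:, None] * B     # (d_state, d_in)
    return A_bar, B_bar

def ssm_scan(A_bar, B_bar, C, xs):
    """Run the discrete recurrence h_t = A_bar h_{t-1} + B_bar x_t, y_t = C h_t."""
    h = np.zeros(A_bar.shape[0])
    ys = []
    for x in xs:                                      # xs: (T, d_in)
        h = A_bar * h + B_bar @ x                     # diagonal A => elementwise product
        ys.append(C @ h)
    return np.stack(ys)                               # (T, d_out)

rng = np.random.default_rng(0)
A = -np.abs(rng.standard_normal(8)) - 0.1             # stable (negative) diagonal entries
B = rng.standard_normal((8, 4))
C = rng.standard_normal((2, 8))
A_bar, B_bar = discretize_zoh(A, B, dt=0.1)
y = ssm_scan(A_bar, B_bar, C, rng.standard_normal((16, 4)))
print(y.shape)  # (16, 2)
```

For constant $(\bar{A}, \bar{B}, C)$ this sequential scan is equivalent to a single long 1D convolution; Mamba makes the parameters input-dependent while keeping a comparable cost via a hardware-aware scan.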

A unique dual-scan architecture is employed within the SSM blocks: one scan follows the global linear joint or time index; the other is a local traversal (e.g., bone chain in the skeleton tree) to encode biomechanical priors. The outputs are fused multiplicatively after nonlinear transformation.

2. Geometry-Aware Lifting Module

The lifting module generates a stable 3D joint sequence $P_{3D} \in \mathbb{R}^{T \times J \times 3}$ from 2D input ($P_{2D}$, $F_{img}$) via a Spatial-Temporal Alignment (STA)-Mamba block:

  • Spatial Encoding: The fused per-joint feature vector is processed by a spatial Mamba block, aligning intra-frame structure.
  • Deformable Alignment: Deformable attention samples image features at learned offsets, refining spatial encoding for each joint through a weighted multi-head aggregation.
  • Temporal Modeling: Temporal Mamba block propagates context through the sequence, furnishing long-range dependencies necessary for coherent tracking over time.
  • Dual-Scan Block: Each Mamba layer performs both a global scan (linear in joint or time dimension) and a local scan (kinematic tree order), fusing outputs as

$$O_{\text{fused}} = \sigma(\mathrm{Conv1D}(O_{\text{global}})) \odot O_{\text{local}}, \qquad \sigma = \mathrm{SiLU}$$

yielding pose features informed by both global context and local kinematic structure.
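
The dual-scan fusion can be sketched as follows. Everything here is an illustrative stand-in: `toy_scan` (a causal running mean) substitutes for a real Mamba scan, the random `tree_order` permutation substitutes for the actual kinematic-tree traversal, and the depthwise convolution plays the role of the `Conv1D` in the fusion formula.

```python
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))

def toy_scan(x):
    """Placeholder causal scan (running mean) along the joint axis; x: (B, J, C)."""
    t = np.arange(1, x.shape[1] + 1).reshape(1, -1, 1)
    return np.cumsum(x, axis=1) / t

def depthwise_conv1d(x, kernel):
    """'Same'-padded depthwise 1D convolution over the joint axis; kernel: (K, C)."""
    K = kernel.shape[0]
    pad = K // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (0, 0)))
    out = np.zeros_like(x)
    for i in range(K):
        out += xp[:, i:i + x.shape[1], :] * kernel[i]
    return out

def dual_scan_fuse(feats, tree_order, kernel):
    """O_fused = SiLU(Conv1D(O_global)) ⊙ O_local, feats: (B, J, C)."""
    o_global = toy_scan(feats)                          # scan in linear joint order
    inv = np.argsort(tree_order)                        # inverse permutation
    o_local = toy_scan(feats[:, tree_order])[:, inv]    # scan in tree order, undo permutation
    return silu(depthwise_conv1d(o_global, kernel)) * o_local

rng = np.random.default_rng(0)
feats = rng.standard_normal((2, 17, 32))                # batch of 2, 17 joints, 32 channels
tree_order = rng.permutation(17)                        # stand-in for a bone-chain traversal
kernel = rng.standard_normal((3, 32)) * 0.1
fused = dual_scan_fuse(feats, tree_order, kernel)
print(fused.shape)  # (2, 17, 32)
```

The design choice to fuse multiplicatively (a SiLU-gated Hadamard product rather than addition) lets the globally scanned features gate the locally scanned ones, so biomechanical structure modulates, rather than merely adds to, the global context.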

3. Motion-Guided Reconstruction Network

The reconstruction network synthesizes mesh vertex sequences $V_{\text{mesh}} \in \mathbb{R}^{T \times 6890 \times 3}$ using explicit and implicit motion cues derived from $P_{3D}$:

  • Motion Encoding: Explicit motion is the finite difference $M_{\mathrm{exp}}^{(t)} = P_{3D}^{(t)} - P_{3D}^{(t-1)}$; implicit motion is a learnable embedding added to image features.
  • Motion-Aware Attention: Mesh features are computed via attention, with queries from image features and keys/values from projected explicit and implicit motion: $F_{\mathrm{motion}} = \mathrm{softmax}\!\left(\frac{Q K^T}{\sqrt{d_k}}\right) V$
  • Regression Head: A terminal MLP regresses the full set of mesh vertices from these temporally enhanced representations.
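
The motion-encoding and attention steps above can be sketched in numpy. Assumptions to note: the concatenation of explicit and implicit motion before projection, the projection matrices `Wq`/`Wk`/`Wv`, single-head attention, and all dimensions are illustrative choices, not details given in the source.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def motion_aware_attention(f_img, p3d, m_implicit, Wq, Wk, Wv):
    """f_img: (T, d) image features, p3d: (T, J, 3) pose anchors,
    m_implicit: (T, d) learnable motion embedding. Returns (T, d)."""
    # Explicit motion: frame-to-frame finite difference of the 3D pose anchors
    # (first frame padded so the difference there is zero).
    m_exp = np.diff(p3d, axis=0, prepend=p3d[:1]).reshape(p3d.shape[0], -1)  # (T, J*3)
    motion = np.concatenate([m_exp, m_implicit], axis=-1)                    # (T, J*3 + d)
    Q, K, V = f_img @ Wq, motion @ Wk, motion @ Wv
    attn = softmax(Q @ K.T / np.sqrt(K.shape[-1]))                           # (T, T)
    return attn @ V

T, J, d = 16, 17, 64
rng = np.random.default_rng(0)
out = motion_aware_attention(
    rng.standard_normal((T, d)),          # image features (queries)
    rng.standard_normal((T, J, 3)),       # 3D pose anchors
    rng.standard_normal((T, d)),          # implicit motion embedding
    rng.standard_normal((d, d)),          # Wq
    rng.standard_normal((J * 3 + d, d)),  # Wk
    rng.standard_normal((J * 3 + d, d)))  # Wv
print(out.shape)  # (16, 64)
```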

4. Optimization and Loss Functions

Two complementary objectives are used:

  1. Pose-Lifting Loss:

$$\mathcal{L}_{\mathrm{pose}} = \mathcal{L}_{3D} + \lambda_t \mathcal{L}_t + \lambda_m \mathcal{L}_m + \lambda_{2D} \mathcal{L}_{2D}$$

where $\mathcal{L}_{3D}$ is the standard MPJPE loss, $\mathcal{L}_t$ a temporal-consistency term, $\mathcal{L}_m$ a velocity error (MPJVE), and $\mathcal{L}_{2D}$ the 2D joint reprojection loss. Default values for $(\lambda_t, \lambda_m, \lambda_{2D})$ are $(0.5, 20, 0.5)$.

  2. Mesh Loss:

$$\mathcal{L}_{\mathrm{mesh}} = \lambda_m \mathcal{L}_{\mathrm{meshV}} + \lambda_j \mathcal{L}_{\mathrm{joint3D}} + \lambda_n \mathcal{L}_{\mathrm{normal}} + \lambda_e \mathcal{L}_{\mathrm{edge}}$$

with respective weights $(\lambda_m, \lambda_j, \lambda_n, \lambda_e) = (1, 1, 0.1, 20)$.

No explicit bone-length or surface regularization is mandated; geometric consistency is predominantly enforced through architecture and these losses.
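
As a quick sanity check of how the weights combine, the two objectives reduce to simple weighted sums. The individual loss terms below are mocked as scalars; only the default weights come from the text.

```python
def pose_lifting_loss(l3d, lt, lm, l2d, lam_t=0.5, lam_m=20.0, lam_2d=0.5):
    """L_pose = L_3D + lam_t*L_t + lam_m*L_m + lam_2d*L_2D (defaults from the text)."""
    return l3d + lam_t * lt + lam_m * lm + lam_2d * l2d

def mesh_loss(l_vert, l_joint, l_normal, l_edge,
              lam_m=1.0, lam_j=1.0, lam_n=0.1, lam_e=20.0):
    """L_mesh = lam_m*L_meshV + lam_j*L_joint3D + lam_n*L_normal + lam_e*L_edge."""
    return lam_m * l_vert + lam_j * l_joint + lam_n * l_normal + lam_e * l_edge

# Mocked scalar loss terms; note how the large lam_m = 20 amplifies the
# velocity term and lam_e = 20 the edge term relative to the others.
print(pose_lifting_loss(1.0, 0.2, 0.05, 0.3))  # approx. 2.25
print(mesh_loss(0.5, 0.4, 0.2, 0.01))          # approx. 1.12
```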

5. Experimental Evaluation and Results

HMRMamba was benchmarked on 3DPW, MPI-INF-3DHP, and Human3.6M with standard metrics (MPJPE, PA-MPJPE, MPVPE, and acceleration error):

| Method | 3DPW MPJPE (mm) | 3DPW PA-MPJPE (mm) | 3DPW MPVPE (mm) | MPI-INF-3DHP MPJPE (mm) | H3.6M MPJPE (mm) |
|--------|-----------------|--------------------|-----------------|-------------------------|------------------|
| PMCE   | 69.5            | 46.7               | 84.8            | 79.7                    | 53.5             |
| ARTS   | 67.7            | 46.5               | 81.4            | 71.8                    | 51.6             |
| Ours-S | 66.9            | 46.3               | 81.4            | 70.1                    | 51.2             |
| Ours-L | 64.8            | 45.5               | 79.8            | 68.3                    | 49.3             |

HMRMamba achieves state-of-the-art results on all core metrics, offering lower error with fewer parameters and comparable or reduced FLOPs relative to earlier transformer or SSM approaches.

Several ablation studies demonstrate:

  • Improved performance with both dual-scan alignment and explicit/implicit motion cues.
  • Robustness to 2D pose detector choice.
  • Efficiency in parameter and FLOP count over competing models.

6. Implementation Details

Key parameters and protocol:

  • Image features from ResNet-50 (SPIN-pretrained); 2D estimator: CPN or ViTPose.
  • Sequence length T=16T=16, stride 4 frames.
  • Pose lifting: 3–5 Mamba SSM layers, dimension 256–512.
  • Mesh reconstruction: additional SSM and attention layers.
  • Training: Adam (lr $= 2 \times 10^{-4}$), batch size 64, 100+20 epochs, weight decay 0.01 for the pose stage and 0.0 for the mesh head.
  • Implemented on RTX 4090 GPU.
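
The settings above can be collected into a single configuration sketch. Key names are hypothetical (the source does not give a config schema), and the 100+20 epoch split is read here as pose pretraining followed by mesh fine-tuning, which is an assumption.

```python
# Hypothetical training configuration assembling the hyperparameters listed above.
config = {
    "backbone": "resnet50_spin",     # ResNet-50, SPIN-pretrained
    "pose_2d_detector": "cpn",       # or "vitpose"
    "seq_len": 16,
    "stride": 4,
    "lift_layers": 4,                # 3-5 reported
    "hidden_dim": 256,               # 256-512 reported
    "optimizer": "adam",
    "lr": 2e-4,
    "batch_size": 64,
    "epochs": (100, 20),             # assumed: pose pretrain + mesh fine-tune
    "weight_decay_pose": 0.01,
    "weight_decay_mesh": 0.0,
}
print(config["seq_len"], config["lr"])
```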

7. Limitations and Future Directions

HMRMamba currently relies on supervised 2D/3D correspondence and does not explicitly enforce bone-length or surface-based losses. Potential research avenues include:

  • Incorporation of explicit kinematic/surface regularizers.
  • Extension to unsupervised or weakly supervised and real-time or mobile HMR.
  • Further exploration of SSM compression and task-adaptive temporal modeling (Chen et al., 29 Jan 2026).
