Geometry-Aware Lifting Module for 3D HMR
- The Geometry-Aware Lifting Module lifts 2D joints and image features to coherent, kinematically consistent 3D pose sequences using state-space models.
- It employs a dual-scan design that fuses global temporal context with local kinematic structure to ensure robust spatial-temporal feature alignment.
- Empirical evaluations demonstrate state-of-the-art MPJPE, PA-MPJPE, and acceleration error relative to transformer- and RNN-based HMR methods.
HMRMamba is a state-space model (SSM) based architecture for video-based 3D human mesh recovery (HMR), specifically designed to address limitations of prior approaches in temporal coherence, long-range dependency capture, and physically plausible 3D reconstruction. Unlike conventional transformer or RNN-based HMR pipelines, HMRMamba leverages Mamba-parameterized SSMs to enable efficient, long-range temporal modeling, and introduces a geometry-aware dual-scan mechanism for robust pose-anchored mesh reconstruction. The framework is empirically validated as state-of-the-art on major HMR benchmarks (Chen et al., 29 Jan 2026).
1. Architectural Principles and Design
HMRMamba consists of a two-stage pipeline: (1) a Geometry-Aware Lifting Module that lifts per-frame 2D joints and deep image features to a reliable, kinematically consistent 3D pose anchor sequence; (2) a Motion-Guided Reconstruction Network that regresses full 3D body meshes from these pose anchors, explicitly incorporating both short- and long-range motion. Both stages employ continuous-time SSMs with Mamba parameterizations,

$$h'(t) = A\,h(t) + B\,x(t), \qquad y(t) = C\,h(t),$$

solved via zero-order hold (ZOH) discretization with step size $\Delta$, becoming

$$h_t = \bar{A}\,h_{t-1} + \bar{B}\,x_t, \qquad y_t = C\,h_t,$$

where $(A, B, C)$ are the state, input, and output matrices, respectively, and $\bar{A} = \exp(\Delta A)$, $\bar{B} = (\Delta A)^{-1}(\exp(\Delta A) - I)\,\Delta B$ are their ZOH-discretized counterparts. Mamba enables efficient global 1D convolutions for sequence modeling.
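The continuous-to-discrete step can be sketched numerically. The snippet below assumes a diagonal state matrix (as in Mamba), so the ZOH matrix exponential reduces to an elementwise exponential; all names and shapes are illustrative, not taken from the paper.

```python
import numpy as np

def zoh_discretize(A, B, dt):
    """ZOH-discretize a diagonal SSM: A_bar = exp(dt*A), B_bar = (dt*A)^-1 (exp(dt*A) - I) dt*B."""
    A_bar = np.exp(dt * A)                 # elementwise, since A is diagonal
    B_bar = (A_bar - 1.0) / A * B          # simplifies to (exp(dt*A) - 1) / A * B
    return A_bar, B_bar

def ssm_scan(x, A, B, C, dt=0.1):
    """Sequential scan of h_t = A_bar * h_{t-1} + B_bar * x_t, y_t = C . h_t."""
    A_bar, B_bar = zoh_discretize(A, B, dt)
    h = np.zeros_like(A)
    ys = []
    for x_t in x:                          # one recurrence step per frame
        h = A_bar * h + B_bar * x_t
        ys.append(C @ h)
    return np.array(ys)
```

With negative diagonal entries in $A$ the recurrence is stable; Mamba additionally makes $(B, C, \Delta)$ input-dependent, which this fixed-parameter sketch omits.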
A unique dual-scan architecture is employed within the SSM blocks: one scan follows the global linear joint or time index; the other is a local traversal (e.g., bone chain in the skeleton tree) to encode biomechanical priors. The outputs are fused multiplicatively after nonlinear transformation.
2. Geometry-Aware Lifting Module
The lifting module generates a stable 3D joint sequence from the input 2D joints and per-frame image features via a Spatial-Temporal Alignment (STA)-Mamba block:
- Spatial Encoding: The fused per-joint feature vector is processed by a spatial Mamba block, aligning intra-frame structure.
- Deformable Alignment: Deformable attention samples image features at learned offsets, refining spatial encoding for each joint through a weighted multi-head aggregation.
- Temporal Modeling: A temporal Mamba block propagates context through the sequence, furnishing the long-range dependencies necessary for coherent tracking over time.
- Dual-Scan Block: Each Mamba layer performs both a global scan (linear in the joint or time dimension) and a local scan (kinematic-tree order), fusing outputs as $\mathbf{z} = \phi(\mathbf{z}_{\text{global}}) \odot \phi(\mathbf{z}_{\text{local}})$ for a nonlinearity $\phi$, yielding pose features informed by both global context and local kinematic structure.
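A minimal sketch of the dual-scan fusion pattern follows, with a causal exponential moving average standing in for a real Mamba scan; the 5-joint tree traversal `tree_order` is hypothetical, and only the scan-permute-fuse structure mirrors the description above.

```python
import numpy as np

def causal_scan(x, decay=0.9):
    """Toy stand-in for a Mamba scan: causal exponential moving average over axis 0."""
    h = np.zeros(x.shape[1:])
    out = np.empty_like(x)
    for i in range(x.shape[0]):
        h = decay * h + (1.0 - decay) * x[i]
        out[i] = h
    return out

def dual_scan_fuse(joint_feats, tree_order):
    """Dual-scan fusion: a global scan in flat joint order and a local scan in
    kinematic-tree order, combined multiplicatively after a nonlinearity."""
    z_global = causal_scan(joint_feats)                    # global scan
    inv = np.argsort(tree_order)                           # inverse permutation
    z_local = causal_scan(joint_feats[tree_order])[inv]    # local scan, mapped back
    phi = np.tanh                                          # nonlinear transform
    return phi(z_global) * phi(z_local)                    # multiplicative fusion

# Hypothetical preorder traversal of a 5-joint skeleton (illustrative only).
tree_order = np.array([0, 1, 3, 2, 4])
feats = np.random.default_rng(0).normal(size=(5, 8))       # (joints, channels)
fused = dual_scan_fuse(feats, tree_order)
```

The permutation/inverse-permutation pair is the key design point: the local scan sees features in biomechanical order, but its output is re-indexed so both branches align joint-for-joint before fusion.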
3. Motion-Guided Reconstruction Network
The reconstruction network synthesizes mesh vertex sequences from the 3D pose anchors, drawing on explicit and implicit motion cues:
- Motion Encoding: Explicit motion is the finite difference of consecutive pose anchors, $\Delta J_t = J_t - J_{t-1}$; implicit motion is a learnable embedding added to the image features.
- Motion-Aware Attention: Mesh features are computed via cross-attention, $\text{Attn}(Q, K, V) = \text{softmax}(QK^\top/\sqrt{d})\,V$, with queries $Q$ drawn from image features and keys/values $K, V$ from the projected explicit and implicit motion.
- Regression Head: A terminal MLP regresses the full set of mesh vertices from these temporally enhanced representations.
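The motion-guided attention step can be sketched as single-head cross-attention; the projection matrices `Wq`, `Wk`, `Wv` and all shapes are hypothetical placeholders for learnable parameters.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def motion_aware_attention(img_feats, pose_anchors, implicit_motion, Wq, Wk, Wv):
    """Single-head cross-attention sketch: queries from image features,
    keys/values from concatenated explicit + implicit motion cues."""
    # explicit motion: finite difference of pose anchors (first frame padded with zero motion)
    explicit = np.diff(pose_anchors, axis=0, prepend=pose_anchors[:1])
    motion = np.concatenate([explicit, implicit_motion], axis=-1)
    Q = img_feats @ Wq
    K = motion @ Wk
    V = motion @ Wv
    d = Q.shape[-1]
    attn = softmax(Q @ K.T / np.sqrt(d))   # (T, T) frame-to-frame attention
    return attn @ V
```

A terminal MLP (not shown) would map these temporally enhanced features to mesh vertices, per the regression head described above.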
4. Optimization and Loss Functions
Two complementary objectives are used:
- Pose-Lifting Loss: $\mathcal{L}_{\text{lift}} = \mathcal{L}_{\text{3D}} + \lambda_1 \mathcal{L}_{\text{temp}} + \lambda_2 \mathcal{L}_{\text{vel}} + \lambda_3 \mathcal{L}_{\text{2D}}$, where $\mathcal{L}_{\text{3D}}$ is standard MPJPE, $\mathcal{L}_{\text{temp}}$ is a temporal-consistency term, $\mathcal{L}_{\text{vel}}$ is the velocity error (MPJVE), and $\mathcal{L}_{\text{2D}}$ is the 2D joint reprojection loss. Default weights are $(\lambda_1, \lambda_2, \lambda_3) = (0.5, 20, 0.5)$.
- Mesh Loss: a weighted combination of four mesh-supervision terms with respective weights (1, 1, 0.1, 20).
No explicit bone-length or surface regularization is mandated; geometric consistency is predominantly enforced through architecture and these losses.
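A sketch of the pose-lifting objective follows, assuming plain L2/L1 forms for each term; the source fixes only the weights (0.5, 20, 0.5), so the exact term definitions below (temporal smoothness, velocity matching, 2D reprojection) are assumptions.

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean per-joint position error over (T, J, 3) arrays."""
    return np.linalg.norm(pred - gt, axis=-1).mean()

def pose_lifting_loss(pred3d, gt3d, pred2d, gt2d, weights=(0.5, 20, 0.5)):
    """Pose-lifting objective: MPJPE plus weighted temporal, velocity (MPJVE),
    and 2D reprojection terms."""
    l_3d = mpjpe(pred3d, gt3d)
    # temporal consistency (assumed form): penalize frame-to-frame jitter
    l_temp = np.linalg.norm(np.diff(pred3d, axis=0), axis=-1).mean()
    # velocity error (MPJVE): match predicted and ground-truth joint velocities
    l_vel = mpjpe(np.diff(pred3d, axis=0), np.diff(gt3d, axis=0))
    # 2D reprojection (assumed form): L1 distance to detected 2D joints
    l_2d = np.abs(pred2d - gt2d).mean()
    w_temp, w_vel, w_2d = weights
    return l_3d + w_temp * l_temp + w_vel * l_vel + w_2d * l_2d
```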
5. Experimental Evaluation and Results
HMRMamba was benchmarked on 3DPW, MPI-INF-3DHP, and Human3.6M with standard metrics (MPJPE, PA-MPJPE, MPVPE, and acceleration error):
| Method | 3DPW MPJPE | PA-MPJPE | MPVPE | MPI-INF MPJPE | H3.6M MPJPE |
|---|---|---|---|---|---|
| PMCE | 69.5 | 46.7 | 84.8 | 79.7 | 53.5 |
| ARTS | 67.7 | 46.5 | 81.4 | 71.8 | 51.6 |
| Ours-S | 66.9 | 46.3 | 81.4 | 70.1 | 51.2 |
| Ours-L | 64.8 | 45.5 | 79.8 | 68.3 | 49.3 |
HMRMamba achieves state-of-the-art on all core metrics, offering lower error with reduced parameters and comparable or reduced FLOPs relative to earlier transformer or SSM approaches.
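The metrics in the table can be computed as follows; this is a standard implementation sketch (per-frame similarity Procrustes alignment for PA-MPJPE, second finite differences for acceleration error), not the authors' evaluation code.

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean per-joint position error over (T, J, 3) sequences."""
    return float(np.linalg.norm(pred - gt, axis=-1).mean())

def pa_mpjpe(pred, gt):
    """MPJPE after per-frame similarity (Procrustes) alignment of pred to gt."""
    errs = []
    for P, G in zip(pred, gt):                      # P, G: (J, 3)
        P0, G0 = P - P.mean(0), G - G.mean(0)
        U, S, Vt = np.linalg.svd(P0.T @ G0)
        sign = np.sign(np.linalg.det(U @ Vt))       # avoid reflections
        S[-1] *= sign
        U[:, -1] *= sign
        R = U @ Vt                                  # optimal rotation
        s = S.sum() / (P0 ** 2).sum()               # optimal scale
        errs.append(np.linalg.norm(s * P0 @ R - G0, axis=-1).mean())
    return float(np.mean(errs))

def accel_error(pred, gt):
    """Mean difference of second finite differences (joint acceleration error)."""
    ap = pred[2:] - 2 * pred[1:-1] + pred[:-2]
    ag = gt[2:] - 2 * gt[1:-1] + gt[:-2]
    return float(np.linalg.norm(ap - ag, axis=-1).mean())
```

PA-MPJPE factors out global rotation, translation, and scale, isolating articulated pose error; the acceleration error is the jitter metric the temporal modeling targets.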
Several ablation studies demonstrate:
- Improved performance with both dual-scan alignment and explicit/implicit motion cues.
- Robustness to 2D pose detector choice.
- Efficiency in parameter and FLOP count over competing models.
6. Implementation Details
Key parameters and protocol:
- Image features from ResNet-50 (SPIN-pretrained); 2D estimator: CPN or ViTPose.
- Sequence length , stride 4 frames.
- Pose lifting: 3–5 Mamba SSM layers, dimension 256–512.
- Mesh reconstruction: additional SSM and attention layers.
- Training: Adam(lr=), batch 64, 100+20 epochs, weight decay 0.01 for pose, 0.00 for mesh head.
- Implemented on RTX 4090 GPU.
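The reported protocol can be collected into a single configuration sketch; the learning rate is not given in this summary, so it is left unset rather than guessed, and the "100+20 epochs" figure is read as a pose-lifting then mesh-head split (an assumption).

```python
# Configuration sketch assembled from the reported protocol; field names are
# illustrative, and values marked None are elided in the source.
TRAIN_CONFIG = {
    "backbone": "ResNet-50 (SPIN-pretrained)",
    "pose_2d_estimator": ("CPN", "ViTPose"),
    "temporal_stride": 4,
    "optimizer": "Adam",
    "lr": None,                                        # elided in the source
    "batch_size": 64,
    "epochs": {"pose_lifting": 100, "mesh_head": 20},  # assumed split of "100+20"
    "weight_decay": {"pose_lifting": 0.01, "mesh_head": 0.0},
    "pose_lifting_layers": (3, 5),                     # 3-5 Mamba SSM layers
    "hidden_dim": (256, 512),
    "hardware": "RTX 4090",
}
```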
7. Limitations and Future Directions
HMRMamba currently relies on supervised 2D/3D correspondence and does not explicitly enforce bone-length or surface-based losses. Potential research avenues include:
- Incorporation of explicit kinematic/surface regularizers.
- Extension to unsupervised or weakly supervised settings, and to real-time or mobile HMR.
- Further exploration of SSM compression and task-adaptive temporal modeling (Chen et al., 29 Jan 2026).
References
- "Towards Geometry-Aware and Motion-Guided Video Human Mesh Recovery" (Chen et al., 29 Jan 2026)