
HMRMamba: Advanced 3D Mesh Recovery

Updated 5 February 2026
  • HMRMamba is a state-of-the-art video-based 3D human mesh recovery approach that leverages structured state space models and dual-scan modules.
  • It integrates geometry-aware lifting and motion-guided reconstruction to overcome challenges like occlusion and ambiguous poses, achieving lower MPJPE and smoother dynamics.
  • The framework employs efficient temporal encoding and attention mechanisms to deliver coherent 3D mesh outputs with reduced computational overhead.

HMRMamba is a state-of-the-art approach for video-based 3D Human Mesh Recovery (HMR) that leverages structured state space models (SSMs)—specifically, the Mamba architecture—to overcome persistent challenges in prior methods. By integrating geometry-aware and motion-guided modules, HMRMamba achieves robustness and efficiency in reconstructing temporally coherent, physically plausible human meshes from monocular video sequences (Chen et al., 29 Jan 2026).

1. Motivation and Problem Formulation

The domain of 3D human mesh recovery from video demands reconciling two dominant difficulties in prior art: (i) reliance on unreliable intermediate 3D pose anchors, which can propagate errors and lead to physically implausible mesh dynamics; and (ii) inability of existing sequence models (e.g., RNNs, Transformers) to effectively capture complex, long-range spatiotemporal dependencies while maintaining computational efficiency. Standard approaches often fail in the presence of occlusion, motion blur, or ambiguous pose configurations.

HMRMamba introduces a novel two-stage architecture grounded in SSMs. The efficiency and linear-time complexity of SSMs, especially Mamba’s construction, allow for scalable, global temporal modeling and direct encoding of sequential inductive biases, which are crucial in this context.

2. Architectural Overview

The HMRMamba pipeline consists of two distinct modules:

  1. Geometry-Aware Lifting Module: Given a sequence of $T$ video frames $\{I_t\}_{t=1}^{T}$, per-frame image features $F_{\mathrm{img}} \in \mathbb{R}^{T \times D}$ are extracted using a ResNet-50 backbone (with $D = 2048$). A 2D pose detector produces $P_{2D} \in \mathbb{R}^{T \times J \times 2}$. These are fused via an MLP, then processed in a spatial-temporal alignment SSM block termed “STA-Mamba” to output robust 3D joint sequences $P_{3D} \in \mathbb{R}^{T \times J \times 3}$.
  2. Motion-Guided Reconstruction Network: Utilizing $P_{3D}$ as a reliable anchor concatenated with the original $F_{\mathrm{img}}$, this stage processes explicit (joint-wise) and implicit (learned) motion sequences for mesh regression. The output is a temporally consistent mesh $V_{\mathrm{mesh}} \in \mathbb{R}^{T \times N \times 3}$, with $N = 6890$ SMPL vertices.
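
The tensor interface between the two stages can be summarized in a short, shape-level sketch. The function names below are hypothetical stand-ins for the paper's modules (the bodies are stubs), and $J = 17$ joints is an assumed skeleton size; the remaining shapes follow the definitions above.

```python
import torch

T, J, D, N = 16, 17, 2048, 6890   # frames, joints (assumed), feature dim, SMPL vertices

def geometry_aware_lifting(f_img, p_2d):
    """Stage 1 (hypothetical stub): fuse image features with 2D joints and
    lift them to a 3D joint sequence via the STA-Mamba block."""
    assert f_img.shape == (T, D) and p_2d.shape == (T, J, 2)
    return torch.zeros(T, J, 3)            # P_3D: robust 3D joint anchors

def motion_guided_reconstruction(p_3d, f_img):
    """Stage 2 (hypothetical stub): regress a temporally consistent SMPL mesh
    from the 3D anchors concatenated with the original image features."""
    assert p_3d.shape == (T, J, 3) and f_img.shape == (T, D)
    return torch.zeros(T, N, 3)            # V_mesh: per-frame SMPL vertices

f_img = torch.randn(T, D)                  # ResNet-50 features, one per frame
p_2d = torch.randn(T, J, 2)                # detected 2D keypoints
v_mesh = motion_guided_reconstruction(geometry_aware_lifting(f_img, p_2d), f_img)
print(v_mesh.shape)                        # torch.Size([16, 6890, 3])
```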

Both modules leverage Mamba-based SSMs. SSMs are parameterized as discrete-time linear systems (zero-order hold discretization) and implemented as global one-dimensional convolutions for efficient sequence modeling.
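
The following sketch illustrates this SSM machinery in isolation: a diagonal, single-channel state space layer is discretized with a zero-order hold and then evaluated both as a sequential scan and as the equivalent global 1D convolution. This is a generic, time-invariant SSM illustration (Mamba additionally makes the discretization step and projections input-dependent), not the paper's exact parameterization.

```python
import numpy as np

def zoh_discretize(A, B, delta):
    """Zero-order-hold discretization of a diagonal continuous-time SSM.
    A, B: (state_dim,) diagonal parameters; delta: step size."""
    A_bar = np.exp(delta * A)
    B_bar = (A_bar - 1.0) / A * B
    return A_bar, B_bar

def ssm_recurrence(u, A_bar, B_bar, C):
    """Sequential scan: x_t = A_bar * x_{t-1} + B_bar * u_t, y_t = <C, x_t>."""
    x = np.zeros_like(A_bar)
    ys = []
    for u_t in u:
        x = A_bar * x + B_bar * u_t
        ys.append(np.dot(C, x))
    return np.array(ys)

def ssm_convolution(u, A_bar, B_bar, C):
    """Equivalent global 1D convolution for fixed (time-invariant) parameters."""
    L = len(u)
    # Kernel K[k] = C . (A_bar^k * B_bar), k = 0..L-1
    K = np.array([np.dot(C, (A_bar ** k) * B_bar) for k in range(L)])
    return np.array([np.dot(K[: t + 1][::-1], u[: t + 1]) for t in range(L)])

rng = np.random.default_rng(0)
A = -np.abs(rng.standard_normal(8))        # stable (negative) diagonal A
B, C = rng.standard_normal(8), rng.standard_normal(8)
u = rng.standard_normal(32)                # a length-32 1D input sequence
A_bar, B_bar = zoh_discretize(A, B, delta=0.1)
y_scan = ssm_recurrence(u, A_bar, B_bar, C)
y_conv = ssm_convolution(u, A_bar, B_bar, C)
print(np.allclose(y_scan, y_conv))         # True: scan and convolution agree
```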

3. Core Components and Processing Flow

3.1 Geometry-Aware Lifting Module

  • Encoder Fusion: Image features $F_{\mathrm{img}}$ and detected 2D joints $P_{2D}$ are fused on a per-joint basis via a small MLP and concatenation.
  • STA-Mamba: Consists of the following sub-blocks:
    • Spatial Mamba processes intra-frame (structural) correlations.
    • Deformable Attention Alignment refines per-joint features by aggregating features at sampled, offset locations, with learned offsets and weights. Formally, for joint $i$:

      $F'_{\mathrm{spatial}}[i] = \sum_{m=1}^{M} W_m \left[\sum_{k=1}^{K} A_{m,i,k}\, \bigl(W'_m\, v(p_i + \Delta p_{m,i,k})\bigr)\right]$

      where $v(\cdot)$ samples image features, $\Delta p$ and $A$ are learned, $M$ denotes the number of attention heads, and $K$ the number of samples per head (a code sketch of this aggregation follows this list).
    • Temporal Mamba encodes sequential dynamics between frames.

  • Dual-Scan Mamba Block: For each SSM block, both global (sequence-indexed) and local (kinematic tree–ordered) scans are performed. Their outputs are fused as

$O_{\mathrm{fused}} = \sigma(\mathrm{Conv1D}(O_{\mathrm{global}})) \odot O_{\mathrm{local}}$

with $\sigma$ the SiLU activation and $\odot$ denoting elementwise multiplication.
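
Both formulas in this subsection can be made concrete with a minimal PyTorch sketch: a deformable-DETR-style reading of the per-joint aggregation, plus the gated dual-scan fusion. The tensor shapes, the [-1, 1] coordinate convention expected by grid_sample, and the assumption that the attention weights are already normalized over the $K$ samples are illustrative choices, not the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def deformable_aggregate(feat, ref_pts, offsets, attn, W_m, W_mp):
    """Deformable-attention-style aggregation (minimal sketch).
    feat:    (C, H, W) image feature map; v(.) samples it bilinearly
    ref_pts: (J, 2) reference points p_i in normalized [-1, 1] coordinates
    offsets: (M, J, K, 2) learned sampling offsets Delta p
    attn:    (M, J, K) attention weights A (softmax over K)
    W_m:     (M, C, C) per-head output projections
    W_mp:    (M, C, C) per-head value projections W'_m
    Returns  (J, C) refined per-joint features F'_spatial."""
    M, J, K, _ = offsets.shape
    C = feat.shape[0]
    # Sampling locations p_i + Delta p, flattened to a grid for grid_sample.
    loc = (ref_pts[None, :, None, :] + offsets).reshape(1, M * J * K, 1, 2)
    sampled = F.grid_sample(feat[None], loc, mode="bilinear",
                            align_corners=False)           # (1, C, M*J*K, 1)
    v = sampled.reshape(C, M, J, K).permute(1, 2, 3, 0)     # (M, J, K, C)
    v = torch.einsum("mcd,mjkd->mjkc", W_mp, v)             # W'_m v(.)
    head = (attn[..., None] * v).sum(dim=2)                 # sum over K samples
    return torch.einsum("mcd,mjd->jc", W_m, head)           # sum over heads

# Dual-scan fusion: O_fused = SiLU(Conv1D(O_global)) * O_local (elementwise)
def dual_scan_fuse(o_global, o_local, conv1d):
    """o_global, o_local: (B, C, L) scan outputs; conv1d: an nn.Conv1d over L."""
    return F.silu(conv1d(o_global)) * o_local

# Example shapes: J = 17 joints, C = 256 channels, M = 4 heads, K = 4 samples.
C, H, W, J, M, K = 256, 56, 56, 17, 4, 4
out = deformable_aggregate(
    torch.randn(C, H, W), torch.rand(J, 2) * 2 - 1,
    0.1 * torch.randn(M, J, K, 2), torch.softmax(torch.randn(M, J, K), dim=-1),
    torch.randn(M, C, C) / C ** 0.5, torch.randn(M, C, C) / C ** 0.5)
fused = dual_scan_fuse(torch.randn(1, C, J), torch.randn(1, C, J),
                       nn.Conv1d(C, C, kernel_size=3, padding=1))
print(out.shape, fused.shape)   # torch.Size([17, 256]) torch.Size([1, 256, 17])
```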

3.2 Motion-Guided Reconstruction Network

  • Motion Representations: Explicit motion signal $M_{\mathrm{exp}}^{(t)} = P_{3D}^{(t)} - P_{3D}^{(t-1)}$; implicit motion is a correction embedding.

  • Motion-Aware Attention: Queries are drawn from image features, while keys/values originate from both explicit and implicit motion representations.

$F_{\mathrm{motion}} = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$

  • Mesh Regression Head: Decodes $F_{\mathrm{motion}}$ to vertex positions $V_{\mathrm{mesh}}$.
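
A minimal sketch of this motion-guided stage is given below: explicit motion is the frame-to-frame difference of the lifted 3D joints, and a single-head cross-attention draws queries from image tokens and keys/values from the motion tokens. The token dimension, the single-head form, the joint count, and the way explicit/implicit tokens are concatenated are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class MotionAwareAttention(nn.Module):
    """Minimal sketch: cross-attention where queries come from image features
    and keys/values come from explicit + implicit motion tokens."""
    def __init__(self, dim):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)

    def forward(self, f_img, m_exp, m_imp):
        # f_img: (T, dim) image tokens; m_exp, m_imp: (T, dim) motion tokens
        motion = torch.cat([m_exp, m_imp], dim=0)           # (2T, dim)
        Q = self.q_proj(f_img)                              # (T, dim)
        K, V = self.k_proj(motion), self.v_proj(motion)     # (2T, dim)
        attn = torch.softmax(Q @ K.t() / K.shape[-1] ** 0.5, dim=-1)
        return attn @ V                                     # F_motion: (T, dim)

# Explicit motion as per-frame differences of the lifted 3D joints.
T, J, dim = 16, 17, 256                                     # J and dim assumed
p_3d = torch.randn(T, J, 3)
m_exp = torch.zeros_like(p_3d)
m_exp[1:] = p_3d[1:] - p_3d[:-1]                            # M_exp^(t)
m_exp_tok = nn.Linear(J * 3, dim)(m_exp.reshape(T, -1))     # embed to tokens
m_imp_tok = torch.randn(T, dim)                             # learned correction (stub)
f_img_tok = torch.randn(T, dim)
f_motion = MotionAwareAttention(dim)(f_img_tok, m_exp_tok, m_imp_tok)
print(f_motion.shape)                                       # torch.Size([16, 256])
```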

4. Training, Loss Functions, and Implementation

  • Pose-Lifting Stage Loss:

$\mathcal{L}_{\mathrm{pose}} = \mathcal{L}_{3D} + \lambda_t \mathcal{L}_t + \lambda_m \mathcal{L}_m + \lambda_{2D} \mathcal{L}_{2D}$

where $\mathcal{L}_{3D}$ is the mean per-joint position error (MPJPE), $\mathcal{L}_t$ is a temporal consistency term, $\mathcal{L}_m$ is the mean per-joint velocity error (MPJVE), $\mathcal{L}_{2D}$ is the 2D projection error, and $(\lambda_t, \lambda_m, \lambda_{2D}) = (0.5, 20, 0.5)$.

  • Mesh Recovery Loss:

$\mathcal{L}_{\mathrm{mesh}} = \lambda_m \mathcal{L}_{\mathrm{meshV}} + \lambda_j \mathcal{L}_{\mathrm{joint3D}} + \lambda_n \mathcal{L}_{\mathrm{normal}} + \lambda_e \mathcal{L}_{\mathrm{edge}}$

with $(\lambda_m, \lambda_j, \lambda_n, \lambda_e) = (1, 1, 0.1, 20)$.

  • Implementation:
    • ResNet-50 backbone initialized with SPIN-pretrained weights.
    • 2D pose detectors: CPN for Human3.6M, ViTPose for 3DPW/MPI-INF-3DHP.
    • Sequence length $T = 16$, stride 4.
    • SSM modules: 3-layer (“Ours-S”, 256-dim) or 5-layer (“Ours-L”); co-evolution blocks: 3 layers, dim=64.
    • Trained on RTX 4090.
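
As a concrete illustration of the loss weighting above, the sketch below composes the two stage objectives with the stated coefficients. The individual loss terms are passed in as precomputed scalars (in practice each would be an L1/L2-style error computed from predictions and ground truth), so the function names and stubbed example values are illustrative only.

```python
import torch

def pose_lifting_loss(l_3d, l_t, l_m, l_2d,
                      lambda_t=0.5, lambda_m=20.0, lambda_2d=0.5):
    """L_pose = L_3D + lam_t*L_t + lam_m*L_m + lam_2D*L_2D (weights from Section 4)."""
    return l_3d + lambda_t * l_t + lambda_m * l_m + lambda_2d * l_2d

def mesh_recovery_loss(l_mesh_v, l_joint3d, l_normal, l_edge,
                       lambda_m=1.0, lambda_j=1.0, lambda_n=0.1, lambda_e=20.0):
    """L_mesh = lam_m*L_meshV + lam_j*L_joint3D + lam_n*L_normal + lam_e*L_edge."""
    return (lambda_m * l_mesh_v + lambda_j * l_joint3d
            + lambda_n * l_normal + lambda_e * l_edge)

# Example with stubbed per-term values.
mpjpe, l_t, mpjve, l_2d = (torch.tensor(x) for x in (0.05, 0.01, 0.002, 0.03))
print(pose_lifting_loss(mpjpe, l_t, mpjve, l_2d))
```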

5. Quantitative Results and Ablation Analysis

5.1 Quantitative Results

Performance is reported on 3DPW, MPI-INF-3DHP, and Human3.6M. Key metrics include:

  • MPJPE: mean per-joint position error
  • PA-MPJPE: Procrustes-aligned MPJPE
  • MPVPE: per-vertex error
  • Accel: joint acceleration error

Model        3DPW MPJPE   3DPW PA-MPJPE   H3.6M MPJPE   H3.6M PA-MPJPE   MPI-INF MPJPE
PMCE         69.5         46.7            53.5          37.7             79.7
ARTS         67.7         46.5            51.6          36.6             71.8
HMRMamba-S   66.9         46.3            51.2          36.0             70.1
HMRMamba-L   64.8         45.5            49.3          35.7             68.3

(All errors in mm.)
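
For reference, the two primary joint-error metrics in the table can be computed with the standard, generic implementation below (this is not code from the paper); MPVPE applies the same per-point distance to the mesh vertices, and Accel typically compares second-order finite differences of the joint trajectories.

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean per-joint position error: mean Euclidean distance over joints and frames.
    pred, gt: (T, J, 3) arrays in the same units (typically mm)."""
    return np.linalg.norm(pred - gt, axis=-1).mean()

def pa_mpjpe(pred, gt):
    """Procrustes-aligned MPJPE: rigidly align (scale, rotation, translation)
    each frame's prediction to the ground truth before measuring error."""
    errs = []
    for X, Y in zip(pred, gt):                     # X, Y: (J, 3)
        muX, muY = X.mean(0), Y.mean(0)
        X0, Y0 = X - muX, Y - muY
        U, S, Vt = np.linalg.svd(X0.T @ Y0)        # optimal rotation via SVD
        if np.linalg.det(U @ Vt) < 0:              # avoid reflections
            Vt[-1] *= -1
            S[-1] *= -1
        R = (U @ Vt).T                             # rotation mapping X0 -> Y0
        s = S.sum() / (X0 ** 2).sum()              # optimal scale
        X_aligned = s * X0 @ R.T + muY
        errs.append(np.linalg.norm(X_aligned - Y, axis=-1).mean())
    return float(np.mean(errs))

pred, gt = np.random.randn(16, 17, 3), np.random.randn(16, 17, 3)
print(mpjpe(pred, gt), pa_mpjpe(pred, gt))
```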

HMRMamba consistently achieves lower (better) errors compared to prior SSM- or Transformer-based approaches, with comparable or reduced computational resources.

5.2 Ablations

  • 2D Pose Detector: Using higher-quality pose detectors (e.g., CPN, or ground-truth 2D poses) improves downstream mesh error. HMRMamba outperforms competitors regardless of the pose detector used.
  • Component Analysis: The combination of geometric alignment and both explicit and implicit motion representations yields the lowest errors, demonstrating the necessity of all submodules.
  • Efficiency: Ours-S uses 79.6M parameters and 7.88 GFLOPs (MPJPE = 51.2); Ours-L uses 89M parameters and 9.32 GFLOPs (MPJPE = 49.3). Both outperform PMCE and ARTS at comparable model sizes.

6. Distinctive Innovations and Theoretical Significance

HMRMamba pioneers the structured use of SSMs (in particular, Mamba) for long-sequence modeling in HMR, providing:

  • Geometry-Aware Lifting: Injection of kinematic priors and spatial context via dual-scan SSMs, yielding stable 3D anchors underpinning the regression process.
  • Motion-Guided Decoding: Explicit exploitation of both velocity and implicit corrections, mediated by attention, to ensure temporally smooth and plausible mesh reconstructions.
  • SSM-based Temporal Encoding: Mamba’s SSMs permit linear complexity with respect to sequence length, support global context, and bypass the vanishing/exploding gradient issues of RNNs.

7. Limitations and Future Research

The current instantiation of HMRMamba does not employ explicit bone-length or surface-based geometric constraints, instead relying on implicit kinematic priors embedded in its SSM modules. Suggested directions for future work include extending the approach to unsupervised or weakly supervised contexts, integrating explicit geometry losses, and further compressing SSM layers for real-time or mobile deployment (Chen et al., 29 Jan 2026).

References (1)
