
HMRMamba: Advanced 3D Mesh Recovery

Updated 5 February 2026
  • HMRMamba is a state-of-the-art video-based 3D human mesh recovery approach that leverages structured state space models and dual-scan modules.
  • It integrates geometry-aware lifting and motion-guided reconstruction to overcome challenges like occlusion and ambiguous poses, achieving lower MPJPE and smoother dynamics.
  • The framework employs efficient temporal encoding and attention mechanisms to deliver coherent 3D mesh outputs with reduced computational overhead.

HMRMamba is a state-of-the-art approach for video-based 3D Human Mesh Recovery (HMR) that leverages structured state space models (SSMs)—specifically, the Mamba architecture—to overcome persistent challenges in prior methods. By integrating geometry-aware and motion-guided modules, HMRMamba achieves robustness and efficiency in reconstructing temporally coherent, physically plausible human meshes from monocular video sequences (Chen et al., 29 Jan 2026).

1. Motivation and Problem Formulation

The domain of 3D human mesh recovery from video demands reconciling two dominant difficulties in prior art: (i) reliance on unreliable intermediate 3D pose anchors, which can propagate errors and lead to physically implausible mesh dynamics; and (ii) inability of existing sequence models (e.g., RNNs, Transformers) to effectively capture complex, long-range spatiotemporal dependencies while maintaining computational efficiency. Standard approaches often fail in the presence of occlusion, motion blur, or ambiguous pose configurations.

HMRMamba introduces a novel two-stage architecture grounded in SSMs. The efficiency and linear-time complexity of SSMs, especially Mamba’s construction, allow for scalable, global temporal modeling and direct encoding of sequential inductive biases, which are crucial in this context.

2. Architectural Overview

The HMRMamba pipeline consists of two distinct modules:

  1. Geometry-Aware Lifting Module: Given a sequence of $T$ video frames $\{I_t\}_{t=1}^{T}$, per-frame image features $F_{\mathrm{img}} \in \mathbb{R}^{T \times D}$ are extracted using a ResNet-50 backbone (with $D = 2048$). A 2D pose detector produces $P_{2D} \in \mathbb{R}^{T \times J \times 2}$. These are fused via an MLP, then processed in a spatial-temporal alignment SSM block termed “STA-Mamba” to output robust 3D joint sequences $P_{3D} \in \mathbb{R}^{T \times J \times 3}$.
  2. Motion-Guided Reconstruction Network: Utilizing $P_{3D}$ as a reliable anchor concatenated with the original $F_{\mathrm{img}}$, this stage processes explicit (joint-wise) and implicit (learned) motion sequences for mesh regression. The output is a temporally consistent mesh $V_{\mathrm{mesh}} \in \mathbb{R}^{T \times N \times 3}$, with $N = 6890$ SMPL vertices.
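
The tensor interface between the two stages can be summarized in a short, shape-level sketch. The function names below are hypothetical stand-ins for the paper's modules (the bodies are stubs), and $J = 17$ joints is an assumed skeleton size; the remaining shapes follow the definitions above.

```python
import torch

T, J, D, N = 16, 17, 2048, 6890   # frames, joints (assumed), feature dim, SMPL vertices

def geometry_aware_lifting(f_img, p_2d):
    """Stage 1 (hypothetical stub): fuse image features with 2D joints and
    lift them to a 3D joint sequence via the STA-Mamba block."""
    assert f_img.shape == (T, D) and p_2d.shape == (T, J, 2)
    return torch.zeros(T, J, 3)            # P_3D: robust 3D joint anchors

def motion_guided_reconstruction(p_3d, f_img):
    """Stage 2 (hypothetical stub): regress a temporally consistent SMPL mesh
    from the 3D anchors concatenated with the original image features."""
    assert p_3d.shape == (T, J, 3) and f_img.shape == (T, D)
    return torch.zeros(T, N, 3)            # V_mesh: per-frame SMPL vertices

f_img = torch.randn(T, D)                  # ResNet-50 features, one per frame
p_2d = torch.randn(T, J, 2)                # detected 2D keypoints
v_mesh = motion_guided_reconstruction(geometry_aware_lifting(f_img, p_2d), f_img)
print(v_mesh.shape)                        # torch.Size([16, 6890, 3])
```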

Both modules leverage Mamba-based SSMs. SSMs are parameterized as discrete-time linear systems (zero-order hold discretization) and implemented as global one-dimensional convolutions for efficient sequence modeling.
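
The following sketch illustrates this SSM machinery in isolation: a diagonal, single-channel state space layer is discretized with a zero-order hold and then evaluated both as a sequential scan and as the equivalent global 1D convolution. This is a generic, time-invariant SSM illustration (Mamba additionally makes the discretization step and projections input-dependent), not the paper's exact parameterization.

```python
import numpy as np

def zoh_discretize(A, B, delta):
    """Zero-order-hold discretization of a diagonal continuous-time SSM.
    A, B: (state_dim,) diagonal parameters; delta: step size."""
    A_bar = np.exp(delta * A)
    B_bar = (A_bar - 1.0) / A * B
    return A_bar, B_bar

def ssm_recurrence(u, A_bar, B_bar, C):
    """Sequential scan: x_t = A_bar * x_{t-1} + B_bar * u_t, y_t = <C, x_t>."""
    x = np.zeros_like(A_bar)
    ys = []
    for u_t in u:
        x = A_bar * x + B_bar * u_t
        ys.append(np.dot(C, x))
    return np.array(ys)

def ssm_convolution(u, A_bar, B_bar, C):
    """Equivalent global 1D convolution for fixed (time-invariant) parameters."""
    L = len(u)
    # Kernel K[k] = C . (A_bar^k * B_bar), k = 0..L-1
    K = np.array([np.dot(C, (A_bar ** k) * B_bar) for k in range(L)])
    return np.array([np.dot(K[: t + 1][::-1], u[: t + 1]) for t in range(L)])

rng = np.random.default_rng(0)
A = -np.abs(rng.standard_normal(8))        # stable (negative) diagonal A
B, C = rng.standard_normal(8), rng.standard_normal(8)
u = rng.standard_normal(32)                # a length-32 1D input sequence
A_bar, B_bar = zoh_discretize(A, B, delta=0.1)
y_scan = ssm_recurrence(u, A_bar, B_bar, C)
y_conv = ssm_convolution(u, A_bar, B_bar, C)
print(np.allclose(y_scan, y_conv))         # True: scan and convolution agree
```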

3. Core Components and Processing Flow

3.1 Geometry-Aware Lifting Module

  • Encoder Fusion: Image features $F_{\mathrm{img}}$ and detected 2D joints $P_{2D}$ are fused on a per-joint basis via a small MLP and concatenation.
  • STA-Mamba: Consists of the following sub-blocks:
    • Spatial Mamba processes intra-frame (structural) correlations.
    • Deformable Attention Alignment refines per-joint features by aggregating features at sampled, offset locations, with learned offsets and weights. Formally, for joint $i$:

      $F'_{\mathrm{spatial}}[i] = \sum_{m=1}^{M} W_m \left[\sum_{k=1}^{K} A_{m,i,k}\, \bigl(W'_m\, v(p_i + \Delta p_{m,i,k})\bigr)\right]$

      where $v(\cdot)$ samples image features, $\Delta p$ and $A$ are learned, $M$ denotes the number of attention heads, and $K$ the number of samples per head (a code sketch of this aggregation follows this list).
    • Temporal Mamba encodes sequential dynamics between frames.

  • Dual-Scan Mamba Block: For each SSM block, both global (sequence-indexed) and local (kinematic tree–ordered) scans are performed. Their outputs are fused as

$O_{\mathrm{fused}} = \sigma(\mathrm{Conv1D}(O_{\mathrm{global}})) \odot O_{\mathrm{local}}$

with $\sigma$ the SiLU activation and $\odot$ denoting elementwise multiplication.
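
Both formulas in this subsection can be made concrete with a minimal PyTorch sketch: a deformable-DETR-style reading of the per-joint aggregation, plus the gated dual-scan fusion. The tensor shapes, the [-1, 1] coordinate convention expected by grid_sample, and the assumption that the attention weights are already normalized over the $K$ samples are illustrative choices, not the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def deformable_aggregate(feat, ref_pts, offsets, attn, W_m, W_mp):
    """Deformable-attention-style aggregation (minimal sketch).
    feat:    (C, H, W) image feature map; v(.) samples it bilinearly
    ref_pts: (J, 2) reference points p_i in normalized [-1, 1] coordinates
    offsets: (M, J, K, 2) learned sampling offsets Delta p
    attn:    (M, J, K) attention weights A (softmax over K)
    W_m:     (M, C, C) per-head output projections
    W_mp:    (M, C, C) per-head value projections W'_m
    Returns  (J, C) refined per-joint features F'_spatial."""
    M, J, K, _ = offsets.shape
    C = feat.shape[0]
    # Sampling locations p_i + Delta p, flattened to a grid for grid_sample.
    loc = (ref_pts[None, :, None, :] + offsets).reshape(1, M * J * K, 1, 2)
    sampled = F.grid_sample(feat[None], loc, mode="bilinear",
                            align_corners=False)           # (1, C, M*J*K, 1)
    v = sampled.reshape(C, M, J, K).permute(1, 2, 3, 0)     # (M, J, K, C)
    v = torch.einsum("mcd,mjkd->mjkc", W_mp, v)             # W'_m v(.)
    head = (attn[..., None] * v).sum(dim=2)                 # sum over K samples
    return torch.einsum("mcd,mjd->jc", W_m, head)           # sum over heads

# Dual-scan fusion: O_fused = SiLU(Conv1D(O_global)) * O_local (elementwise)
def dual_scan_fuse(o_global, o_local, conv1d):
    """o_global, o_local: (B, C, L) scan outputs; conv1d: an nn.Conv1d over L."""
    return F.silu(conv1d(o_global)) * o_local

# Example shapes: J = 17 joints, C = 256 channels, M = 4 heads, K = 4 samples.
C, H, W, J, M, K = 256, 56, 56, 17, 4, 4
out = deformable_aggregate(
    torch.randn(C, H, W), torch.rand(J, 2) * 2 - 1,
    0.1 * torch.randn(M, J, K, 2), torch.softmax(torch.randn(M, J, K), dim=-1),
    torch.randn(M, C, C) / C ** 0.5, torch.randn(M, C, C) / C ** 0.5)
fused = dual_scan_fuse(torch.randn(1, C, J), torch.randn(1, C, J),
                       nn.Conv1d(C, C, kernel_size=3, padding=1))
print(out.shape, fused.shape)   # torch.Size([17, 256]) torch.Size([1, 256, 17])
```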

3.2 Motion-Guided Reconstruction Network

  • Motion Representations: Explicit motion signal $M_{\mathrm{exp}}^{(t)} = P_{3D}^{(t)} - P_{3D}^{(t-1)}$; implicit motion is a correction embedding.

  • Motion-Aware Attention: Queries are drawn from image features, while keys/values originate from both explicit and implicit motion representations.

$F_{\mathrm{motion}} = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$

  • Mesh Regression Head: Decodes $F_{\mathrm{motion}}$ to vertex positions $V_{\mathrm{mesh}}$.
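
A minimal sketch of this motion-guided stage is given below: explicit motion is the frame-to-frame difference of the lifted 3D joints, and a single-head cross-attention draws queries from image tokens and keys/values from the motion tokens. The token dimension, the single-head form, the joint count, and the way explicit/implicit tokens are concatenated are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class MotionAwareAttention(nn.Module):
    """Minimal sketch: cross-attention where queries come from image features
    and keys/values come from explicit + implicit motion tokens."""
    def __init__(self, dim):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)

    def forward(self, f_img, m_exp, m_imp):
        # f_img: (T, dim) image tokens; m_exp, m_imp: (T, dim) motion tokens
        motion = torch.cat([m_exp, m_imp], dim=0)           # (2T, dim)
        Q = self.q_proj(f_img)                              # (T, dim)
        K, V = self.k_proj(motion), self.v_proj(motion)     # (2T, dim)
        attn = torch.softmax(Q @ K.t() / K.shape[-1] ** 0.5, dim=-1)
        return attn @ V                                     # F_motion: (T, dim)

# Explicit motion as per-frame differences of the lifted 3D joints.
T, J, dim = 16, 17, 256                                     # J and dim assumed
p_3d = torch.randn(T, J, 3)
m_exp = torch.zeros_like(p_3d)
m_exp[1:] = p_3d[1:] - p_3d[:-1]                            # M_exp^(t)
m_exp_tok = nn.Linear(J * 3, dim)(m_exp.reshape(T, -1))     # embed to tokens
m_imp_tok = torch.randn(T, dim)                             # learned correction (stub)
f_img_tok = torch.randn(T, dim)
f_motion = MotionAwareAttention(dim)(f_img_tok, m_exp_tok, m_imp_tok)
print(f_motion.shape)                                       # torch.Size([16, 256])
```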

4. Training, Loss Functions, and Implementation

  • Pose-Lifting Stage Loss:

$\mathcal{L}_{\mathrm{pose}} = \mathcal{L}_{3D} + \lambda_t \mathcal{L}_t + \lambda_m \mathcal{L}_m + \lambda_{2D} \mathcal{L}_{2D}$

where $\mathcal{L}_{3D}$ is the mean per-joint position error (MPJPE), $\mathcal{L}_t$ is a temporal consistency term, $\mathcal{L}_m$ is the mean per-joint velocity error (MPJVE), $\mathcal{L}_{2D}$ is the 2D projection error, and $(\lambda_t, \lambda_m, \lambda_{2D}) = (0.5, 20, 0.5)$.

  • Mesh Recovery Loss:

$\mathcal{L}_{\mathrm{mesh}} = \lambda_m \mathcal{L}_{\mathrm{meshV}} + \lambda_j \mathcal{L}_{\mathrm{joint3D}} + \lambda_n \mathcal{L}_{\mathrm{normal}} + \lambda_e \mathcal{L}_{\mathrm{edge}}$

with $(\lambda_m, \lambda_j, \lambda_n, \lambda_e) = (1, 1, 0.1, 20)$.

  • Implementation:
    • ResNet-50 backbone initialized with SPIN-pretrained weights.
    • 2D pose detectors: CPN for Human3.6M, ViTPose for 3DPW/MPI-INF-3DHP.
    • Sequence length $T = 16$, stride 4.
    • SSM modules: 3-layer (“Ours-S”, 256-dim) or 5-layer (“Ours-L”); co-evolution blocks: 3 layers, dim=64.
    • Trained on RTX 4090.
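
As a concrete illustration of the loss weighting above, the sketch below composes the two stage objectives with the stated coefficients. The individual loss terms are passed in as precomputed scalars (in practice each would be an L1/L2-style error computed from predictions and ground truth), so the function names and stubbed example values are illustrative only.

```python
import torch

def pose_lifting_loss(l_3d, l_t, l_m, l_2d,
                      lambda_t=0.5, lambda_m=20.0, lambda_2d=0.5):
    """L_pose = L_3D + lam_t*L_t + lam_m*L_m + lam_2D*L_2D (weights from Section 4)."""
    return l_3d + lambda_t * l_t + lambda_m * l_m + lambda_2d * l_2d

def mesh_recovery_loss(l_mesh_v, l_joint3d, l_normal, l_edge,
                       lambda_m=1.0, lambda_j=1.0, lambda_n=0.1, lambda_e=20.0):
    """L_mesh = lam_m*L_meshV + lam_j*L_joint3D + lam_n*L_normal + lam_e*L_edge."""
    return (lambda_m * l_mesh_v + lambda_j * l_joint3d
            + lambda_n * l_normal + lambda_e * l_edge)

# Example with stubbed per-term values.
mpjpe, l_t, mpjve, l_2d = (torch.tensor(x) for x in (0.05, 0.01, 0.002, 0.03))
print(pose_lifting_loss(mpjpe, l_t, mpjve, l_2d))
```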

5. Quantitative Results and Ablation Analysis

5.1 Quantitative Results

Performance is reported on 3DPW, MPI-INF-3DHP, and Human3.6M. Key metrics include:

  • MPJPE: mean per-joint position error
  • PA-MPJPE: Procrustes-aligned MPJPE
  • MPVPE: per-vertex error
  • Accel: joint acceleration error

Model        3DPW MPJPE   3DPW PA-MPJPE   H3.6M MPJPE   H3.6M PA-MPJPE   MPI-INF MPJPE
PMCE         69.5         46.7            53.5          37.7             79.7
ARTS         67.7         46.5            51.6          36.6             71.8
HMRMamba-S   66.9         46.3            51.2          36.0             70.1
HMRMamba-L   64.8         45.5            49.3          35.7             68.3

(All errors in mm.)
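
For reference, the two primary joint-error metrics in the table can be computed with the standard, generic implementation below (this is not code from the paper); MPVPE applies the same per-point distance to the mesh vertices, and Accel typically compares second-order finite differences of the joint trajectories.

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean per-joint position error: mean Euclidean distance over joints and frames.
    pred, gt: (T, J, 3) arrays in the same units (typically mm)."""
    return np.linalg.norm(pred - gt, axis=-1).mean()

def pa_mpjpe(pred, gt):
    """Procrustes-aligned MPJPE: rigidly align (scale, rotation, translation)
    each frame's prediction to the ground truth before measuring error."""
    errs = []
    for X, Y in zip(pred, gt):                     # X, Y: (J, 3)
        muX, muY = X.mean(0), Y.mean(0)
        X0, Y0 = X - muX, Y - muY
        U, S, Vt = np.linalg.svd(X0.T @ Y0)        # optimal rotation via SVD
        if np.linalg.det(U @ Vt) < 0:              # avoid reflections
            Vt[-1] *= -1
            S[-1] *= -1
        R = (U @ Vt).T                             # rotation mapping X0 -> Y0
        s = S.sum() / (X0 ** 2).sum()              # optimal scale
        X_aligned = s * X0 @ R.T + muY
        errs.append(np.linalg.norm(X_aligned - Y, axis=-1).mean())
    return float(np.mean(errs))

pred, gt = np.random.randn(16, 17, 3), np.random.randn(16, 17, 3)
print(mpjpe(pred, gt), pa_mpjpe(pred, gt))
```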

HMRMamba consistently achieves lower (better) errors compared to prior SSM- or Transformer-based approaches, with comparable or reduced computational resources.

5.2 Ablations

  • 2D Pose Detector: Using higher-quality pose detectors (e.g., CPN, or ground-truth 2D poses) improves downstream mesh error. HMRMamba outperforms competitors regardless of the pose detector used.
  • Component Analysis: The combination of geometric alignment and both explicit and implicit motion representations yields the lowest errors, demonstrating the necessity of all submodules.
  • Efficiency: Ours-S uses 79.6M parameters and 7.88 GFLOPs (MPJPE = 51.2); Ours-L uses 89M parameters and 9.32 GFLOPs (MPJPE = 49.3). Both outperform PMCE and ARTS at comparable model sizes.

6. Distinctive Innovations and Theoretical Significance

HMRMamba pioneers the structured use of SSMs (in particular, Mamba) for long-sequence modeling in HMR, providing:

  • Geometry-Aware Lifting: Injection of kinematic priors and spatial context via dual-scan SSMs, yielding stable 3D anchors underpinning the regression process.
  • Motion-Guided Decoding: Explicit exploitation of both velocity and implicit corrections, mediated by attention, to ensure temporally smooth and plausible mesh reconstructions.
  • SSM-based Temporal Encoding: Mamba’s SSMs permit linear complexity with respect to sequence length, support global context, and bypass the vanishing/exploding gradient issues of RNNs.

7. Limitations and Future Research

The current instantiation of HMRMamba does not employ explicit bone-length or surface-based geometric constraints, instead relying on implicit kinematic priors embedded in its SSM modules. Suggested directions for future work include extending the approach to unsupervised or weakly supervised contexts, integrating explicit geometry losses, and further compressing SSM layers for real-time or mobile deployment (Chen et al., 29 Jan 2026).

References (1)
