
HandMCM: Multi-Modal 3D Hand Pose Estimation

  • The paper introduces a Mamba-based correspondence state-space model (SSM) that dynamically refines 3D hand keypoint positions.
  • HandMCM integrates multi-modal inputs by projecting 2D features from RGB and depth onto 3D super points derived via set convolutions.
  • Empirical results show superior performance with reduced error on benchmarks like NYU, DexYCB, and HO3D, especially in occlusion scenarios.

HandMCM refers to the "Multi-modal Point Cloud–based Correspondence State Space Model for 3D Hand Pose Estimation," a framework that introduces a state-space model (SSM), specifically adapted from the Mamba architecture, for robust and accurate 3D hand pose estimation from multi-modal input data. The HandMCM approach directly targets the longstanding challenge of precise hand keypoint estimation under self- and object-occlusions, leveraging deep geometric learning, multi-modal fusion, and dynamic correspondence modeling to set new performance benchmarks across several standard datasets (Cheng et al., 2 Feb 2026).

1. Architectural Foundations and Network Design

HandMCM is architected to process single-hand RGB images ($R \in \mathbb{R}^{H \times W \times 3}$), paired depth maps ($D \in \mathbb{R}^{H \times W}$), and dense 3D point clouds ($P \in \mathbb{R}^{N \times 3}$) sampled from the depth image. The multi-modal super point encoder combines these modalities by:

  • Downsampling the input point cloud using a PointNet++-style set convolution, deriving $N' = N/2$ "super points" $P' \in \mathbb{R}^{N' \times 3}$ with features $F_p \in \mathbb{R}^{N' \times C_p}$.
  • Extracting 2D features from depth and RGB images using respective ResNet-based autoencoders, yielding feature maps $F_d \in \mathbb{R}^{H/2 \times W/2 \times C_d}$ and $F_{rgb} \in \mathbb{R}^{H/2 \times W/2 \times C_{rgb}}$.
  • Projecting these 2D features onto the super points via geometric interpolation, producing $F_{d \rightarrow p}$ and $F_{rgb \rightarrow p}$ (see the sketch after this list).
  • Concatenating all modalities on the super point domain: $F = [F_p \,\|\, F_{d \rightarrow p} \,\|\, F_{rgb \rightarrow p}]$.
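
The projection step is what ties the 2D and 3D branches together. The following PyTorch sketch shows one plausible realization of the `interpolate_to_3d` operation; the pinhole intrinsics, normalization convention, and bilinear `grid_sample` interpolation are assumptions rather than details taken from the paper:

```python
import torch
import torch.nn.functional as F_nn

def interpolate_to_3d(feat_2d, super_points, intrinsics, img_hw):
    """Sample 2D encoder features at the image projections of the 3D super
    points (a sketch; the paper's exact interpolation scheme may differ).

    feat_2d:      (C, H/2, W/2) feature map from the depth or RGB encoder.
    super_points: (N', 3) super point coordinates in camera space.
    intrinsics:   (fx, fy, cx, cy) pinhole camera parameters (assumed).
    img_hw:       (H, W) full-resolution image size.
    Returns:      (N', C) per-super-point features.
    """
    fx, fy, cx, cy = intrinsics
    H_img, W_img = img_hw
    x, y, z = super_points.unbind(dim=-1)
    u = fx * x / z + cx                       # pixel coordinates in the full image
    v = fy * y / z + cy
    # Normalize to [-1, 1]; grid_sample works in normalized image space, so the
    # same coordinates remain valid on the half-resolution feature map.
    grid = torch.stack([2 * u / (W_img - 1) - 1,
                        2 * v / (H_img - 1) - 1], dim=-1).view(1, 1, -1, 2)
    sampled = F_nn.grid_sample(feat_2d.unsqueeze(0), grid, align_corners=True)
    return sampled.squeeze(0).squeeze(1).t()  # (N', C)
```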

Global aggregation via set convolutions yields a global representation $G$, which is replicated and shifted via a bias-induced layer (BIL) to produce the initial set of per-keypoint tokens $X_0$ and a first estimate of the 3D keypoint positions $J_0$. The main predictive backbone consists of $K$ stacked Correspondence Mamba blocks, each of which iteratively refines the token representations and regresses updated 3D keypoint coordinates.
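
This summary does not spell out the BIL in detail, so the sketch below assumes one plausible reading: the global vector $G$ is replicated once per keypoint and shifted by a learned per-keypoint bias.

```python
import torch
import torch.nn as nn

class BiasInducedLayer(nn.Module):
    """Replicate the global feature G once per keypoint and shift each copy by
    a learned per-keypoint bias, producing the initial tokens X_0.
    (A plausible reading of the BIL; the exact form is an assumption.)"""

    def __init__(self, num_keypoints: int, dim: int):
        super().__init__()
        self.bias = nn.Parameter(torch.zeros(num_keypoints, dim))

    def forward(self, G: torch.Tensor) -> torch.Tensor:
        # G: (B, C) global representation -> X_0: (B, J, C) per-keypoint tokens
        return G.unsqueeze(1) + self.bias
```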

2. Correspondence State-Space Modeling

At the core of HandMCM are the bidirectional, gated SSM ("BiGS") blocks inspired by the Mamba architecture, reinterpreted for hand pose correspondence. Instead of modeling keypoints as graph nodes, HandMCM forms a one-dimensional sequence ("scan path") of tokens representing each keypoint.

Within each block $k$, both forward ($X_f$) and backward-reversed ($X_b$) token streams are processed independently by the SSM:

  • Tokens are normalized and projected to produce a value stream ($V$) and two directional streams ($X_f$, $X_b$).
  • Each stream is processed by a learned SSM, yielding outputs $U_f$ and $U_b$.
  • An outer product $U_f \otimes \mathrm{Reverse}(U_b)$, projected by $W_c$, defines a correspondence map $M_{corr} \in \mathbb{R}^{J \times J}$ encoding dynamic spatial dependencies across keypoints (see the sketch after this list).
  • The updated token representation $X_k$ is obtained by linearly mixing $V$ according to $M_{corr}$; positions $J_k$ are regressed via a shared linear head.
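
A minimal PyTorch sketch of the outer-product correspondence map and value mixing described above; the projection shape of $W_c$ and the softmax normalization of $M_{corr}$ are assumptions:

```python
import torch

def correspondence_mix(U_f, U_b, V, W_c):
    """Build the J x J correspondence map from the two SSM streams and use it
    to mix the value stream.

    U_f, U_b: (B, J, C) forward and backward SSM outputs.
    V:        (B, J, C) value stream.
    W_c:      (C,) projection collapsing the feature axis (shape assumed).
    """
    U_b_rev = torch.flip(U_b, dims=[1])          # undo the backward scan order
    # Token-wise outer product, projected over features: (B, J, J).
    M_corr = torch.einsum('bic,bjc,c->bij', U_f, U_b_rev, W_c)
    M_corr = M_corr.softmax(dim=-1)              # normalization is an assumption
    return M_corr @ V                            # (B, J, C) mixed tokens X_k
```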

A local token injection and filtering mechanism further enhances estimation: for each keypoint prediction $j_{k,j}$, its $K$ nearest super points are retrieved, concatenated with both global and local features, and encoded by a small set-conv network, producing a local token $x_{loc,j}$. This local information is injected via a multiplicative modified LayerNorm, with a learned gate $G$ adaptively blending the global SSM update and the local prediction for each keypoint.
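
A hedged sketch of the local token retrieval, assuming a Euclidean $k$-NN lookup and treating the small set-conv encoder as a given module; the neighborhood composition is an illustrative guess at how geometry, local features, and the global token are concatenated:

```python
import torch

def local_token(P_prime, F, j_pred, x_token, set_conv, k=16):
    """Retrieve the k nearest super points to a predicted keypoint and encode
    them into a local token (k and the encoder input layout are assumptions).

    P_prime: (N', 3) super point coordinates.
    F:       (N', C) fused super point features.
    j_pred:  (3,)  current 3D estimate for one keypoint.
    x_token: (C,)  the keypoint's current global token.
    """
    dists = torch.cdist(j_pred.view(1, 1, 3), P_prime.unsqueeze(0)).squeeze()  # (N',)
    idx = dists.topk(k, largest=False).indices    # indices of the k nearest super points
    neighborhood = torch.cat(
        [P_prime[idx] - j_pred,                   # relative local geometry
         F[idx],                                  # local fused features
         x_token.expand(k, -1)], dim=-1)          # global token broadcast to neighbors
    return set_conv(neighborhood)                 # small set-conv -> local token x_loc
```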

3. Multi-Modal Feature Fusion Mechanisms

HandMCM distinguishes itself among 3D hand pose architectures by explicitly fusing multiple sensing modalities before correspondence modeling:

  • 3D geometric features extracted via point cloud set convolution summarize hand surface geometry.
  • 2D spatial and semantic appearance descriptors are extracted from both depth and RGB streams, providing information robust to occlusion and appearance variation.
  • Through rigorous spatial alignment (2D→3D projection), per-super-point feature vectors are assembled, unifying local geometry, appearance, and depth cues in a manner that guides both global state-space reasoning and local keypoint refinement.

This fused representation is resilient to missing or occluded modalities and underpins HandMCM's robustness in occlusion scenarios.

4. Training, Optimization, and Loss Structure

HandMCM is trained using a block-wise multi-stage supervision strategy. For each predicted set of keypoints $J_k$ at stage $k$ (including the initial estimate $k=0$), the loss is computed as the sum over all keypoints of the smooth-L1 discrepancy with the corresponding ground truth $j^*_j$:

$$\mathcal{L} = \sum_{k=0}^{K} \sum_{j=1}^{J} \mathrm{smooth}_{L1}\!\left(j_{k,j} - j^*_j\right)$$

where

$$\mathrm{smooth}_{L1}(x) = \begin{cases} 0.5x^2/0.01, & |x| < 0.01 \\ |x| - 0.005, & \text{otherwise} \end{cases}$$

Training employs the AdamW optimizer with hyperparameters $\beta_1 = 0.5$, $\beta_2 = 0.999$, a learning rate of $1 \times 10^{-3}$, and a batch size of 32. Data augmentation includes extensive 3D geometric perturbations (random rotations, scaling, and translations) to improve generalization.
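
The loss and optimizer translate directly into PyTorch. Note that `smooth_l1_loss` with `beta=0.01` uses the quadratic region $0.5x^2/0.01$ below the threshold, matching the piecewise definition above:

```python
import torch
import torch.nn.functional as F_nn

def multi_stage_loss(predictions, gt):
    """Block-wise supervision: smooth-L1 summed over every stage's keypoints,
    including the initial estimate J_0.

    predictions: list of (B, J, 3) keypoint estimates [J_0, ..., J_K].
    gt:          (B, J, 3) ground-truth keypoints.
    """
    return sum(F_nn.smooth_l1_loss(J_k, gt, beta=0.01, reduction='sum')
               for J_k in predictions)

# Optimizer setup as described above (model is assumed to exist).
# optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, betas=(0.5, 0.999))
```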

5. Empirical Evaluation and Benchmark Analysis

HandMCM's empirical performance has been established across three public datasets:

| Dataset | Input Modalities | Keypoints ($J$) | Metric | HandMCM Error | Prior Best (Method) |
|---|---|---|---|---|---|
| NYU | D | 14 | MKE (mm) | 7.06 | 7.12 (HandDAGT) |
| DexYCB | RGBD | 21 | MKE (mm) | 6.67 | 7.54 (K-Fusion) |
| HO3D | RGBD | 21 | MKE (cm) | 1.71 | 1.79 (K-Fusion) |

Results indicate consistently superior accuracy, particularly in challenging occlusion regimes. Ablation studies confirm that the correspondence SSM, together with local token injection and filtering, yields the most substantial gains (from 8.47 mm down to 7.06 mm on NYU). In contrast, standard graph-guided or regular SSMs yield suboptimal performance. Additionally, experiments determine that a depth of three stacked Mamba blocks achieves the best balance between representational capacity and overfitting risk.
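
MKE here denotes mean keypoint error, i.e. the Euclidean distance between predicted and ground-truth keypoints averaged over keypoints and frames; a minimal sketch, assuming that averaging convention:

```python
import torch

def mean_keypoint_error(pred, gt):
    """Mean keypoint error: average Euclidean distance between predicted and
    ground-truth 3D keypoints, in the dataset's units (mm or cm).

    pred, gt: (B, J, 3) tensors of keypoint coordinates.
    """
    return (pred - gt).norm(dim=-1).mean()
```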

6. Algorithmic Pseudocode

A high-level, Python-style summary of the HandMCM pipeline; placeholder names such as `set_conv_downsample` and `SSM_fwd` stand in for the modules described above:

```python
# Multi-modal super point encoding
P_prime, F_p = set_conv_downsample(P)         # PointNet++-style set convolution

F_d = depth_encoder(D)                        # ResNet-based depth features
F_rgb = rgb_encoder(R)                        # ResNet-based RGB features

F_d_p = interpolate_to_3d(F_d, P_prime)       # project 2D features onto super points
F_rgb_p = interpolate_to_3d(F_rgb, P_prime)

F = concat([F_p, F_d_p, F_rgb_p])             # fused per-super-point features

# Initial tokens and keypoint estimate
G = set_conv_pool(F)                          # global representation
X = bias_induced_layer(G)                     # initial per-keypoint tokens X_0
J_pred = linear_head(X)                       # initial keypoints J_0
predictions = [J_pred]

# K stacked Correspondence Mamba blocks
for _ in range(K):
    V, X_f, X_b = preprocess(X)               # value and directional streams
    U_f = SSM_fwd(X_f)                        # forward scan
    U_b = SSM_bwd(X_b)                        # backward scan
    M_corr = linear_outer(U_f, reverse(U_b))  # J x J correspondence map
    X_global = M_corr @ V                     # globally mixed tokens

    # Local token injection and filtering
    X_next, J_next = [], []
    for j in range(J):
        x_loc = local_token(P_prime, F, J_pred[j], X[j])  # k-NN set-conv encoding
        x_tilde = layer_norm(X[j] * x_loc)    # multiplicative LayerNorm injection
        gate = sigmoid(x_loc)                 # learned filtering gate
        x_j = gate * X_global[j] + (1 - gate) * x_tilde
        X_next.append(x_j)                    # token carried into the next block
        J_next.append(x_j @ W_r)              # shared linear regression head
    X, J_pred = stack(X_next), stack(J_next)
    predictions.append(J_pred)

# Block-wise multi-stage supervision over stages k = 0..K
loss = sum(smooth_l1(J_k - J_gt).sum() for J_k in predictions)
backprop_and_update(loss)
```

7. Contributions, Limitations, and Future Directions

Key contributions of HandMCM include:

  1. The introduction of the first Mamba-based correspondence state-space model for 3D hand pose estimation, providing dynamic, context-sensitive modeling of kinematic correspondences.
  2. A local token injection and filtering scheme, ensuring both global and fine-scale geometric alignment per keypoint.
  3. Rigorous multi-modal fusion of depth, RGB, and point cloud sources at the super point level, supporting occlusion robustness.
  4. Demonstration of state-of-the-art accuracy on large-scale hand pose benchmarks through extensive ablation and comparative analysis.

Limitations are primarily in the scope of supported interactions; the current HandMCM operates only in the single-hand (or hand-object) domain, with bi-manual and hand-hand interaction scenarios remaining open. Potential future directions include extending correspondence modeling across hands, dynamic handling of closely interacting entities, and exploration of more parameter- and compute-efficient SSM variants for real-time deployment in AR/VR applications (Cheng et al., 2 Feb 2026).

References

  1. Cheng et al. "Multi-modal Point Cloud–based Correspondence State Space Model for 3D Hand Pose Estimation." 2 February 2026.
