HandMCM: Multi-Modal 3D Hand Pose Estimation
- The paper introduces a Mamba-based correspondence state-space model (SSM) that dynamically refines 3D hand keypoint positions.
- HandMCM integrates multi-modal inputs by projecting 2D features from RGB and depth onto 3D super points derived via set convolutions.
- Empirical results show superior performance with reduced error on benchmarks like NYU, DexYCB, and HO3D, especially in occlusion scenarios.
HandMCM refers to the "Multi-modal Point Cloud–based Correspondence State Space Model for 3D Hand Pose Estimation," a framework that introduces a state-space model (SSM), specifically adapted from the Mamba architecture, for robust and accurate 3D hand pose estimation from multi-modal input data. The HandMCM approach directly targets the longstanding challenge of precise hand keypoint estimation under self- and object-occlusions, leveraging deep geometric learning, multi-modal fusion, and dynamic correspondence modeling to set new performance benchmarks across several standard datasets (Cheng et al., 2 Feb 2026).
1. Architectural Foundations and Network Design
HandMCM is architected to process a single-hand RGB image `R`, a paired depth map `D`, and a dense 3D point cloud `P` sampled from the depth image. The multi-modal super point encoder combines these modalities by:
- Downsampling the input point cloud with a PointNet++-style set convolution, deriving "super points" `P'` with geometric features `F_p`.
- Extracting 2D features from the depth and RGB images using respective ResNet-based autoencoders, yielding feature maps `F_d` and `F_rgb`.
- Projecting these 2D features onto the super points via geometric interpolation, producing `F_d_p` and `F_rgb_p`.
- Concatenating all modalities on the super point domain: `F = concat([F_p, F_d_p, F_rgb_p])`.
Global aggregation via set convolutions yields a global representation `G`, which is replicated and shifted via a bias-induced layer (BIL) to produce the initial set of per-keypoint tokens `X_0` and a first estimate of the 3D keypoint positions `J_0`. The main predictive backbone consists of stacked Correspondence Mamba blocks, each of which iteratively refines the token representations and regresses updated 3D keypoint coordinates.
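The encoder pipeline above can be sketched with numpy. This is a shape-level illustration only: the sizes, the max-pooling stand-in for set-conv pooling, and the random projection matrices are assumptions, not the paper's learned components.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: S super points, J keypoints, per-modality channels.
S, J = 128, 21
C_p, C_d, C_rgb, C_tok = 64, 32, 32, 128

F_p = rng.standard_normal((S, C_p))        # super-point geometry features (set conv)
F_d_p = rng.standard_normal((S, C_d))      # depth features projected to super points
F_rgb_p = rng.standard_normal((S, C_rgb))  # RGB features projected to super points

# Concatenate all modalities on the super-point domain.
F = np.concatenate([F_p, F_d_p, F_rgb_p], axis=-1)   # (S, C_p + C_d + C_rgb)

# Global aggregation (max pooling stands in for set-conv pooling).
G = F.max(axis=0)                                     # (C_p + C_d + C_rgb,)

# Bias-induced layer: replicate G for every keypoint, shift by per-keypoint biases.
W = rng.standard_normal((F.shape[-1], C_tok)) * 0.01  # stand-in projection
B = rng.standard_normal((J, C_tok)) * 0.01            # per-keypoint learned biases
X_0 = np.tile(G @ W, (J, 1)) + B                      # (J, C_tok) initial tokens

W_head = rng.standard_normal((C_tok, 3)) * 0.01
J_0 = X_0 @ W_head                                    # (J, 3) initial keypoints
print(X_0.shape, J_0.shape)
```

The key property the BIL provides is that every keypoint token starts from the same global summary `G` but is individuated by its own learned bias.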
2. Correspondence State-Space Modeling
At the core of HandMCM are the bidirectional, gated SSM ("BiGS") blocks inspired by the Mamba architecture, reinterpreted for hand pose correspondence. Instead of modeling keypoints as graph nodes, HandMCM forms a one-dimensional sequence ("scan path") of tokens representing each keypoint.
Within each block `k`, both the forward (`X_f`) and backward-reversed (`X_b`) token streams are processed independently by the SSM:
- Tokens are normalized and projected to produce a value stream `V` and the directional streams `X_f` and `X_b`.
- Each stream is processed by a learned SSM, yielding outputs `U_f` and `U_b`.
- An outer product of `U_f` and the re-reversed `U_b`, passed through a linear projection, defines a correspondence map `M_corr` encoding dynamic spatial dependencies across keypoints.
- The updated token representation is obtained by linearly mixing `V` according to `M_corr`; keypoint positions are regressed via a shared linear head.
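The correspondence-map mixing step can be sketched as follows. The sizes and the row-wise softmax normalization are illustrative assumptions (the paper specifies only a linear projection of the outer product); the point is the data flow from pairwise token interactions to a `J × J` mixing matrix.

```python
import numpy as np

rng = np.random.default_rng(1)
J, C = 21, 64                     # keypoints, token channels (illustrative sizes)

V = rng.standard_normal((J, C))   # value stream
U_f = rng.standard_normal((J, C)) # forward SSM output
U_b = rng.standard_normal((J, C)) # backward SSM output (already re-reversed)

# Channel-wise outer product across the keypoint axis, projected to a J x J map.
W_m = rng.standard_normal((C, 1)) * 0.1
M_raw = np.einsum('ic,jc->ijc', U_f, U_b)      # (J, J, C) pairwise interactions
M_corr = (M_raw @ W_m).squeeze(-1)             # (J, J) correspondence map

# Row-wise softmax (an assumption) so each keypoint takes a convex mix of values.
M_corr = np.exp(M_corr - M_corr.max(axis=-1, keepdims=True))
M_corr /= M_corr.sum(axis=-1, keepdims=True)

X_k = M_corr @ V                               # (J, C) updated tokens
print(X_k.shape)
```

Unlike a fixed kinematic graph, `M_corr` is recomputed per input, so the effective keypoint connectivity adapts to the observed hand configuration.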
A local token injection and filtering mechanism further sharpens the estimate: for each keypoint prediction `J_{k-1,j}`, its nearest super points are retrieved, concatenated with both global and local features, and encoded by a small set-conv network, producing a local token `X_loc_j`. This local information is injected via a multiplicative modified LayerNorm, with a learned gate adaptively blending the global SSM update and the local prediction per keypoint.
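The injection-and-gating step described above can be sketched in numpy. The random stand-ins for the three token streams and the choice of a sigmoid gate derived from the local token are assumptions consistent with the pseudocode in Section 6, not the exact parameterization.

```python
import numpy as np

rng = np.random.default_rng(2)
J, C = 21, 64

X_prev = rng.standard_normal((J, C))   # tokens from the previous block
X_glob = rng.standard_normal((J, C))   # global SSM update (M_corr @ V)
X_loc = rng.standard_normal((J, C))    # local tokens from nearby super points

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    sd = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sd + eps)

# Multiplicative modified LayerNorm: the local token modulates the normalized stream.
X_mod = layer_norm(X_prev) * X_loc

# Learned gate (sigmoid of the local token here) blends the global and local paths.
gate = 1.0 / (1.0 + np.exp(-X_loc))
X_out = gate * X_glob + (1.0 - gate) * X_mod
print(X_out.shape)
```

The gate lets each keypoint decide, per channel, whether to trust the global correspondence update or the geometry retrieved around its current 3D estimate — useful when part of the hand is occluded.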
3. Multi-Modal Feature Fusion Mechanisms
HandMCM distinguishes itself among 3D hand pose architectures by explicitly fusing multiple sensing modalities before correspondence modeling:
- 3D geometric features extracted via point cloud set convolution summarize hand surface geometry.
- 2D spatial and semantic appearance descriptors are extracted from both depth and RGB streams, providing information robust to occlusion and appearance variation.
- Through rigorous spatial alignment (2D→3D projection), per-super-point feature vectors are assembled, unifying local geometry, appearance, and depth cues in a manner that guides both global state-space reasoning and local keypoint refinement.
This rich fused representation provides high resilience to missing or occluded modalities and underpins HandMCM's superior robustness to occlusion scenarios.
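The 2D→3D projection underlying this fusion can be sketched as follows. The pinhole intrinsics, the feature-map size, and the nearest-neighbor sampling are all illustrative assumptions (the paper uses geometric interpolation, which bilinear sampling would approximate more closely).

```python
import numpy as np

rng = np.random.default_rng(3)
H, W, C = 32, 32, 16              # feature-map size and channels (illustrative)
S = 128                           # number of super points
fx = fy = 500.0                   # assumed pinhole focal lengths (pixels)
cx, cy = W / 2, H / 2             # assumed principal point

F_2d = rng.standard_normal((H, W, C))                 # 2D feature map (depth or RGB)
P_sup = rng.uniform(-0.1, 0.1, (S, 3)) + [0, 0, 0.5]  # super points (meters)

# Project each 3D super point into the image plane.
u = fx * P_sup[:, 0] / P_sup[:, 2] + cx
v = fy * P_sup[:, 1] / P_sup[:, 2] + cy

# Nearest-neighbor sampling of the 2D features at the projected locations.
ui = np.clip(np.round(u).astype(int), 0, W - 1)
vi = np.clip(np.round(v).astype(int), 0, H - 1)
F_proj = F_2d[vi, ui]                                 # (S, C) per-super-point features
print(F_proj.shape)
```

After this step each super point carries aligned geometric, depth, and appearance channels, so downstream reasoning never has to re-associate pixels with 3D locations.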
4. Training, Optimization, and Loss Structure
HandMCM is trained using a block-wise multi-stage supervision strategy. For each predicted set of keypoints `J_k` at stage `k` (including the initial estimate `J_0`), the loss is computed as the sum over all keypoints of the smooth-L1 discrepancy with the corresponding ground truth `J^gt_j`:

$$\mathcal{L} = \sum_{k=0}^{K} \sum_{j=1}^{J} \mathrm{SmoothL1}\left(J_{k,j} - J^{gt}_{j}\right),$$

where

$$\mathrm{SmoothL1}(x) = \begin{cases} 0.5\,x^{2} & \text{if } |x| < 1, \\ |x| - 0.5 & \text{otherwise.} \end{cases}$$
Training employs the AdamW optimizer with a batch size of 32. Data augmentation includes extensive 3D geometric perturbations (random rotations, scaling, and translations) to improve generalization.
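The block-wise supervised loss can be sketched directly from the equations above. The transition point `beta=1.0` matches the standard smooth-L1 definition; the two-stage prediction list is a made-up example.

```python
import numpy as np

def smooth_l1(pred, gt, beta=1.0):
    """Smooth-L1: quadratic below beta, linear above (Huber-style)."""
    d = np.abs(pred - gt)
    return np.where(d < beta, 0.5 * d ** 2 / beta, d - 0.5 * beta)

def handmcm_loss(preds_per_stage, gt):
    """Sum the per-keypoint loss over every supervision stage k = 0..K."""
    return float(sum(smooth_l1(p, gt).sum() for p in preds_per_stage))

# Toy example: 21 keypoints, two stages whose errors shrink from 0.5 to 0.1.
gt = np.zeros((21, 3))
preds = [np.full((21, 3), 0.5), np.full((21, 3), 0.1)]
print(round(handmcm_loss(preds, gt), 4))  # → 8.19
```

Because every stage (including `J_0`) is supervised, early blocks receive a direct gradient signal rather than relying solely on the final refinement.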
5. Empirical Evaluation and Benchmark Analysis
HandMCM's empirical performance has been established across three public datasets:
| Dataset | Input Modalities | Keypoints | Metric | HandMCM Error | Prior Best (Method) |
|---|---|---|---|---|---|
| NYU | D | 14 | MKE (mm) | 7.06 | 7.12 (HandDAGT) |
| DexYCB | RGBD | 21 | MKE (mm) | 6.67 | 7.54 (K-Fusion) |
| HO3D | RGBD | 21 | MKE (cm) | 1.71 | 1.79 (K-Fusion) |
Results indicate consistently superior accuracy, particularly in challenging occlusion regimes. Ablation studies confirm that the correspondence SSM, together with local token injection and filtering, yields the most substantial gains (from 8.47 mm down to 7.06 mm on NYU). In contrast, standard graph-guided or regular SSMs yield suboptimal performance. Additionally, experiments determine that a depth of three stacked Mamba blocks achieves the best balance between representational capacity and overfitting risk.
6. Algorithmic Pseudocode
A high-level summary of the HandMCM algorithm:
```
# Multi-modal super point encoder
P_prime, F_p = set_conv_downsample(P)
F_d = resnet_encoder(D)
F_rgb = resnet_encoder(R)
F_d_p = interpolate_to_3d(F_d, P_prime)
F_rgb_p = interpolate_to_3d(F_rgb, P_prime)
F = concat([F_p, F_d_p, F_rgb_p])
G = set_conv_pool(F)
X_0 = bias_induced_layer(G)
J_0 = linear_head(X_0)

# Stacked Correspondence Mamba blocks
for k in range(1, K + 1):
    tilde_X, V, X_f, X_b = preprocess(X_{k-1})
    U_f = SSM_fwd(X_f)
    U_b = SSM_bwd(X_b)
    M_corr = linear_outer(U_f, reverse(U_b))
    X_k_prime = M_corr @ V
    # Local injection / filtering
    for j in range(J):
        X_loc_j = local_token(P_prime, F, J_{k-1, j}, X_{k-1, j})
        tilde_X_j = LN(X_{k-1, j} * X_loc_j)
        G_j = sigmoid(X_loc_j)
        J_{k, j} = (G_j * X_k_prime[j] + (1 - G_j) * X_loc_j) @ W_r

loss = sum_over_k_j(smooth_L1(J_{k, j} - gt_j))
backprop_and_update(loss)
```
7. Contributions, Limitations, and Future Directions
Key contributions of HandMCM include:
- The introduction of the first Mamba-based correspondence state-space model for 3D hand pose estimation, providing dynamic, context-sensitive modeling of kinematic correspondences.
- A local token injection and filtering scheme, ensuring both global and fine-scale geometric alignment per keypoint.
- Rigorous multi-modal fusion of depth, RGB, and point cloud sources at the super point level, supporting occlusion robustness.
- Demonstration of state-of-the-art accuracy on large-scale hand pose benchmarks through extensive ablation and comparative analysis.
Limitations are primarily in the scope of supported interactions; the current HandMCM operates only in the single-hand (or hand-object) domain, with bi-manual and hand-hand interaction scenarios remaining open. Potential future directions include extending correspondence modeling across hands, dynamic handling of closely interacting entities, and exploration of more parameter- and compute-efficient SSM variants for real-time deployment in AR/VR applications (Cheng et al., 2 Feb 2026).