HandMCM: Multi-Modal 3D Hand Pose Estimation
- The paper introduces a Mamba-based correspondence state-space model (SSM) that dynamically refines 3D hand keypoint positions.
- HandMCM integrates multi-modal inputs by projecting 2D features from RGB and depth onto 3D super points derived via set convolutions.
- Empirical results show superior performance with reduced error on benchmarks like NYU, DexYCB, and HO3D, especially in occlusion scenarios.
HandMCM refers to the "Multi-modal Point Cloud–based Correspondence State Space Model for 3D Hand Pose Estimation," a framework that introduces a state-space model (SSM), specifically adapted from the Mamba architecture, for robust and accurate 3D hand pose estimation from multi-modal input data. The HandMCM approach directly targets the longstanding challenge of precise hand keypoint estimation under self- and object-occlusions, leveraging deep geometric learning, multi-modal fusion, and dynamic correspondence modeling to set new performance benchmarks across several standard datasets (Cheng et al., 2 Feb 2026).
1. Architectural Foundations and Network Design
HandMCM is architected to process a single-hand RGB image `R`, a paired depth map `D`, and a dense 3D point cloud `P` sampled from the depth image. The multi-modal super point encoder combines these modalities by:
- Downsampling the input point cloud with a PointNet++-style set convolution, deriving "super points" `P'` with geometric features `F_p`.
- Extracting 2D features from the depth and RGB images using respective ResNet-based autoencoders, yielding feature maps `F_d` and `F_rgb`.
- Projecting these 2D features onto the super points via geometric interpolation, producing `F_d_p` and `F_rgb_p`.
- Concatenating all modalities on the super point domain: `F = concat([F_p, F_d_p, F_rgb_p])`.
Global aggregation via set convolutions yields a global representation `G`, which is replicated and shifted via a bias-induced layer (BIL) to produce the initial set of per-keypoint tokens `X_0` and a first estimate of the 3D keypoint positions `J_0`. The main predictive backbone consists of stacked Correspondence Mamba blocks, each of which iteratively refines the token representations and regresses updated 3D keypoint coordinates.
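The encoder pipeline above can be sketched with numpy. This is a shape-level illustration only: the sizes, the max-pooling stand-in for set-conv pooling, and the random projection matrices are assumptions, not the paper's learned components.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: S super points, J keypoints, per-modality channels.
S, J = 128, 21
C_p, C_d, C_rgb, C_tok = 64, 32, 32, 128

F_p = rng.standard_normal((S, C_p))        # super-point geometry features (set conv)
F_d_p = rng.standard_normal((S, C_d))      # depth features projected to super points
F_rgb_p = rng.standard_normal((S, C_rgb))  # RGB features projected to super points

# Concatenate all modalities on the super-point domain.
F = np.concatenate([F_p, F_d_p, F_rgb_p], axis=-1)   # (S, C_p + C_d + C_rgb)

# Global aggregation (max pooling stands in for set-conv pooling).
G = F.max(axis=0)                                     # (C_p + C_d + C_rgb,)

# Bias-induced layer: replicate G for every keypoint, shift by per-keypoint biases.
W = rng.standard_normal((F.shape[-1], C_tok)) * 0.01  # stand-in projection
B = rng.standard_normal((J, C_tok)) * 0.01            # per-keypoint learned biases
X_0 = np.tile(G @ W, (J, 1)) + B                      # (J, C_tok) initial tokens

W_head = rng.standard_normal((C_tok, 3)) * 0.01
J_0 = X_0 @ W_head                                    # (J, 3) initial keypoints
print(X_0.shape, J_0.shape)
```

The key property the BIL provides is that every keypoint token starts from the same global summary `G` but is individuated by its own learned bias.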
2. Correspondence State-Space Modeling
At the core of HandMCM are the bidirectional, gated SSM ("BiGS") blocks inspired by the Mamba architecture, reinterpreted for hand pose correspondence. Instead of modeling keypoints as graph nodes, HandMCM forms a one-dimensional sequence ("scan path") of tokens representing each keypoint.
Within each block `k`, both the forward (`X_f`) and backward-reversed (`X_b`) token streams are processed independently by the SSM:
- Tokens are normalized and projected to produce a value stream `V` and the directional streams `X_f` and `X_b`.
- Each stream is processed by a learned SSM, yielding outputs `U_f` and `U_b`.
- An outer product of `U_f` and the re-reversed `U_b`, passed through a linear projection, defines a correspondence map `M_corr` encoding dynamic spatial dependencies across keypoints.
- The updated token representation is obtained by linearly mixing `V` according to `M_corr`; keypoint positions are regressed via a shared linear head.
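The correspondence-map mixing step can be sketched as follows. The sizes and the row-wise softmax normalization are illustrative assumptions (the paper specifies only a linear projection of the outer product); the point is the data flow from pairwise token interactions to a `J × J` mixing matrix.

```python
import numpy as np

rng = np.random.default_rng(1)
J, C = 21, 64                     # keypoints, token channels (illustrative sizes)

V = rng.standard_normal((J, C))   # value stream
U_f = rng.standard_normal((J, C)) # forward SSM output
U_b = rng.standard_normal((J, C)) # backward SSM output (already re-reversed)

# Channel-wise outer product across the keypoint axis, projected to a J x J map.
W_m = rng.standard_normal((C, 1)) * 0.1
M_raw = np.einsum('ic,jc->ijc', U_f, U_b)      # (J, J, C) pairwise interactions
M_corr = (M_raw @ W_m).squeeze(-1)             # (J, J) correspondence map

# Row-wise softmax (an assumption) so each keypoint takes a convex mix of values.
M_corr = np.exp(M_corr - M_corr.max(axis=-1, keepdims=True))
M_corr /= M_corr.sum(axis=-1, keepdims=True)

X_k = M_corr @ V                               # (J, C) updated tokens
print(X_k.shape)
```

Unlike a fixed kinematic graph, `M_corr` is recomputed per input, so the effective keypoint connectivity adapts to the observed hand configuration.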
A local token injection and filtering mechanism further sharpens the estimate: for each keypoint prediction `J_{k-1,j}`, its nearest super points are retrieved, concatenated with both global and local features, and encoded by a small set-conv network, producing a local token `X_loc_j`. This local information is injected via a multiplicative modified LayerNorm, with a learned gate adaptively blending the global SSM update and the local prediction per keypoint.
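The injection-and-gating step described above can be sketched in numpy. The random stand-ins for the three token streams and the choice of a sigmoid gate derived from the local token are assumptions consistent with the pseudocode in Section 6, not the exact parameterization.

```python
import numpy as np

rng = np.random.default_rng(2)
J, C = 21, 64

X_prev = rng.standard_normal((J, C))   # tokens from the previous block
X_glob = rng.standard_normal((J, C))   # global SSM update (M_corr @ V)
X_loc = rng.standard_normal((J, C))    # local tokens from nearby super points

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    sd = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sd + eps)

# Multiplicative modified LayerNorm: the local token modulates the normalized stream.
X_mod = layer_norm(X_prev) * X_loc

# Learned gate (sigmoid of the local token here) blends the global and local paths.
gate = 1.0 / (1.0 + np.exp(-X_loc))
X_out = gate * X_glob + (1.0 - gate) * X_mod
print(X_out.shape)
```

The gate lets each keypoint decide, per channel, whether to trust the global correspondence update or the geometry retrieved around its current 3D estimate — useful when part of the hand is occluded.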
3. Multi-Modal Feature Fusion Mechanisms
HandMCM distinguishes itself among 3D hand pose architectures by explicitly fusing multiple sensing modalities before correspondence modeling:
- 3D geometric features extracted via point cloud set convolution summarize hand surface geometry.
- 2D spatial and semantic appearance descriptors are extracted from both depth and RGB streams, providing information robust to occlusion and appearance variation.
- Through rigorous spatial alignment (2D→3D projection), per-super-point feature vectors are assembled, unifying local geometry, appearance, and depth cues in a manner that guides both global state-space reasoning and local keypoint refinement.
This rich fused representation provides high resilience to missing or occluded modalities and underpins HandMCM's superior robustness to occlusion scenarios.
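The 2D→3D projection underlying this fusion can be sketched as follows. The pinhole intrinsics, the feature-map size, and the nearest-neighbor sampling are all illustrative assumptions (the paper uses geometric interpolation, which bilinear sampling would approximate more closely).

```python
import numpy as np

rng = np.random.default_rng(3)
H, W, C = 32, 32, 16              # feature-map size and channels (illustrative)
S = 128                           # number of super points
fx = fy = 500.0                   # assumed pinhole focal lengths (pixels)
cx, cy = W / 2, H / 2             # assumed principal point

F_2d = rng.standard_normal((H, W, C))                 # 2D feature map (depth or RGB)
P_sup = rng.uniform(-0.1, 0.1, (S, 3)) + [0, 0, 0.5]  # super points (meters)

# Project each 3D super point into the image plane.
u = fx * P_sup[:, 0] / P_sup[:, 2] + cx
v = fy * P_sup[:, 1] / P_sup[:, 2] + cy

# Nearest-neighbor sampling of the 2D features at the projected locations.
ui = np.clip(np.round(u).astype(int), 0, W - 1)
vi = np.clip(np.round(v).astype(int), 0, H - 1)
F_proj = F_2d[vi, ui]                                 # (S, C) per-super-point features
print(F_proj.shape)
```

After this step each super point carries aligned geometric, depth, and appearance channels, so downstream reasoning never has to re-associate pixels with 3D locations.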
4. Training, Optimization, and Loss Structure
HandMCM is trained using a block-wise multi-stage supervision strategy. For each predicted set of keypoints `J_k` at stage `k` (including the initial estimate `J_0`), the loss is computed as the sum over all keypoints of the smooth-L1 discrepancy with the corresponding ground truth `J^gt_j`:

$$\mathcal{L} = \sum_{k=0}^{K} \sum_{j=1}^{J} \mathrm{SmoothL1}\left(J_{k,j} - J^{gt}_{j}\right),$$

where

$$\mathrm{SmoothL1}(x) = \begin{cases} 0.5\,x^{2} & \text{if } |x| < 1, \\ |x| - 0.5 & \text{otherwise.} \end{cases}$$
Training employs the AdamW optimizer with a batch size of 32. Data augmentation includes extensive 3D geometric perturbations (random rotations, scaling, and translations) to improve generalization.
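The block-wise supervised loss can be sketched directly from the equations above. The transition point `beta=1.0` matches the standard smooth-L1 definition; the two-stage prediction list is a made-up example.

```python
import numpy as np

def smooth_l1(pred, gt, beta=1.0):
    """Smooth-L1: quadratic below beta, linear above (Huber-style)."""
    d = np.abs(pred - gt)
    return np.where(d < beta, 0.5 * d ** 2 / beta, d - 0.5 * beta)

def handmcm_loss(preds_per_stage, gt):
    """Sum the per-keypoint loss over every supervision stage k = 0..K."""
    return float(sum(smooth_l1(p, gt).sum() for p in preds_per_stage))

# Toy example: 21 keypoints, two stages whose errors shrink from 0.5 to 0.1.
gt = np.zeros((21, 3))
preds = [np.full((21, 3), 0.5), np.full((21, 3), 0.1)]
print(round(handmcm_loss(preds, gt), 4))  # → 8.19
```

Because every stage (including `J_0`) is supervised, early blocks receive a direct gradient signal rather than relying solely on the final refinement.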
5. Empirical Evaluation and Benchmark Analysis
HandMCM's empirical performance has been established across three public datasets:
| Dataset | Input Modalities | Keypoints | Metric | HandMCM Error | Prior Best (Method) |
|---|---|---|---|---|---|
| NYU | D | 14 | MKE (mm) | 7.06 | 7.12 (HandDAGT) |
| DexYCB | RGBD | 21 | MKE (mm) | 6.67 | 7.54 (K-Fusion) |
| HO3D | RGBD | 21 | MKE (cm) | 1.71 | 1.79 (K-Fusion) |
Results indicate consistently superior accuracy, particularly in challenging occlusion regimes. Ablation studies confirm that the correspondence SSM, together with local token injection and filtering, yields the most substantial gains (from 8.47 mm down to 7.06 mm on NYU). In contrast, standard graph-guided or regular SSMs yield suboptimal performance. Additionally, experiments determine that a depth of three stacked Mamba blocks achieves the best balance between representational capacity and overfitting risk.
6. Algorithmic Pseudocode
A high-level summary of the HandMCM algorithm:
```
# Multi-modal super point encoder
P_prime, F_p = set_conv_downsample(P)
F_d = resnet_encoder(D)
F_rgb = resnet_encoder(R)
F_d_p = interpolate_to_3d(F_d, P_prime)
F_rgb_p = interpolate_to_3d(F_rgb, P_prime)
F = concat([F_p, F_d_p, F_rgb_p])
G = set_conv_pool(F)
X_0 = bias_induced_layer(G)
J_0 = linear_head(X_0)

# Stacked Correspondence Mamba blocks
for k in range(1, K + 1):
    tilde_X, V, X_f, X_b = preprocess(X_{k-1})
    U_f = SSM_fwd(X_f)
    U_b = SSM_bwd(X_b)
    M_corr = linear_outer(U_f, reverse(U_b))
    X_k_prime = M_corr @ V
    # Local injection / filtering
    for j in range(J):
        X_loc_j = local_token(P_prime, F, J_{k-1, j}, X_{k-1, j})
        tilde_X_j = LN(X_{k-1, j} * X_loc_j)
        G_j = sigmoid(X_loc_j)
        J_{k, j} = (G_j * X_k_prime[j] + (1 - G_j) * X_loc_j) @ W_r

loss = sum_over_k_j(smooth_L1(J_{k, j} - gt_j))
backprop_and_update(loss)
```
7. Contributions, Limitations, and Future Directions
Key contributions of HandMCM include:
- The introduction of the first Mamba-based correspondence state-space model for 3D hand pose estimation, providing dynamic, context-sensitive modeling of kinematic correspondences.
- A local token injection and filtering scheme, ensuring both global and fine-scale geometric alignment per keypoint.
- Rigorous multi-modal fusion of depth, RGB, and point cloud sources at the super point level, supporting occlusion robustness.
- Demonstration of state-of-the-art accuracy on large-scale hand pose benchmarks through extensive ablation and comparative analysis.
Limitations are primarily in the scope of supported interactions; the current HandMCM operates only in the single-hand (or hand-object) domain, with bi-manual and hand-hand interaction scenarios remaining open. Potential future directions include extending correspondence modeling across hands, dynamic handling of closely interacting entities, and exploration of more parameter- and compute-efficient SSM variants for real-time deployment in AR/VR applications (Cheng et al., 2 Feb 2026).