Multi-HMR: Multi-Person Mesh Recovery
- Multi-HMR is a method that jointly detects multiple persons and recovers complete 3D SMPL-X meshes (body, hands, face) in a single feedforward pass.
- It leverages a Vision Transformer backbone with cross-attention and patch-level detection to efficiently capture global context and manage occlusions.
- By integrating camera ray encoding and synthetic data (CUFFS), Multi-HMR achieves accurate metric 3D reconstruction and high throughput in diverse scenes.
Multi-HMR denotes the class of methods for multi-person whole-body human mesh recovery from a single RGB image. Multi-HMR approaches seek to detect and reconstruct the 3D pose, shape, hand articulation, facial expression, and metric 3D location of all persons in a scene. Unlike earlier multi-stage pipelines relying on sequential detection, cropping, and per-person mesh regression, Multi-HMR achieves full-scene, whole-body, multi-instance mesh regression in a single feedforward pass, enabling high throughput and strong performance without bottlenecks due to person cropping, camera calibration, or body part specialization.
1. Problem Definition and Scope
Multi-HMR targets the task of reconstructing whole-body SMPL-X meshes (body, hands, face) and corresponding 3D translations, optionally in metric camera coordinates when camera intrinsics are available, for a variable number of individuals in a single RGB image. The core system requirements are:
- Simultaneous multi-person detection and mesh regression, with no need for cropped person images or cascading detectors.
- Prediction of detailed SMPL-X parameters: 3D pose, shape, facial expression, and global translation.
- Recovery of global scale and metric localization, with or without known camera intrinsics.
- Scalability to real-world scenes with occlusion, varied subject sizes, and extreme scale variation (distant and close-up individuals).
- Optional uncertainty modeling and incorporation of prior knowledge (e.g., camera parameters, multi-view correspondence, shape priors) for ambiguity resolution (Baradel et al., 2024, Romain et al., 2024).
This unified formulation is distinct from earlier HMR works focused on per-person, body-only, or monocular single-instance reconstruction.
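As a concrete picture of the per-person prediction target described above, a minimal container might look as follows. The field names and default dimensionalities (53 joints, 10 shape coefficients, 10 expression coefficients) are illustrative assumptions, not the exact SMPL-X configuration of any specific method:

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class PersonPrediction:
    """One detected person's whole-body estimate (illustrative sizes)."""
    pose: np.ndarray         # per-joint rotations (body/hands/jaw), e.g. axis-angle
    shape: np.ndarray        # identity shape coefficients
    expression: np.ndarray   # facial expression coefficients
    translation: np.ndarray  # metric 3D location in camera coordinates
    score: float             # detection confidence of the proposing patch


def make_person(n_joints=53, n_shape=10, n_expr=10):
    """Zero-initialized prediction; sizes are assumptions, not fixed by the paper."""
    return PersonPrediction(
        pose=np.zeros((n_joints, 3)),
        shape=np.zeros(n_shape),
        expression=np.zeros(n_expr),
        translation=np.zeros(3),
        score=0.0,
    )
```

A single-shot model emits one such record per detected person, all produced in the same forward pass.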
2. Architecture and Methodological Advances
The core architectural paradigm in state-of-the-art Multi-HMR is a single-stage, transformer-based pipeline leveraging large-scale Vision Transformer (ViT) backbones and cross-attention-based query mechanisms. The canonical pipeline consists of:
- ViT Backbone: Processes the input image as patches, yielding a fixed-dimensional feature token per patch with learned positional embeddings (Baradel et al., 2024).
- Patch-Level Detection Head: Predicts a heatmap of primary anatomical keypoints (e.g., head), which are used for proposing person locations via thresholding and non-maximum suppression over per-patch scores (Baradel et al., 2024).
- Human Perception Head (HPH): For each detected person, forms a query vector from the corresponding patch token, a learned positional query embedding, and mean SMPL-X pose/shape priors. Multiple cross-attention and self-attention layers process these queries jointly, attending to the entire spatial image grid and each other. Each query outputs the full body mesh parameters and 3D translation for its person (Baradel et al., 2024).
- Camera Ray Encoding: Optionally encodes camera ray direction for each patch, allowing the model to recover metric depth when intrinsics are provided. Ray Fourier encodings are concatenated to patch tokens and used by downstream attention (Baradel et al., 2024).
- CUFFS Dataset Integration: To overcome data scarcity in fine-grained hand/face articulation, synthetic closeup human images with diverse hand poses (CUFFS) are injected into the training corpus and retargeted to SMPL-X meshes (Baradel et al., 2024).
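The patch-level detection step above admits a compact sketch: threshold the per-patch scores, then keep only local maxima as person proposals. The window size, threshold, and greedy local-maximum rule are assumptions standing in for the paper's CenterNet-style head; `detect_persons` is a hypothetical helper:

```python
import numpy as np


def detect_persons(scores, threshold=0.5, nms_window=1):
    """Propose person locations from a per-patch detection score map.

    scores: (H, W) array of sigmoid scores, one per ViT patch.
    A patch is kept if its score exceeds `threshold` and it is a local
    maximum within a (2*nms_window+1)^2 neighbourhood (greedy NMS).
    Returns a list of (row, col) patch indices.
    """
    H, W = scores.shape
    detections = []
    for i in range(H):
        for j in range(W):
            s = scores[i, j]
            if s <= threshold:
                continue
            i0, i1 = max(0, i - nms_window), min(H, i + nms_window + 1)
            j0, j1 = max(0, j - nms_window), min(W, j + nms_window + 1)
            if s >= scores[i0:i1, j0:j1].max():
                detections.append((i, j))
    return detections
```

Each surviving patch index then selects the token that seeds that person's query in the HPH.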
Materials and design choices are summarized in the table below:
| Stage | Mechanism | Technical Notes |
|---|---|---|
| Backbone | ViT-S/B/L over a patch grid | Positional embeddings; optional ray encoding |
| Detection | Head heatmap + offset | CenterNet-style, no crops needed |
| Mesh Regression | HPH: cross/self-attention | Per-person query, joint inference |
| Whole-Body Param. | SMPL-X | Pose, shape, expression, translation |
| Synthetic Data Boost | CUFFS dataset, HumGen3D | MANO poses for hands |
| Camera Awareness | Ray/Fourier embedding | Metric 3D placement |
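The ray/Fourier embedding row above can be sketched as follows: back-project each patch center through the intrinsics to a unit ray direction, then Fourier-encode it for concatenation with the patch token. The frequency count and normalization are assumptions:

```python
import numpy as np


def ray_fourier_encoding(K, H, W, n_freq=4):
    """Per-patch camera-ray direction, Fourier-encoded.

    K: 3x3 intrinsics matrix expressed in patch-grid units.
    (H, W): patch grid size. Returns (H, W, 3 * 2 * n_freq).
    """
    ys, xs = np.meshgrid(np.arange(H) + 0.5, np.arange(W) + 0.5, indexing="ij")
    pix = np.stack([xs, ys, np.ones_like(xs)], axis=-1)   # homogeneous patch coords
    rays = pix @ np.linalg.inv(K).T                       # back-project to rays
    rays /= np.linalg.norm(rays, axis=-1, keepdims=True)  # unit directions
    freqs = 2.0 ** np.arange(n_freq) * np.pi              # assumed frequency ladder
    ang = rays[..., None] * freqs                         # (H, W, 3, n_freq)
    enc = np.concatenate([np.sin(ang), np.cos(ang)], axis=-1)
    return enc.reshape(H, W, -1)
```

Because the encoding depends on the intrinsics, the downstream attention layers can disambiguate metric depth across different focal lengths.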
Multi-HMR’s HPH module is fundamental for efficient multi-instance, whole-body mesh regression, directly mapping image-wide features to per-person predictions via attention rather than cropping. Self-attention among queries ensures global context and enables occlusion/ordering reasoning.
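As a rough sketch of this query mechanism, the following strips the HPH down to a single attention head with no learned projections; `form_query`'s additive combination of patch token, positional embedding, and mean prior is an assumption (the actual module may concatenate and project):

```python
import numpy as np


def form_query(patch_token, pos_embed, mean_prior):
    """HPH-style query: detected patch token + positional query embedding
    + mean SMPL-X prior embedding (additive combination is an assumption)."""
    return patch_token + pos_embed + mean_prior


def cross_attention(queries, tokens):
    """Single-head scaled dot-product cross-attention, for illustration:
    each per-person query attends to every patch token of the image.

    queries: (P, d) person queries; tokens: (N, d) patch tokens.
    Returns (P, d) attended features.
    """
    d = queries.shape[-1]
    logits = queries @ tokens.T / np.sqrt(d)              # (P, N)
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)        # softmax over patches
    return weights @ tokens
```

In the full model, such cross-attention layers alternate with self-attention among the person queries, which is what lets occluded people borrow context from each other.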
3. Training Objectives and Losses
The Multi-HMR learning framework supervises all stages with losses targeting both instance assignment and 3D consistency:
- Detection Loss: Sigmoid binary cross-entropy for keypoint heatmap prediction.
- Parameter Regression Loss: Norm-based regression on SMPL-X parameters and predicted translation components.
- Mesh Consistency Loss: Distance between predicted vertex locations and ground-truth 3D mesh vertices (with alignment as required).
- Reprojection Loss: Distance between projected predicted mesh joints and ground-truth 2D joints.
- Camera-Awareness Regularization: If intrinsics are available, camera ray encoding regularizes metric depth estimation, reducing scale ambiguity for different-focal-length scenes (Baradel et al., 2024).
- CUFFS augmentation: Adding synthetic close-up training samples specifically improves distal accuracy for hands and faces.
Auxiliary losses such as silhouette or part segmentation may be included in variant implementations; the essential pipeline maintains focus on the pixel-to-SMPL(-X) mapping and direct detection-mesh supervision.
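A minimal sketch of how the core objectives combine, assuming hypothetical dictionary keys, an L2 norm on parameters, L1 on vertices and 2D joints, and unit loss weights (the paper's exact norms and weights are not reproduced here):

```python
import numpy as np


def bce(pred, target, eps=1e-7):
    """Sigmoid binary cross-entropy, averaged over the patch grid."""
    p = np.clip(pred, eps, 1 - eps)
    return -np.mean(target * np.log(p) + (1 - target) * np.log(1 - p))


def total_loss(pred, gt, w_det=1.0, w_param=1.0, w_mesh=1.0, w_2d=1.0):
    """Weighted sum of the four core objectives (hypothetical field names)."""
    l_det = bce(pred["heatmap"], gt["heatmap"])                     # detection
    l_param = np.mean((pred["params"] - gt["params"]) ** 2)         # SMPL-X regression
    l_mesh = np.mean(np.abs(pred["vertices"] - gt["vertices"]))     # 3D mesh consistency
    l_2d = np.mean(np.abs(pred["joints2d"] - gt["joints2d"]))       # 2D reprojection
    return w_det * l_det + w_param * l_param + w_mesh * l_mesh + w_2d * l_2d
```

Ground-truth assignment is resolved by the detection stage: each annotated person supervises the query seeded at its primary-keypoint patch.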
4. Performance Benchmarks and Empirical Findings
Multi-HMR achieves state-of-the-art results across standard benchmarks for both body-only and whole-body mesh recovery. Key quantitative outcomes with the ViT-L/896 backbone are:
- 3DPW (body-only): PA-MPJPE 41.7 mm, outperforming ROMP, BEV, PSVT baselines.
- MuPoTs (PCK-all): 85.0%, exceeding all prior single- and two-stage approaches.
- EHF (whole-body, PVE-All/Hands): 44.2 mm / 36.4 mm, besting other methods including ExPose, PIXIE, Hand4Whole, and OSX.
- AGORA-SMPLX (PVE-All): 109.3 mm, below previous bests at 122.8–135.5 mm.
- Depth Errors (MuPoTs): MRPE 514 mm compared to 1688 mm for ROMP.
- Throughput: Real-time inference at 28 ms per image with the ViT-S backbone at reduced input resolution, with compute approximately independent of the number of people.
Empirical ablations reveal:
- HPH (cross-attention) vs staged regression: HPH converges and generalizes better.
- Self-attention between queries: Improves occlusion reasoning and identity ordering.
- Ray encoding: Crucial for transfer across camera setups and for accurate metric reconstructions.
- CUFFS: Addition yields substantial improvement in hand metrics on EHF-H (PVE-hands from 51.2 mm to 40.5 mm).
- Resolution/backbone trade-off: Higher resolution / larger ViTs predict finer details but at increased latency.
5. Extensions: Bayesian Multi-HMR and Ambiguity Modeling
Recent work extends Multi-HMR to probabilistic modeling of mesh recovery, as in CondiMen. Instead of regressing a single point estimate per person, CondiMen learns a full joint conditional Bayesian network over camera intrinsics, 3D location, SMPL-X pose, shape, and expression. The joint distribution is factorized according to domain knowledge: shape affects depth, and both shape and depth influence pose. The conditional densities are Gaussian or Matrix Fisher distributions, with each head conditioned on the outputs of its parents. The model supports efficient maximum-a-posteriori (MAP) extraction for real-time prediction and can clamp known variables (e.g., camera or shape priors) at inference (Romain et al., 2024). This approach:
- Explicitly represents aleatoric uncertainty and inherently ambiguous projections (e.g., depth-size tradeoff).
- Enables multi-view fusion and shape conditioning at test time by adjusting the clamped variables within the Bayesian network.
- Retains parity or improvement in accuracy compared to standard Multi-HMR, with the added benefit of well-calibrated uncertainty (Romain et al., 2024).
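Clamping known variables can be illustrated with a linear-Gaussian stand-in for the network: conditioning a joint Gaussian on observed components yields the MAP (equal to the conditional mean) of the remaining ones in closed form. This is an analogy for test-time conditioning, not CondiMen's actual parameterization:

```python
import numpy as np


def conditional_gaussian(mu, cov, idx_obs, x_obs):
    """Condition a joint Gaussian N(mu, cov) on observed components.

    idx_obs: indices of clamped variables (e.g., known camera or shape).
    x_obs: their observed values.
    Returns (idx_free, conditional mean of the free variables), which is
    also their MAP estimate under the Gaussian.
    """
    n = len(mu)
    idx_free = [i for i in range(n) if i not in idx_obs]
    S_fo = cov[np.ix_(idx_free, idx_obs)]
    S_oo = cov[np.ix_(idx_obs, idx_obs)]
    # Standard Gaussian conditioning: mu_f + S_fo S_oo^{-1} (x_o - mu_o)
    mu_f = mu[idx_free] + S_fo @ np.linalg.solve(S_oo, x_obs - mu[idx_obs])
    return idx_free, mu_f
```

The same mechanism explains multi-view fusion: observations from another view enter as clamped evidence, shifting the posterior over the remaining variables.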
6. Comparative Analysis and Relation to Other Multi-HMR Paradigms
Multi-HMR is distinguished from several related domains:
- Multi-RoI HMR: Uses multiple overlapping crops and enforces camera consistency and contrastive feature learning, but does not retrieve all instances in a single forward pass, nor is it whole-body SMPL-X (Nie et al., 2024).
- DETR-style multi-person HMR: Approaches such as SAT-HMR perform box-free one-stage queries but are body-centric and may sacrifice hand/face accuracy for efficiency (Su et al., 2024).
- Model-based pose/mesh regression pipelines: Sequential approaches with explicit cropping, tracking, or separate hand/face refinement offer modularity but are slower and harder to deploy at scale.
Only canonical Multi-HMR achieves single-pass, globally consistent, whole-body SMPL-X mesh recovery for a variable number of persons, camera-aware metric inference, and competitive hand/facial reconstruction at scale.
7. Limitations and Future Directions
Multi-HMR exhibits several remaining challenges:
- Fine-grained articulation: Direct prediction of hand/facial details is data-limited for distant or low-resolution people, though CUFFS mitigates this for hands (Baradel et al., 2024).
- Global metric ordering: Reliant on camera ray encoding quality and accuracy of detected intrinsics.
- Occlusion and identity: While self-attention among detection queries enables some occlusion resolution, heavy occlusions and crowded scenes can still degrade per-person mesh quality and ordering fidelity.
- Scalability: Very large group scenes stress global context and may increase query collisions, though performance degrades gracefully.
- Probabilistic modeling: While CondiMen enables distributional inference, mode extraction remains sequential in the current implementation, potentially limiting speed in scenes with many people.
Future work is likely to further integrate multi-view cues at inference (without retraining), exploit more extensive synthetic data for hand/facial articulation, and refine uncertainty-aware modeling for open-set and ambiguous scenarios (Romain et al., 2024).
Key references:
- "Multi-HMR: Multi-Person Whole-Body Human Mesh Recovery in a Single Shot" (Baradel et al., 2024)
- "CondiMen: Conditional Multi-Person Mesh Recovery" (Romain et al., 2024)
- "SAT-HMR: Real-Time Multi-Person 3D Mesh Estimation via Scale-Adaptive Tokens" (Su et al., 2024)
- "Multi-RoI Human Mesh Recovery with Camera Consistency and Contrastive Losses" (Nie et al., 2024)