BlazePose Pipeline: Real-Time Pose Estimation

Updated 9 March 2026

BlazePose Pipeline is a lightweight, real-time human pose estimation system that uses cascade CNNs for efficient on-device tracking.
It employs multi-stage processing—including person detection, ROI alignment, and temporal smoothing—to ensure robust 2D/3D landmark detection.
The system integrates body and hand keypoints via an MLP-Mixer and statistical modeling to generate precise full-body meshes for applications like AR and motion capture.

BlazePose is a family of lightweight, real-time human pose estimation pipelines designed specifically for efficient on-device inference. The pipeline achieves robust single-person 2D (and, in later extensions, 3D) body landmark detection, tracking, and holistic full-body modeling using compact convolutional neural network (CNN) architectures. Notable for combining high accuracy with real-time performance (≥30 FPS) on commodity mobile hardware, BlazePose uses a cascade of neural network modules for pose detection, temporal tracking, spatial alignment, and both 2D and 3D keypoint regression. The BlazePose GHUM Holistic variant generalizes the pipeline to provide 3D body, hand, and facial landmark estimates in a unified representation, combining deep learning with statistical body modeling (Bazarevsky et al., 2020, Grishchenko et al., 2022).

1. System Architecture and Pipeline Staging

BlazePose’s architecture is structured as a multi-stage pipeline optimized for single-person, on-device pose tracking. In the original 2D variant (Bazarevsky et al., 2020), the pipeline is composed of:

Stage 1: Person Detection and Initialization. A fast person-proposal mechanism based on the BlazeFace face detector is augmented to regress the hip midpoint, the body scale (the diameter of a circle circumscribing the person), and an inclination angle defined by the mid-hip to mid-shoulder vector. The detector runs on the full image initially and is re-invoked only if the tracker loses the subject.
Stage 2: Cropping, Normalization, and Alignment. Using the detector's regressed values (face bounding box, hip midpoint, scale, inclination), a square region of interest (ROI) is cropped, centered at the predicted hip midpoint, sized to the circumscribing circle, and rotated so that the mid-shoulder to mid-hip axis is vertical. The ROI is resized to 256×256 and its RGB values are normalized to [-1, 1].
Stage 3: Tracker Refinement and Landmark Estimation. For subsequent frames, the ROI transform predicted on the previous frame is applied to the current input, and the tracker network predicts 33 keypoint coordinates, visibility scores, and a new ROI transform per frame.
Stage 4: Feedback and Fallback. If the person-presence confidence (a global flag emitted by the tracker) falls below a threshold, the detector is re-invoked to re-acquire the subject.
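The alignment geometry of Stage 2 can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names `roi_from_alignment`, `crop_to_image_matrix`, and `normalize_rgb` are hypothetical, and the sketch assumes image coordinates with y growing downward and a crop parameterized in normalized [0, 1] coordinates.

```python
import numpy as np

def roi_from_alignment(hip_mid, shoulder_mid, diameter):
    """Return (center, size, angle) of a square ROI.

    angle is the inclination of the hip->shoulder vector measured from
    the image "up" direction (0, -1), in radians.
    """
    hip_mid = np.asarray(hip_mid, dtype=float)
    shoulder_mid = np.asarray(shoulder_mid, dtype=float)
    dx, dy = shoulder_mid - hip_mid
    angle = np.arctan2(dx, -dy)  # 0 when the torso is upright
    return hip_mid, float(diameter), float(angle)

def crop_to_image_matrix(center, size, angle):
    """2x3 affine map from normalized crop coords (u, v) to image pixels.

    The crop center (0.5, 0.5) lands on `center`, and the crop's vertical
    axis is rotated onto the hip->shoulder direction.
    """
    c, s = np.cos(angle), np.sin(angle)
    linear = size * np.array([[c, -s], [s, c]])
    offset = np.asarray(center, dtype=float) - linear @ np.array([0.5, 0.5])
    return np.hstack([linear, offset[:, None]])

def normalize_rgb(crop_uint8):
    """Map 8-bit RGB values from [0, 255] to [-1, 1]."""
    return crop_uint8.astype(np.float32) / 127.5 - 1.0

# Example: a person lying on their side, shoulders 100 px to the right of the hips.
center, size, angle = roi_from_alignment((100, 150), (200, 150), diameter=200)
M = crop_to_image_matrix(center, size, angle)
top_center_in_image = M @ np.array([0.5, 0.0, 1.0])  # top-center of the crop
```

In the example, the top-center of the crop maps onto the shoulder midpoint, which is what "rotating the torso axis to vertical" means in practice. Sampling the actual 256×256 pixels along this transform would be done with an image-warping routine, which is omitted here.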

The GHUM Holistic extension (Grishchenko et al., 2022) replaces and augments this structure with:

End-to-End Data Flow: A single 256×256 CNN serves as a body detector/tracker, simultaneous 2D/3D landmark regressor, and palm seed point generator.
Hand Tracking Module: Four body palm seeds define affine crops for each wrist; a dedicated recrop CNN matches ROI geometry to that used in standalone hand models, and MediaPipe Hands supplies 21 2D+3D landmarks per hand.
Full-Body Lifter Stage: All landmarks are composed into a single token sequence for an MLP-Mixer network, which predicts GHUM statistical body model parameters, producing the full body mesh in metric 3D camera coordinates.

2. Neural Network Components

2D Tracker Network and Encoder–Decoder (BlazePose v1)

The 2D tracker module features a cascade network:

Encoder Path (“heatmap path”):
- Conv (3×3, stride 2, 32 ch, ReLU)
- Conv (3×3, stride 1, 64 ch, ReLU)
- Several mobile-friendly residual blocks
- Bottleneck features (~256 channels at 16×16 spatial)
Decoder Path:
- UpConv stages with skip connections
- Heatmap head: 33 channels, 1×1 conv, linear activation
- Offset head: 33×2 channels, 1×1 conv, linear activation
Regression Encoder:
- Concatenated bottleneck and decoder features, processed by small conv stacks
- Outputs: 33×2 normalized keypoint coordinates, 33 visibility scores, and 4 ROI-delta values
- Gradients from the regression loss are stopped before they reach the heatmap backbone, so the shared features are trained by the heatmap and offset losses only.
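To make the heatmap and offset heads concrete, the sketch below shows one common way such outputs are turned into sub-pixel coordinates: take the heatmap peak per keypoint, then add the offset stored at that cell. The source does not specify the offset convention or the output stride, so both are assumptions here (offsets stored as (dx, dy) in input pixels, one value per heatmap cell), and the function name is hypothetical. In BlazePose these heads supervise training only.

```python
import numpy as np

def decode_heatmap_offsets(heatmaps, offsets, stride):
    """Decode K keypoints from heatmaps and offset maps.

    heatmaps: (K, H, W) scores.
    offsets:  (K, 2, H, W) with (dx, dy) in input pixels (assumed convention).
    stride:   input pixels per heatmap cell (assumed).
    Returns a (K, 2) array of (x, y) in input pixels.
    """
    K, H, W = heatmaps.shape
    peak = heatmaps.reshape(K, -1).argmax(axis=1)
    rows, cols = np.unravel_index(peak, (H, W))
    k = np.arange(K)
    dx = offsets[k, 0, rows, cols]
    dy = offsets[k, 1, rows, cols]
    return np.stack([cols * stride + dx, rows * stride + dy], axis=1)

# Tiny example: 2 keypoints on a 4x4 heatmap with stride 4.
hm = np.zeros((2, 4, 4))
off = np.zeros((2, 2, 4, 4))
hm[0, 1, 2] = 1.0          # peak at row 1, col 2
off[0, :, 1, 2] = (1.5, -0.5)
hm[1, 3, 0] = 1.0          # peak at row 3, col 0
keypoints = decode_heatmap_offsets(hm, off, stride=4)
```

The coarse cell index fixes the location to within one stride, and the offset supplies the remainder, which is why the two heads together can express positions finer than the heatmap grid.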

3D Network Components (GHUM Holistic)

Regression Head: Outputs 33×3 keypoint estimates via a small convolutional head (activation functions presumed ReLU).
Recrop Network: A shallow CNN that refines wrist crops for hand module compatibility.
Hand Landmark Model: MediaPipe Hands utilizes a U-Net/hourglass stack for 21 keypoints per hand.
MLP-Mixer Lifter: Accepts 75×3 tokens (33 body, 21×2 hands) and predicts GHUM model parameters ($\beta$ for shape, $\theta$ for pose, $r$ for root rotation, $t$ for translation). Each 3D point is projected via a shared linear mapping to $C$ hidden dimensions and passed through $L$ Mixer layers (token-mixing and channel-mixing MLPs with skip connections), followed by a final head for each predicted quantity.
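The lifter's data flow can be sketched in NumPy as follows. This is an untrained, illustrative forward pass under several assumptions not stated in the source: layer normalization and GELU as in the original MLP-Mixer design, mean pooling over tokens before the heads, and placeholder output sizes for $\beta$, $\theta$, $r$, and $t$. All function names and dimensions are hypothetical.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def mlp(x, w1, b1, w2, b2):
    return gelu(x @ w1 + b1) @ w2 + b2

def init_mlp(rng, d_in, d_hidden):
    return (rng.normal(0, 0.02, (d_in, d_hidden)), np.zeros(d_hidden),
            rng.normal(0, 0.02, (d_hidden, d_in)), np.zeros(d_in))

def init_params(rng, tokens=75, channels=64, num_layers=2, head_dims=None):
    # Head sizes are placeholders, not the real GHUM parameter dimensions.
    head_dims = head_dims or {"beta": 16, "theta": 32, "r": 6, "t": 3}
    return {
        "embed_w": rng.normal(0, 0.02, (3, channels)),
        "embed_b": np.zeros(channels),
        "layers": [(init_mlp(rng, tokens, 2 * tokens),
                    init_mlp(rng, channels, 2 * channels))
                   for _ in range(num_layers)],
        "heads": {k: rng.normal(0, 0.02, (channels, d)) for k, d in head_dims.items()},
    }

def mixer_layer(x, token_params, channel_params):
    # x: (tokens, channels). Token mixing acts along the token axis.
    x = x + mlp(layer_norm(x).T, *token_params).T
    # Channel mixing acts along the channel axis, independently per token.
    x = x + mlp(layer_norm(x), *channel_params)
    return x

def lift(landmarks, params):
    x = landmarks @ params["embed_w"] + params["embed_b"]  # shared per-point embedding
    for token_params, channel_params in params["layers"]:
        x = mixer_layer(x, token_params, channel_params)
    pooled = layer_norm(x).mean(axis=0)  # assumed pooling before the heads
    return {name: pooled @ w for name, w in params["heads"].items()}

rng = np.random.default_rng(0)
body = rng.normal(size=(33, 3))       # 33 body landmarks
hands = rng.normal(size=(2, 21, 3))   # 21 landmarks per hand
landmarks = np.concatenate([body, hands.reshape(-1, 3)])  # (75, 3) token sequence
outputs = lift(landmarks, init_params(rng))
```

The transpose in `mixer_layer` is the whole trick: the same two-layer MLP form is applied once across tokens (letting body and hand landmarks exchange information) and once across channels.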

3. Preprocessing, Keypoint Decoding, and Temporal Smoothing

Preprocessing: ROIs are resized to 256×256 RGB and normalized. For each new frame, the ROI transformation of the prior frame maintains temporal stability.
Augmentation (training only): ±10% random scale/translation, synthetic occlusion, color jitter.
Keypoint Decoding: During training, heatmaps provide coarse supervision, and offset maps refine sub-pixel localization. Inference discards heatmap/offset heads; the regression encoder alone emits 33×2 normalized coordinates in crop space. These are mapped back to original image coordinates using the inverse ROI transformation.
Temporal Smoothing: Exponential moving average over 3–5 frames is optionally applied to reduce jitter in landmark position time series.
GHUM Holistic-specific Postprocessing: Body outputs are mapped to local camera space (hip-centered); hand landmarks are additionally rooted at each wrist. Smoothing of GHUM parameters is not specified.
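The decoding and smoothing steps above can be sketched as follows. The sketch assumes the crop-to-image transform is available as a 2×3 affine matrix acting on normalized crop coordinates, and it uses the common convention $\alpha = 2/(N+1)$ to relate an exponential moving average to an $N$-frame window; neither detail is specified in the source, and the names `to_image_coords` and `LandmarkEMA` are hypothetical.

```python
import numpy as np

def to_image_coords(norm_xy, crop_to_image):
    """Map (K, 2) normalized crop coordinates to image pixels.

    crop_to_image: 2x3 affine matrix (the inverse of the image-to-crop ROI transform).
    """
    norm_xy = np.asarray(norm_xy, dtype=float)
    homogeneous = np.hstack([norm_xy, np.ones((norm_xy.shape[0], 1))])
    return homogeneous @ np.asarray(crop_to_image, dtype=float).T

class LandmarkEMA:
    """Exponential moving average over landmark positions."""

    def __init__(self, window=4):
        # Assumed mapping from a window of N frames to the smoothing factor.
        self.alpha = 2.0 / (window + 1.0)
        self.state = None

    def update(self, landmarks):
        landmarks = np.asarray(landmarks, dtype=float)
        if self.state is None:
            self.state = landmarks.copy()  # first frame passes through unchanged
        else:
            self.state = self.alpha * landmarks + (1.0 - self.alpha) * self.state
        return self.state

# Axis-aligned 256-px crop whose top-left corner sits at image pixel (10, 20).
crop_to_image = np.array([[256.0, 0.0, 10.0],
                          [0.0, 256.0, 20.0]])
pixels = to_image_coords([[0.5, 0.5]], crop_to_image)

ema = LandmarkEMA(window=3)          # alpha = 0.5
first = ema.update([[0.0, 0.0]])
second = ema.update([[2.0, 4.0]])
```

A larger window lowers `alpha` and suppresses more jitter at the cost of added lag, which is why the smoothing is optional and kept to a few frames.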

4. Training Losses and Ground Truth Acquisition

BlazePose (2D)

Training employs a composite loss:

Heatmap MSE:

$L_\text{heat} = \frac{1}{N}\,\sum_{n=1}^N\,\sum_{i=1}^{33}\,\sum_{u,v} (H_i^{(n)}(u,v) - \hat H_i^{(n)}(u,v))^2$

Offset L2:

$L_\text{offset} = \frac{1}{N}\,\sum_{n=1}^N\,\sum_{i=1}^{33}\,\sum_{u,v} \|O_i^{(n)}(u,v) - \hat O_i^{(n)}(u,v)\|_2^2$

Regression L2:

$L_\text{reg} = \frac{1}{N}\,\sum_{n=1}^N\,\sum_{i=1}^{33} \|(x_i^{(n)},y_i^{(n)}) - (\hat x_i^{(n)},\hat y_i^{(n)})\|_2^2$

Visibility: Binary cross-entropy
Total loss: $L_\text{total} = L_\text{heat} + \alpha L_\text{offset} + \beta L_\text{reg} + \ldots$ (with typical values $\alpha=1$, $\beta=5$; additional terms as appropriate).
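As a worked instance of the formulas above, the following NumPy sketch evaluates the three weighted terms and the visibility cross-entropy on a batch. Because the source leaves the weighting of the visibility term unspecified, the sketch reports it separately instead of adding it to the total. The function name and argument layout are hypothetical.

```python
import numpy as np

def blazepose_2d_losses(H, H_hat, O, O_hat, xy, xy_hat, vis, vis_logit,
                        alpha=1.0, beta=5.0):
    """Evaluate the composite 2D loss terms.

    H, H_hat:   (N, 33, h, w) target / predicted heatmaps.
    O, O_hat:   (N, 33, 2, h, w) target / predicted offset maps.
    xy, xy_hat: (N, 33, 2) target / predicted normalized coordinates.
    vis:        (N, 33) binary visibility labels; vis_logit: predicted logits.
    """
    n = H.shape[0]
    l_heat = np.sum((H - H_hat) ** 2) / n
    l_offset = np.sum((O - O_hat) ** 2) / n
    l_reg = np.sum((xy - xy_hat) ** 2) / n
    # Numerically stable binary cross-entropy on logits.
    l_vis = np.mean(np.maximum(vis_logit, 0.0) - vis_logit * vis
                    + np.log1p(np.exp(-np.abs(vis_logit))))
    total = l_heat + alpha * l_offset + beta * l_reg  # visibility weight not given in the source
    return {"heat": l_heat, "offset": l_offset, "reg": l_reg,
            "vis": l_vis, "total_without_vis": total}

N = 2
H = np.zeros((N, 33, 4, 4)); H_hat = np.full_like(H, 0.5)
O = np.zeros((N, 33, 2, 4, 4)); O_hat = O.copy()
xy = np.zeros((N, 33, 2)); xy_hat = np.full_like(xy, 0.1)
vis = np.ones((N, 33)); vis_logit = np.zeros((N, 33))
losses = blazepose_2d_losses(H, H_hat, O, O_hat, xy, xy_hat, vis, vis_logit)
```

With $\beta=5$ the regression term is weighted more heavily than the dense heatmap terms per unit of error, although the heatmap sum runs over many more elements and can still dominate in absolute value, as it does in this toy batch.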

GHUM Holistic (3D and Full-Body)

2D/3D Keypoint L2:

$L_{2D} = \sum_i \|\hat{x}_i - x_i\|_2^2$, $\quad L_{3D} = \sum_i \|\hat{z}_i - z_i\|_2^2$

Silhouette Alignment: $\ell_1$ loss between predicted and ground truth masks (during GHUM fitting).
Prior Losses: $L_\text{reg} = \|\theta\|_2^2 + \|\beta\|_2^2$ (shape and pose regularization).
Depth Ordering: Hinge penalty for violation of human-annotated joint ordering.
Lifter Training: MSE over predicted and synthetic GHUM parameters from noisy landmark inputs.

Ground Truth Acquisition:

A novel 3D annotation strategy fits the GHUM statistical body model to 2D keypoints, silhouettes, and ordinal depth annotations. The objective

$L_\text{fit} (\theta, \beta, r, t) = \lambda_{2D} \sum_i \| \Pi(rV_i(\theta, \beta) + t) - x_i \|_2^2 + \lambda_\text{sil} \| S(\theta, \beta, r, t) - S_\text{gt} \|_1 + \lambda_\text{reg} (\|\theta\|_2^2 + \|\beta\|_2^2) + \lambda_\text{ord} \text{Loss}_\text{ord}$

incorporates projection, silhouette, regularization, and depth-ordering constraints. Adding human-provided ordinal depth annotations reduces the rate of erroneous (depth-ambiguous) 3D fits from ~25% to ~3%.
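The structure of $L_\text{fit}$ can be illustrated with a small NumPy sketch. Several pieces are assumptions made for illustration only: $\Pi$ is taken to be a pinhole projection, the ordinal term is written as a hinge on pairwise depth differences with a hypothetical margin, and the posed joints $rV_i(\theta, \beta) + t$ are passed in directly because the GHUM model itself is not reproduced here.

```python
import numpy as np

def project(points_cam, focal, principal):
    """Pinhole projection of (K, 3) camera-space points to (K, 2) pixels (assumed form of Pi)."""
    points_cam = np.asarray(points_cam, dtype=float)
    return focal * points_cam[:, :2] / points_cam[:, 2:3] + np.asarray(principal, dtype=float)

def ordinal_hinge(z, pairs, margin=0.0):
    """Penalize pairs (i, j) annotated as 'i is closer than j' when z[i] > z[j] - margin."""
    return float(sum(max(0.0, z[i] - z[j] + margin) for i, j in pairs))

def fit_objective(joints_cam, theta, beta, x2d, sil_pred, sil_gt, pairs,
                  weights, focal, principal):
    joints_cam = np.asarray(joints_cam, dtype=float)
    l_2d = np.sum((project(joints_cam, focal, principal) - x2d) ** 2)
    l_sil = np.sum(np.abs(np.asarray(sil_pred, dtype=float) - sil_gt))
    l_reg = np.sum(np.square(theta)) + np.sum(np.square(beta))
    l_ord = ordinal_hinge(joints_cam[:, 2], pairs)
    return (weights["2d"] * l_2d + weights["sil"] * l_sil
            + weights["reg"] * l_reg + weights["ord"] * l_ord)

joints = np.array([[0.0, 0.0, 2.0],
                   [1.0, 0.0, 4.0]])
value = fit_objective(
    joints_cam=joints,
    theta=np.array([1.0, 0.0]), beta=np.array([2.0]),
    x2d=np.array([[0.0, 0.0], [0.5, 1.0]]),
    sil_pred=np.array([[1, 0], [0, 0]]), sil_gt=np.array([[1, 1], [0, 0]]),
    pairs=[(1, 0)],  # annotator says joint 1 is closer than joint 0; the fit disagrees
    weights={"2d": 1.0, "sil": 2.0, "reg": 0.1, "ord": 3.0},
    focal=2.0, principal=(0.0, 0.0),
)
```

In the example the 2D reprojection alone cannot tell which joint is nearer, whereas the ordinal term contributes a penalty as soon as the fitted depths contradict the annotation. This is the mechanism by which ordinal labels disambiguate monocular fits.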

5. Practical Performance and On-Device Optimization

Model Size and Complexity:

Variant	Parameters	FLOPs (MFLOPs)	Desktop FPS	Pixel 2/4 CPU FPS	Pixel 4 GPU ms
BlazePose Full	3.5M	6.9	—	~10	—
BlazePose Lite	1.3M	2.7	—	~31	—
GHUM Lite	<10M	—	77	40 (CPU)	8 (GPU)
GHUM Full	<10M	—	67	25 (CPU)	9 (GPU)
GHUM Heavy	<10M	—	34	147 (CPU)	22 (GPU)

Quantization: 8-bit integer quantization for weights and activations via TensorFlow Lite; XNNPACK employed for optimized CPU inference.
GPU Acceleration: Delegation to OpenGL or NNAPI further reduces latency (8 to 22 ms per frame on a Pixel 4 GPU across the GHUM variants in the table above).
Hand Integration: Two-stage-crop approach provides accurate hand/keypoint integration with minimal computational burden.
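The arithmetic behind 8-bit quantization can be shown in a few lines. This is a generic sketch of asymmetric affine quantization to int8, given for intuition only; it is not TensorFlow Lite's converter code, and the function names are hypothetical.

```python
import numpy as np

def quant_params(x_min, x_max, qmin=-128, qmax=127):
    """Choose scale and zero point so that [x_min, x_max] maps onto [qmin, qmax]."""
    x_min, x_max = min(x_min, 0.0), max(x_max, 0.0)  # the range must contain zero
    scale = (x_max - x_min) / (qmax - qmin)          # assumes x_max > x_min
    zero_point = int(round(qmin - x_min / scale))
    return scale, zero_point

def quantize(x, scale, zero_point, qmin=-128, qmax=127):
    q = np.round(np.asarray(x, dtype=np.float64) / scale) + zero_point
    return np.clip(q, qmin, qmax).astype(np.int8)

def dequantize(q, scale, zero_point):
    return scale * (q.astype(np.float64) - zero_point)

# Example: activations known to lie in [0, 2.55].
scale, zero_point = quant_params(0.0, 2.55)
x = np.linspace(0.0, 2.55, 11)
q = quantize(x, scale, zero_point)
x_restored = dequantize(q, scale, zero_point)
max_error = np.max(np.abs(x_restored - x))
```

Each value is stored in one byte instead of four, and the round-trip error is bounded by half a quantization step, which is the trade that makes integer inference attractive on mobile CPUs.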

The Lite models deliver single-thread mobile CPU runtimes of roughly 30 FPS or more, and GPU inference stays under 30 ms per frame on the hardware listed in the table above, while retaining state-of-the-art 2D/3D accuracy (Bazarevsky et al., 2020, Grishchenko et al., 2022).

6. Holistic Landmark Modeling and Body+Hand Integration

BlazePose GHUM Holistic integrates body and hand keypoints using a pipeline that composes body (33×3D) and hand (2×21×3D) keypoints into a unified representation. This sequence is processed by an MLP-Mixer to predict parameters for the GHUM statistical model, yielding coherent full-body meshes:

Hand Recropping: BlazePose first regresses four palm points per hand. A small CNN refines the affine crop, aligning hand ROIs to the input distribution of the high-resolution MediaPipe Hands model.
Cascade Training: Body and hand modules are trained independently but executed in the same inference graph, which keeps holistic pose estimation low-latency.
Mesh Output: The predicted GHUM parameters are decoded into a 10,168-vertex mesh via a differentiable mesh generator, providing a complete geometric representation for downstream applications including motion capture, avatar control, and AR/VR.

7. Research Context and Applications

BlazePose and BlazePose GHUM Holistic have introduced and popularized real-time, on-device human pose estimation suitable for edge devices. Their key design—an efficient, cascade-based detector/tracker paired with heatmap+regression hybrid training—enables deployment in latency-constrained settings such as fitness tracking, sign language recognition, augmented reality, and markerless motion capture. The pipelines represent significant advances in mobile-optimized network architecture, robust landmark regression, and full-body/hand integration, as documented in foundational works (Bazarevsky et al., 2020, Grishchenko et al., 2022). Novel techniques for 3D annotation using statistical model fitting with ordinal loss resolve ambiguities prevalent in monocular 3D estimation.

The modular pipeline supports multiple accuracy/performance tradeoffs (Lite/Full/Heavy), and the approach to hand and holistic tracking avoids the computational expense associated with high-resolution inputs or monolithic nets, while maintaining compatibility with state-of-the-art hand landmark models. The reproducibility and deployability of the approaches are underpinned by their effective use of lightweight convolutions, quantization, and separable architectures, alongside a clear separation of body and hand sub-models for training and inference.


References

Bazarevsky et al. (2020). BlazePose: On-device Real-time Body Pose tracking.

Grishchenko et al. (2022). BlazePose GHUM Holistic: Real-time 3D Human Landmarks and Pose Estimation.
