
BlazeFace: Mobile Face Detection

Updated 19 February 2026
  • BlazeFace is a neural face detector optimized for mobile GPUs, delivering sub-millisecond inference for AR applications.
  • Its architecture features compact BlazeBlock modules with depthwise separable convolutions and a simplified SSD-derived anchor scheme to reduce jitter.
  • Empirical benchmarks show BlazeFace achieves up to 1000+ FPS with improved temporal stability and accuracy for tasks like facial keypoint estimation and segmentation.

BlazeFace is a neural face detector specifically optimized for sub-millisecond inference on mobile GPUs. Designed to support augmented reality (AR) pipelines demanding highly accurate and temporally stable face localization, BlazeFace introduces a suite of architecture, inference, and deployment innovations distinct from prior lightweight detectors such as MobileNetV1/V2-SSD. Its critical design elements enable throughput of 200–1000+ FPS on flagship devices without quantization or heavy pruning, facilitating real-time downstream tasks including facial keypoint estimation, geometry reconstruction, expression classification, and facial region segmentation (Bazarevsky et al., 2019).

1. Network Architecture and Innovations

BlazeFace’s network architecture comprises a compact, depthwise-separable convolutional backbone tailored for face detection on low-resolution inputs (128×128×3 RGB). The backbone consists of an initial 5×5 convolution followed by repeated, customized “BlazeBlock” modules. The key features are:

  • Single BlazeBlock: Composed of a 5×5 depthwise convolution, batch normalization (BN), ReLU, and a 1×1 pointwise convolution with BN. Identity residuals are present if stride is 1.
  • Double BlazeBlock: Diverges from the MobileNetV2 bottleneck by using a Project–DW–Expand order (as opposed to Expand–DW–Project), with residuals operating in the expanded channel space and doubled depthwise convolutions to increase receptive field with fewer blocks.
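The single-block structure above can be sketched in NumPy. This is a minimal illustration under stated simplifications, not the reference implementation: batch normalization is omitted, weights are random placeholders, and only the stride-1 single BlazeBlock is shown.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def depthwise_conv(x, w):
    """5x5 depthwise convolution, stride 1, 'same' padding.
    x: (H, W, C) feature map; w: (k, k, C) per-channel kernels."""
    k = w.shape[0]
    p = k // 2
    xp = np.pad(x, ((p, p), (p, p), (0, 0)))
    win = sliding_window_view(xp, (k, k), axis=(0, 1))  # (H, W, C, k, k)
    return np.einsum('hwckl,klc->hwc', win, w)

def single_blaze_block(x, w_dw, w_pw):
    """Single BlazeBlock, stride 1: 5x5 depthwise + 1x1 pointwise,
    identity residual, final ReLU. (BN layers omitted for brevity.)"""
    y = depthwise_conv(x, w_dw)
    y = np.einsum('hwc,cd->hwd', y, w_pw)  # 1x1 pointwise, C -> D
    return np.maximum(x + y, 0.0)          # residual add + ReLU

# Shape check on a 64x64x24 feature map, matching the backbone's first stage
x = np.random.randn(64, 64, 24)
out = single_blaze_block(x,
                         np.random.randn(5, 5, 24) * 0.01,
                         np.random.randn(24, 24) * 0.01)
print(out.shape)  # (64, 64, 24)
```

A stride-2 variant would subsample the window positions and use a channel-padded (or pooled) shortcut instead of the identity residual.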

A schematic layer-by-layer breakdown is shown below:

| Layer | Input | Kernel / structure | Output |
|---|---|---|---|
| Convolution | 128×128×3 | 5×5, 3→24, stride 2 | 64×64×24 |
| Single BlazeBlock ×2 | 64×64×24 | (5×5 dw + 1×1 pw), 24→24 | 64×64×24 |
| Single BlazeBlock (s=2) | 64×64×24 | …, 24→48 | 32×32×48 |
| Single BlazeBlock ×2 | 32×32×48 | …, 48→48 | 32×32×48 |
| Double BlazeBlock (s=2) | 32×32×48 | 5×5 dw; 1×1 project; 5×5 dw; 1×1 expand (48→96) | 16×16×96 |
| Double BlazeBlock ×2 | 16×16×96 | … | 16×16×96 |
| Double BlazeBlock (s=2) | 16×16×96 | … | 8×8×96 |
| Double BlazeBlock ×2 | 8×8×96 | … | 8×8×96 |

The final feature map is of size 8×8×96, to which SSD-style detection heads are attached. The design choice to favor 5×5 (over 3×3) in depthwise convolution is motivated by the large receptive field expansion at marginal additional computational cost per block.

Standard formulas give the parameter count and multiply-add (FLOP) budget of a depthwise-separable layer with kernel size $k_l$, input/output channel counts $c_{in,l}$ and $c_{out,l}$, and output spatial resolution $s_l \times s_l$:

  • Parameter count: $P = \sum_l \left(k_l^2\, c_{in,l} + c_{in,l}\, c_{out,l}\right)$
  • Multiply-adds: $F = \sum_l s_l^2 \left[k_l^2\, c_{in,l} + c_{in,l}\, c_{out,l}\right]$
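As a concrete check, the multiply-add formula can be evaluated in Python for a couple of hypothetical depthwise-separable layers; the cost of each layer splits into the same depthwise (k²·c_in) and pointwise (c_in·c_out) terms. The layer list below is illustrative only, chosen to match the channel counts in the backbone table.

```python
# Each entry: (kernel k, c_in, c_out, output spatial size s)
layers = [
    (5, 24, 24, 64),  # e.g. a stride-1 single BlazeBlock at 64x64
    (5, 24, 48, 32),  # e.g. a stride-2 single BlazeBlock down to 32x32
]

params = sum(k * k * c_in + c_in * c_out for k, c_in, c_out, s in layers)
flops  = sum(s * s * (k * k * c_in + c_in * c_out) for k, c_in, c_out, s in layers)

print(params)  # 2928 parameters
print(flops)   # 6610944 multiply-adds
```

Note how cheap the 5×5 depthwise term (k²·c_in) is relative to a full 5×5 convolution (k²·c_in·c_out), which is what makes the larger kernel nearly free.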

2. Anchor Scheme and Detection Logic

BlazeFace uses an SSD-derived one-stage detection strategy with GPU-optimized anchor assignment.

  • Anchor configuration: Only square (1:1 aspect ratio) anchors are used, reflecting the limited aspect variation in AR face pipelines. Six different anchor scales are mapped exclusively onto the 8×8 feature map (yielding 384 anchors per image).
  • Simplified anchor generation: All anchors are placed at the 8×8 map resolution, eschewing lower-resolution (4×4, 2×2, 1×1) anchor sets characteristic of standard SSD. This approach exploits mobile GPU efficiency for medium-sized tensors and avoids overhead associated with dispatching many small tensors.
  • Box regression: Given an anchor $(x_a, y_a, w_a, h_a)$ and predicted offsets $(\Delta x, \Delta y, \Delta w, \Delta h)$, the final box is parameterized as

$$\hat x = x_a + w_a\,\Delta x,\qquad \hat y = y_a + h_a\,\Delta y,\qquad \hat w = w_a\,\exp(\Delta w),\qquad \hat h = h_a\,\exp(\Delta h)$$

  • Anchor–ground-truth matching: Standard SSD rules are followed—each ground-truth box is matched with the anchor of highest IoU, and all anchors with IoU ≥ 0.5 are assigned that ground-truth. All other anchors are treated as background.
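The anchor layout and box decoding above can be sketched in NumPy. The six anchor scales here are illustrative placeholders (the source does not enumerate the actual scale values), but the anchor count and the decoding equations follow the scheme described.

```python
import numpy as np

def make_anchors(grid=8, scales=(0.15, 0.25, 0.35, 0.5, 0.7, 0.9)):
    """Square anchors (x, y, w, h) in normalized coordinates:
    six scales per cell on a single 8x8 feature map."""
    anchors = []
    for row in range(grid):
        for col in range(grid):
            cx, cy = (col + 0.5) / grid, (row + 0.5) / grid
            for s in scales:
                anchors.append((cx, cy, s, s))
    return np.array(anchors)

def decode(anchor, delta):
    """SSD-style decoding: center offsets scaled by anchor size,
    width/height scaled exponentially."""
    xa, ya, wa, ha = anchor
    dx, dy, dw, dh = delta
    return np.array([xa + wa * dx, ya + ha * dy,
                     wa * np.exp(dw), ha * np.exp(dh)])

anchors = make_anchors()
print(anchors.shape)  # (384, 4): 8 x 8 cells x 6 scales
box = decode(anchors[0], np.zeros(4))  # zero offsets return the anchor itself
```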

This scheme is optimized for mobile GPU throughput and training stability (Bazarevsky et al., 2019).

3. Tie-Resolution and Temporal Stability

Standard SSD-based detectors use Non-Maximum Suppression (NMS) to eliminate redundant detections; however, this can produce visual “jitter” in video streams as different anchors alternate as the selected detection.

  • Blending-based NMS alternative: Rather than keeping only the top-scoring box, BlazeFace performs weighted averaging (“blending”) of all boxes overlapping the highest-scoring detection beyond a threshold (IoU ≥ T, e.g., T = 0.5). The averaged box coordinates take the form:

$$\bar x = \frac{\sum_j \text{score}_j \cdot x_j}{\sum_j \text{score}_j}$$

and similarly for $\bar y$, $\bar w$, $\bar h$, retaining the maximum confidence as the group score.

  • Empirical effect: This operation increases detection accuracy by ≈10% AP on private AR datasets and decreases output “jitter” by 40% (frontal) and 30% (rear), as measured by RMS output difference across small input translations. Its implementation is computationally lightweight, adding only several FMA operations per group.
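A minimal NumPy sketch of the blending procedure, under assumptions not fixed by the source: corner-format boxes (x1, y1, x2, y2), greedy grouping around the current top-scoring detection, and hypothetical helper names.

```python
import numpy as np

def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union

def blending_nms(boxes, scores, thresh=0.5):
    """Replace each suppression group with a score-weighted average box,
    keeping the group's maximum confidence as its score."""
    order = np.argsort(scores)[::-1]
    out = []
    while order.size:
        top = order[0]
        ious = np.array([iou(boxes[top], boxes[j]) for j in order])
        group = order[ious >= thresh]
        w = scores[group] / scores[group].sum()
        out.append(((w[:, None] * boxes[group]).sum(axis=0), scores[top]))
        order = order[ious < thresh]
    return out

boxes = np.array([[0.0, 0.0, 1.0, 1.0],     # two overlapping detections ...
                  [0.05, 0.05, 1.05, 1.05],
                  [5.0, 5.0, 6.0, 6.0]])    # ... and one isolated detection
scores = np.array([0.9, 0.6, 0.5])
results = blending_nms(boxes, scores)
for box, score in results:
    print(np.round(box, 2), score)  # blended pair first, then the lone box
```

Compared with hard NMS, the only extra work per group is the weighted sum, matching the paper's claim that the modification costs just a few FMA operations.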

4. Empirical Evaluation and Benchmarks

BlazeFace exhibits a significant improvement in inference latency and temporal stability compared to MobileNetV2-SSD, while also reducing downstream keypoint regression error.

  • Inference speed (float16, single frame):

    • iPhone XS: BlazeFace 0.6 ms (AP 98.61%) vs. MobileNetV2-SSD 2.1 ms (AP 97.95%).
    • Further device benchmarks:

    | Device | MobileNetV2-SSD (ms) | BlazeFace (ms) |
    |---|---|---|
    | iPhone 7 | 4.2 | 1.8 |
    | iPhone XS | 2.1 | 0.6 |
    | Google Pixel 3 | 7.2 | 3.4 |
    | Huawei P20 | 21.3 | 5.8 |
    | Samsung S9+ | 7.2 | 3.7 |
  • Regression and jitter accuracy (median absolute error, normalized by inter-ocular distance, IOD):

    | Model | Regression error (% IOD) | Jitter (% IOD) |
    |---|---|---|
    | MobileNetV2-SSD | 10.4 | 5.3 |
    | BlazeFace | 7.4 | 3.6 |
  • Ablative findings: Moving from 3×3 to 5×5 depthwise convolutions halves the number of required bottleneck blocks, reducing parameters and GPU overhead. The single 8×8 anchor map decreases small-tensor dispatch requirements by ~60%. Blended NMS adds ~0.5 AP and reduces temporal jitter by 30–40%.

5. Implementation Details and Device-Level Optimizations

  • Kernel fusion: All convolutions are implemented as fused depthwise+pointwise operations (e.g., Metal Performance Shaders on iOS, TFLite GPU on Android).
  • Arithmetic and memory: Float16 arithmetic is employed throughout, minimizing memory footprint and maximizing computational throughput. Weights are arranged in the channels-last (NHWC) layout expected by mobile GPU drivers, avoiding extra transposes.
  • Quantization/pruning: No post-training quantization or pruning is employed; the network’s parameter count (~0.5M) and sub-millisecond latency obviate further reduction steps.
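The float16 memory saving mentioned above is easy to verify with a trivial NumPy illustration (not tied to any particular GPU backend):

```python
import numpy as np

# A single 128x128x3 input frame in NHWC layout (batch of 1)
x32 = np.zeros((1, 128, 128, 3), dtype=np.float32)
x16 = x32.astype(np.float16)

print(x32.nbytes)  # 196608 bytes
print(x16.nbytes)  # 98304 bytes: float16 halves the activation footprint
```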

6. Application Scope and Practical Significance

BlazeFace’s primary utility lies in mobile and embedded contexts requiring efficient, accurate face detection as a precursor for AR tasks, real-time keypoint regression, facial feature analysis, and face region segmentation. Its design demonstrates how task-specific architectural and inference optimizations—specifically, the use of larger depthwise kernels, anchor scheme simplification, and temporally stable box resolution—yield substantial advances in throughput and usability for edge devices (Bazarevsky et al., 2019).

A plausible implication is that the architectural principles exemplified by BlazeFace, namely GPU-friendliness and jitter mitigation without extensive quantization, are transferable to other dense-prediction models targeting real-time applications on resource-constrained hardware.

References

  1. Bazarevsky, V., Kartynnik, Y., Vakunov, A., Raveendran, K., & Grundmann, M. (2019). BlazeFace: Sub-millisecond Neural Face Detection on Mobile GPUs. arXiv:1907.05047.
