BlazeFace: Sub-millisecond Neural Face Detection on Mobile GPUs (1907.05047v2)

Published 11 Jul 2019 in cs.CV

Abstract: We present BlazeFace, a lightweight and well-performing face detector tailored for mobile GPU inference. It runs at a speed of 200-1000+ FPS on flagship devices. This super-realtime performance enables it to be applied to any augmented reality pipeline that requires an accurate facial region of interest as an input for task-specific models, such as 2D/3D facial keypoint or geometry estimation, facial features or expression classification, and face region segmentation. Our contributions include a lightweight feature extraction network inspired by, but distinct from MobileNetV1/V2, a GPU-friendly anchor scheme modified from Single Shot MultiBox Detector (SSD), and an improved tie resolution strategy alternative to non-maximum suppression.

Citations (268)

Summary

  • The paper introduces BlazeFace, a novel model that achieves sub-millisecond face detection on mobile GPUs while improving overall accuracy.
  • BlazeFace employs a compact network with GPU-friendly anchors and depthwise convolutions to optimize inference speed for real-time applications.
  • Experimental results show a 98.61% average precision with reduced jitter and enhanced regression performance, making it ideal for AR pipelines.

BlazeFace: Sub-millisecond Neural Face Detection on Mobile GPUs

The paper "BlazeFace: Sub-millisecond Neural Face Detection on Mobile GPUs" by Valentin Bazarevsky et al., presents an efficient and effective face detection model tailored for mobile applications, particularly focusing on mobile GPUs. The model, BlazeFace, exemplifies advancements in speed and accuracy, crucial for real-time augmented reality (AR) applications requiring swift and precise face detection.

BlazeFace is built around several contributions aimed at optimizing inference on mobile devices. The authors introduce a compact feature extraction network inspired by, but distinct in design from, MobileNetV1/V2. The detector pairs this backbone with a GPU-friendly anchor scheme adapted from the Single Shot MultiBox Detector (SSD), which improves GPU utilization, and with a tie resolution strategy that replaces non-maximum suppression and stabilizes predictions when detections overlap.

Model Architecture and Optimizations

The BlazeFace architecture is structured around enlarging receptive field sizes while keeping the network lightweight enough for mobile GPUs. Rather than stacking many small convolutions, the model uses 5x5 depthwise convolution kernels so that the receptive field grows quickly per layer. Its building blocks, single and double BlazeBlocks, combine these depthwise convolutions with 1x1 pointwise projections and residual connections, trading a modest increase in per-layer cost for far fewer layers overall, which suits the scheduling characteristics of mobile GPU inference.
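To make the block structure concrete, the sketch below shows one way a single BlazeBlock could look in PyTorch, based on the description above: a 5x5 depthwise convolution, a 1x1 pointwise projection, and a residual shortcut that is pooled and channel-padded when the block downsamples. This is an illustrative reimplementation, not the authors' released code; the channel counts, normalization, and activation placement are assumptions. A double BlazeBlock would stack two such depthwise/pointwise pairs inside one residual connection.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BlazeBlock(nn.Module):
    """Illustrative single BlazeBlock: 5x5 depthwise convolution followed by a
    1x1 pointwise projection, merged with a residual shortcut.
    Channel counts and normalization choices here are hypothetical."""

    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()
        self.stride = stride
        self.channel_pad = out_channels - in_channels
        self.conv = nn.Sequential(
            # Large 5x5 depthwise kernel: grows the receptive field quickly
            # while staying cheap, since each channel is convolved separately.
            nn.Conv2d(in_channels, in_channels, kernel_size=5, stride=stride,
                      padding=2, groups=in_channels, bias=False),
            nn.BatchNorm2d(in_channels),
            # 1x1 pointwise convolution mixes channels and sets the output width.
            nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(out_channels),
        )

    def forward(self, x):
        shortcut = x
        if self.stride == 2:
            # Match the spatial downsampling of the main branch.
            shortcut = F.max_pool2d(shortcut, kernel_size=2, stride=2)
        if self.channel_pad > 0:
            # Zero-pad channels so the residual addition lines up.
            shortcut = F.pad(shortcut, (0, 0, 0, 0, 0, self.channel_pad))
        return F.relu(self.conv(x) + shortcut)

# Example: a stride-2 block that widens 24 -> 48 channels on a 64x64 feature map.
y = BlazeBlock(24, 48, stride=2)(torch.randn(1, 24, 64, 64))
print(y.shape)  # torch.Size([1, 48, 32, 32])
```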

The anchor scheme in BlazeFace deviates from the traditional SSD setup by not downsampling the feature map below 8x8 and concentrating all of its anchors at that resolution, rather than spreading them over progressively coarser levels. This reduces overhead on GPUs, where dispatching a separate operation for each very low-resolution layer carries a significant fixed cost. In addition, anchor aspect ratios are limited to 1:1, which matches the roughly square extent of human faces.
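As a rough illustration of the idea, the snippet below generates 1:1 anchors on a single 8x8 grid, the way a single-resolution scheme would, instead of building an SSD-style multi-scale anchor pyramid. The number of anchors per cell and the scale values are hypothetical placeholders, not values taken from the released model.

```python
import numpy as np

def square_anchors(grid_size=8, scales=(0.25, 0.35, 0.45, 0.55, 0.75, 0.95)):
    """Generate square (1:1) anchors centred on each cell of a single coarse
    grid. All numeric values here are illustrative placeholders."""
    anchors = []
    for gy in range(grid_size):
        for gx in range(grid_size):
            # Anchor centre in normalised [0, 1] image coordinates.
            cx = (gx + 0.5) / grid_size
            cy = (gy + 0.5) / grid_size
            for s in scales:
                # 1:1 aspect ratio: width equals height.
                anchors.append((cx, cy, s, s))
    return np.asarray(anchors, dtype=np.float32)

# 8x8 grid with 6 square anchors per cell -> 384 anchors in total,
# all produced at a single resolution rather than an SSD-style pyramid.
print(square_anchors().shape)  # (384, 4)
```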

Performance and Evaluation

In experimental evaluations, BlazeFace achieves 98.61% average precision for frontal face detection, an improvement over a MobileNetV2-SSD based object detector. Its inference times on flagship devices are also substantially lower than the baseline's, with the iPhone XS reaching sub-millisecond performance at approximately 0.6 ms per frame. The architecture thus delivers adequate detection accuracy while greatly reducing computational latency, a pivotal property for real-time AR applications.

The reported results underscore the model's impact: the blending-based tie resolution yields roughly a 10% improvement in regression parameter prediction accuracy and correspondingly less jitter between consecutive frames, while keeping computational demands low. These results position BlazeFace as a practical component for AR pipelines, striking a strong balance between speed and precision.
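The tie resolution mentioned above replaces hard suppression with blending: heavily overlapping detections are merged into a score-weighted average rather than discarded, which is what reduces temporal jitter. The sketch below is a minimal NumPy illustration of that idea; the IoU threshold and the use of scores as weights are assumptions rather than the paper's exact parameters.

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes, all as (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_box = (box[2] - box[0]) * (box[3] - box[1])
    area_boxes = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_box + area_boxes - inter + 1e-9)

def blend_detections(boxes, scores, iou_thresh=0.3):
    """Blending-style tie resolution: each cluster of overlapping detections is
    replaced by a score-weighted average box instead of a single survivor."""
    order = np.argsort(scores)[::-1]          # process highest-scoring first
    boxes, scores = boxes[order], scores[order]
    used = np.zeros(len(boxes), dtype=bool)
    merged = []
    for i in range(len(boxes)):
        if used[i]:
            continue
        group = (iou(boxes[i], boxes) >= iou_thresh) & ~used
        used |= group
        weights = scores[group][:, None]
        # Weighted average keeps the estimate stable across noisy frames.
        merged.append((boxes[group] * weights).sum(axis=0) / weights.sum())
    return np.asarray(merged, dtype=np.float32)

# Two near-duplicate face boxes collapse into one blended box.
boxes = np.array([[10, 10, 50, 50], [12, 11, 52, 49], [80, 80, 120, 120]], float)
scores = np.array([0.9, 0.8, 0.7])
print(blend_detections(boxes, scores))
```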

Applications and Implications

BlazeFace is applicable to many face-related tasks beyond detection itself. Its outputs can feed 2D/3D facial keypoint estimation, facial feature or expression classification, and face region segmentation. Because the detector also predicts a small set of facial keypoints, subsequent stages of an AR pipeline can receive a face crop that has already been rotated and scaled into a canonical orientation. This lightens the load on downstream models and enables richer, more complex face-related AR tasks without sacrificing performance.

Future Directions

BlazeFace's design suggests several directions for further research, including refining the tie resolution strategy and exploring alternative anchor configurations. The same optimization techniques might also transfer to other object detection tasks on mobile devices, with potential impact on broader real-time video processing and analysis.

In conclusion, BlazeFace represents a significant advance in mobile face detection, combining high accuracy with exceptional speed. The work points toward further developments in AR and consumer applications and reinforces the case for neural networks that run efficiently in lightweight, resource-constrained environments such as mobile devices.