- The paper introduces a neural network framework that reconstructs a detailed 3D facial mesh with 468 vertices from single-camera video.
- The pipeline combines a lightweight face detector, a mesh-regression network that subsamples aggressively in its early layers, and temporal filtering, reaching inference speeds from 100 to over 1000 frames per second depending on the device.
- Prediction quality is comparable to manual annotation, with a mean absolute distance (MAD) error of 3.96% of interocular distance.
Real-time Facial Surface Geometry from Monocular Video on Mobile GPUs
The paper presents a neural network-based solution for inferring a 3D mesh representation of a human face from single-camera video input, optimized for mobile GPUs. The model achieves real-time inference speeds ranging from 100 to over 1000 frames per second depending on device capabilities, while maintaining prediction quality comparable to manual annotations.
Key Contributions and Methodology
The authors propose an end-to-end framework that addresses the challenges of real-time facial mesh reconstruction on mobile platforms, where computational resources are constrained. The model produces a relatively dense mesh of 468 vertices, which is well suited to face-based augmented reality (AR) applications. This diverges from traditional methods that typically rely on sparser sets of roughly 68 facial landmarks.
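The difference in representation is easy to quantify; the sketch below is an illustrative comparison only (the vertex counts come from the paper, everything else is assumed):

```python
import numpy as np

# Illustrative size comparison: a sparse 68-landmark layout versus the
# paper's dense 468-vertex mesh with a depth (z) value per vertex.
SPARSE_LANDMARKS = 68
DENSE_VERTICES = 468

sparse_2d = np.zeros((SPARSE_LANDMARKS, 2), dtype=np.float32)  # (x, y) only
dense_3d = np.zeros((DENSE_VERTICES, 3), dtype=np.float32)     # (x, y, z) per vertex

print(sparse_2d.nbytes, dense_3d.nbytes)  # 544 vs. 5616 bytes per frame
```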
Image Processing Pipeline
The processing pipeline uses a lightweight face detector to locate and align the facial region in each camera frame; a neural network then regresses the 3D coordinates of the mesh vertices from the aligned crop. The architecture subsamples aggressively in its initial layers so that neuron receptive fields cover large areas of the face early on, which conserves computational resources and improves prediction quality.
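A minimal sketch of how these stages might be composed is shown below. The `detector` and `mesh_model` callables, the 192-pixel crop size, and the coordinate conventions are assumptions for illustration, not the paper's exact components:

```python
import cv2
import numpy as np

NUM_VERTICES = 468

def predict_face_mesh(frame: np.ndarray, detector, mesh_model, crop_size: int = 192):
    """Detect a face, crop and align it, and regress a 468 x 3 vertex mesh."""
    box = detector(frame)                            # hypothetical detector -> (x, y, w, h) or None
    if box is None:
        return None
    x, y, w, h = box
    crop = frame[y:y + h, x:x + w]                   # crop the detected face region
    crop = cv2.resize(crop, (crop_size, crop_size))  # align to the network's input size
    raw = mesh_model(crop)                           # flat vector of 468 * 3 values
    mesh = np.asarray(raw, dtype=np.float32).reshape(NUM_VERTICES, 3)
    # Map x and y from crop coordinates back to the original frame.
    mesh[:, 0] = mesh[:, 0] / crop_size * w + x
    mesh[:, 1] = mesh[:, 1] / crop_size * h + y
    return mesh
```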
When applied to video, the model benefits from a temporal filter that mitigates jitter in landmark trajectories caused by frame-to-frame inconsistencies, improving the visual quality of the rendered output in AR applications.
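As an illustration of the idea only (the paper's exact filter is not reproduced here), a simple exponential moving average can smooth per-vertex trajectories across frames:

```python
import numpy as np

class LandmarkSmoother:
    """Exponential-moving-average smoother for landmark trajectories (illustrative)."""

    def __init__(self, alpha: float = 0.5):
        self.alpha = alpha   # higher alpha: less smoothing, lower latency
        self.state = None    # previously smoothed mesh, shape (468, 3)

    def __call__(self, mesh: np.ndarray) -> np.ndarray:
        if self.state is None:
            self.state = mesh.astype(np.float32).copy()
        else:
            self.state = self.alpha * mesh + (1.0 - self.alpha) * self.state
        return self.state
```

Increasing `alpha` reduces lag at the cost of more residual jitter, a trade-off any such filter has to balance.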
Dataset and Training
Training uses a diverse, globally sourced dataset of around 30,000 photos taken under varying conditions, augmented to simulate sensor noise and challenging lighting. Ground-truth annotations are produced through an iterative refinement process in which initial model predictions are semi-automatically adjusted: annotators correct the X and Y coordinates, while the Z coordinate is supervised from synthetic data.
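A hedged sketch of what such mixed supervision could look like as a training objective; the specific error terms and weighting are assumptions for illustration, not the paper's loss:

```python
import numpy as np

def mesh_loss(pred: np.ndarray, gt_xy: np.ndarray, gt_z: np.ndarray, z_weight: float = 1.0) -> float:
    """Combine an X/Y term driven by refined annotations with a Z term driven
    by synthetic ground truth. `pred` has shape (468, 3), `gt_xy` (468, 2),
    and `gt_z` (468,)."""
    xy_err = np.mean(np.abs(pred[:, :2] - gt_xy))  # supervised by human-refined x, y
    z_err = np.mean(np.abs(pred[:, 2] - gt_z))     # supervised by synthetic z
    return float(xy_err + z_weight * z_err)
```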
Performance Evaluation and Results
The model's robustness is evaluated on a geographically diverse dataset, yielding a mean absolute distance (MAD) error of 3.96% of interocular distance for the full model, which runs in real time on mobile devices such as the iPhone XS and Pixel 3. These results represent an efficient trade-off between computational load and prediction accuracy, which is essential for practical deployment in interactive applications.
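For clarity, the normalization can be computed as below; the eye-corner indices used to measure interocular distance are placeholders:

```python
import numpy as np

def normalized_mad(pred_xy: np.ndarray, gt_xy: np.ndarray,
                   left_eye_idx: int, right_eye_idx: int) -> float:
    """Mean absolute distance between predicted and ground-truth 2D landmarks,
    expressed as a percentage of the interocular distance (IOD)."""
    mad = np.mean(np.linalg.norm(pred_xy - gt_xy, axis=-1))
    iod = np.linalg.norm(gt_xy[left_eye_idx] - gt_xy[right_eye_idx])
    return float(100.0 * mad / iod)
```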
Implications and Future Work
The implications of this research are significant for AR applications on mobile platforms, where real-time performance and accuracy are critical. The demonstrated approach enables developers to implement detailed facial interactions and augmentations without the need for specialized hardware beyond standard mobile cameras.
Future developments could explore enhancing the model's generalization to diverse facial geometries and expressions, potentially integrating additional modalities like depth sensors for even finer precision. Furthermore, refining the temporal stability of predictions could increase applicability in more dynamic environments.
Overall, the paper contributes an effective and efficient solution for facial geometry reconstruction on mobile devices, providing a foundation for richer AR experiences.