
FaceOri: Ultrasonic Head Tracking

Updated 17 September 2025
  • FaceOri is a system that uses ultrasonic FMCW chirps captured by earphone microphones to determine head yaw, pitch, and distance with high precision.
  • The system integrates adaptive filtering and geometric reconstruction methods to achieve median absolute errors of 10.9 mm in distance and sub-6° in angular measurements.
  • FaceOri enables privacy-preserving, hands-free interaction for context-aware devices by overcoming limitations of vision-based tracking with robust, non-line-of-sight performance.

FaceOri denotes a system for tracking user head position and orientation—specifically yaw, pitch, and distance—using acoustic ranging on commodity ANC earphones equipped with multiple microphones. Rather than relying on vision-based sensing, FaceOri employs ultrasonic signals emitted from device speakers and captured at the earphone microphones to infer geometric positioning information. This enables hands-free, privacy-preserving, and robust interaction for context-aware devices and environments.

1. Principle of Operation: Ultrasonic Ranging and Geometric Reconstruction

FaceOri leverages Frequency Modulated Continuous Wave (FMCW) ultrasonic chirps transmitted by device speakers. The earphones' microphones receive the chirp after a path-dependent delay. FMCW processing establishes the time-of-flight by analyzing the frequency shift between the reference and received signals. For unsynchronized transmitter/receiver clocks, a calibration protocol aligns the reference peak frequency (f_p^0). Each microphone's measured peak frequency (f_p^d) then yields the propagation distance:

D = c \cdot \frac{(f_p^d - f_p^0) T}{B}

where c is the speed of sound, T the sweep duration, and B the sweep bandwidth.
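
As a minimal illustration of this relationship (a sketch only: the chirp parameters below are placeholder values, not the settings used by FaceOri):

# Range from the FMCW beat-frequency peak shift, D = c * (f_p^d - f_p^0) * T / B.
# Chirp parameters here are illustrative assumptions, not FaceOri's actual configuration.
SPEED_OF_SOUND = 343.0     # c, m/s (approximate, room temperature)
SWEEP_DURATION = 0.02      # T, seconds per sweep (assumed)
SWEEP_BANDWIDTH = 6000.0   # B, Hz swept (assumed)

def fmcw_distance(f_peak_measured, f_peak_reference):
    """Distance implied by the shift between measured and reference peak frequencies."""
    return SPEED_OF_SOUND * (f_peak_measured - f_peak_reference) * SWEEP_DURATION / SWEEP_BANDWIDTH

# Example: a 50 Hz peak shift corresponds to 343 * 50 * 0.02 / 6000 ≈ 0.057 m.
print(fmcw_distance(1050.0, 1000.0))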

With three microphones per earphone (two ANC microphones at similar height and one speech microphone closer to the mouth), the system reconstructs the user's head geometry using planar triangles. Yaw estimation compares the left (d_l) and right (d_r) ANC-microphone distances given the inter-earphone baseline (d_e):

d_m = \frac{1}{2} \sqrt{2 d_l^2 + 2 d_r^2 - d_e^2}

\alpha = \arccos \left( \frac{d_m^2 + \frac{1}{4} d_e^2 - d_r^2}{d_m d_e} \right)

\phi = \alpha - 90^\circ

Here d_m is the distance from the speaker to the midpoint of the ear baseline (the triangle's median), α follows from the law of cosines in the speaker–midpoint–right-ear triangle, and subtracting 90° gives the yaw φ (zero when the head faces the device).

Pitch is obtained analogously by relating ANC and speech mic distances, with the known inter-mic baseline.
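
The yaw computation can be sketched directly from the formulas above; the distances below are made-up inputs, and the clamping and sign convention are implementation assumptions:

import math

def estimate_yaw_deg(d_l, d_r, d_e):
    """Yaw from left/right ANC-mic distances d_l, d_r and the inter-ear baseline d_e (metres).

    d_m is the distance from the speaker to the midpoint of the ear baseline (the median);
    alpha is the angle at that midpoint between the median and the baseline."""
    d_m = 0.5 * math.sqrt(2 * d_l**2 + 2 * d_r**2 - d_e**2)
    cos_alpha = (d_m**2 + 0.25 * d_e**2 - d_r**2) / (d_m * d_e)
    cos_alpha = max(-1.0, min(1.0, cos_alpha))  # clamp against numerical noise
    return math.degrees(math.acos(cos_alpha)) - 90.0  # phi = alpha - 90°; 0 means facing the device

print(estimate_yaw_deg(0.48, 0.52, 0.15))  # ≈ 15.6° for these made-up distances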

2. System Architecture and Processing Pipeline

FaceOri includes:

  • Transmitter Module: A commodity device (e.g., smartphone) emits a triangularly modulated ultrasonic chirp (both up- and down-chirps) to compensate for Doppler effects from head motion (a sketch of this compensation follows the list).
  • Receiver Module: ANC earphones with at least two ANC microphones and a speech microphone receive the signal, digitized via standard audio interfaces.
  • Processing & Algorithms: Audio signals are processed using FMCW mixing and FFT-based peak detection (with adaptive CFAR thresholding for noise robustness; a sketch follows the pipeline diagram below), followed by geometric solution of the user's head position and orientation. Real-time computation is performed by an on-device processor (or, in the prototype, a laptop). For binary attention/orientation detection, extracted features (amplitudes across frequency bands and inter-microphone time differences) are fed into an SVM for classification.
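
One standard way to realize the Doppler compensation from the triangular sweep (a sketch under the common FMCW sign convention, not necessarily FaceOri's exact combination rule) is to average the beat-frequency peaks of the up- and down-chirps, which cancels the motion-induced shift while preserving the range term:

def split_range_and_doppler(f_beat_up, f_beat_down):
    """Assuming f_up = f_range - f_doppler and f_down = f_range + f_doppler,
    the mean recovers the range-only beat frequency and the half-difference
    gives the Doppler shift caused by head motion."""
    f_range = 0.5 * (f_beat_up + f_beat_down)
    f_doppler = 0.5 * (f_beat_down - f_beat_up)
    return f_range, f_doppler

# Hypothetical peaks from one triangular sweep:
print(split_range_and_doppler(980.0, 1020.0))  # -> (1000.0, 20.0)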

The architecture is diagrammed as:

[ Device Speaker ] --> [ Air, Ultrasonic FMCW Chirp ] --> [ Earphone Mic Array ] --> [ Audio Interface ] --> [ Fast DSP: FMCW, FFT, CFAR ] --> [ Geometric Computation (Triangles) ] --> [ Yaw, Pitch, Distance Output ]
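
The CFAR stage can be illustrated with a basic cell-averaging (CA-CFAR) detector over the magnitude spectrum; this is a generic sketch rather than FaceOri's specific adaptive-threshold implementation, and the window sizes and scale factor are placeholder values:

import numpy as np

def ca_cfar_peaks(spectrum, num_train=16, num_guard=4, scale=3.0):
    """Indices of FFT bins exceeding a locally adaptive threshold
    (mean of the surrounding training cells multiplied by `scale`)."""
    half = num_train // 2 + num_guard
    peaks = []
    for i in range(half, len(spectrum) - half):
        left = spectrum[i - half : i - num_guard]           # training cells left of the guard band
        right = spectrum[i + num_guard + 1 : i + half + 1]  # training cells right of the guard band
        noise_level = np.mean(np.concatenate([left, right]))
        if spectrum[i] > scale * noise_level:
            peaks.append(i)
    return peaks

# Toy spectrum: flat noise plus one strong bin; the detector should flag index 128.
rng = np.random.default_rng(0)
magnitude = rng.random(256) + 0.5
magnitude[128] += 10.0
print(ca_cfar_peaks(magnitude))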

3. Calibration, Robustness, and Attention Classification

Calibration is required once per session: the earphone microphone is aligned within 2 mm of the speaker to establish f_p^0 (the reference peak frequency). Adaptive non-coherent integration and CFAR filtering mitigate multipath and Doppler effects, which is crucial under non-line-of-sight (NLOS) conditions and unsteady hand movements.

For binary attention classification (i.e., whether the user is facing the device), FaceOri extracts features from the frequency-domain response of each microphone (20 bands between 17.5 and 23.5 kHz) plus inter-microphone time differences; a support vector machine then achieves robust “attention” state classification independent of distance or explicit calibration.
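
A minimal sketch of such a classifier with scikit-learn: the feature layout (20 band amplitudes per microphone plus inter-microphone time differences) follows the description above, but the synthetic data, feature ordering, and SVM hyperparameters are illustrative assumptions rather than the paper's configuration:

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

NUM_MICS = 3      # two ANC microphones plus one speech microphone (as described above)
NUM_BANDS = 20    # amplitude features per microphone, 17.5-23.5 kHz
NUM_TDOA = 3      # pairwise inter-microphone time differences (assumed pairing)
NUM_FEATURES = NUM_MICS * NUM_BANDS + NUM_TDOA

# Synthetic stand-in data: one row per observation, label 1 = facing the device, 0 = not.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, NUM_FEATURES))
y = rng.integers(0, 2, size=200)

attention_clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
attention_clf.fit(X, y)
print(attention_clf.predict(X[:5]))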

4. Empirical Performance and Quantitative Results

FaceOri has been evaluated using OptiTrack ground truth and user studies:

  • Attention classification: 93.5% average accuracy within 1.5 m range (standard deviation 2.5%).
  • Continuous head tracking: median absolute errors of 10.9 mm (distance), 3.7° (yaw), 5.8° (pitch). For comparison, a CAT baseline yielded 42.0 mm (distance), 11.0° (yaw), 11.6° (pitch) errors. Evaluation covered different device heights, head motion patterns, and user postures, with robust operation under NLOS and motion-induced Doppler.

5. Limitations and Operational Challenges

FaceOri requires a per-session calibration (microphone alignment to obtain the reference frequency) and is sensitive to hardware variability. While the system is robust to NLOS conditions and moderate Doppler, hardware heterogeneity among earphones may affect accuracy; a plausible implication is that broader deployment will require further studies for device-agnostic tuning. Privacy and battery lifetime are improved relative to vision sensors, but recalibration may be needed per battery cycle. For applications requiring sub-degree pose estimation (e.g., medical diagnosis, fine-grained biometrics), RGB/depth vision remains superior.

6. Extensions, Applications, and Future Directions

Potential developments and applications:

  • Calibration minimization: Fusion with camera-, depth-, or Bluetooth-based synchronization may obviate the initial physical alignment.
  • Sensor fusion: Integration with inertial measurement units (IMUs) enables drift correction and continuous tracking beyond the acoustic system’s temporal resolution.
  • On-device processing: DSPs and microcontrollers in modern earphones are technically capable of real-time FMCW demodulation and geometric inference, suggesting that FaceOri could be deployed at scale without external hardware.
  • Multi-device environments: Frequency/code multiplexing strategies could enable attention tracking across several devices in a shared room.

Use cases include touchless device interaction, context-aware display adaptation, biometric attention sensing, mobile and wearable interaction in privacy-sensitive contexts, and activity/gesture recognition.

7. Context within Broader Research and Implications

FaceOri advances head-orientation sensing without cameras, in contrast to most face-tracking and pose-estimation methods. By relying exclusively on acoustic ranging and geometric modeling, it provides a solution for environments where imaging modalities are infeasible or undesirable for privacy reasons. The system achieves tracking precision comparable to vision-based approaches for device-interaction purposes, and its technical architecture generalizes to other domains requiring multi-dimensional user pose sensing via spatially separated sensors.

A plausible implication is that FaceOri serves as a prototype and reference for future head pose sensing frameworks, especially as wearables proliferate and privacy expectations rise. The technical methodology—combining FMCW, geometric modeling, adaptive filtering, and calibration-efficient protocols—may inform subsequent work in sensor fusion and non-vision-based tracking in ubiquitous computing contexts.
