SonarWatch: Ultrasonic Sensing for Smartwatches
- SonarWatch is an ultrasonic field-sensing and gesture recognition system that leverages standard smartwatch hardware to detect wrist and hand gestures in diverse acoustic environments.
- It employs a multi-modal approach, combining echoes of ultrasonic chirps captured by two microphones with 9-axis IMU data, classified by a gradient-boosted decision tree model for robust gesture differentiation.
- The system achieves high accuracy across static, dynamic, and fine-motor gestures while maintaining low power consumption, making it viable for always-on, real-world applications.
SonarWatch denotes an ultrasonic field-sensing and gesture-recognition technique for wrist-worn smartwatches, leveraging standard audio transducers and inertial measurement units (IMUs) to enable rich, low-power, continuous interaction without additional hardware. The system synthesizes an active acoustic field (16.5–20 kHz) using the built-in speaker, receives the reflected field via two microphones, and fuses these ultrasonic sensing channels with kinematic data from a 9-axis IMU via a multi-modal machine learning pipeline, yielding robust recognition rates for explicit and fine-motor on-wrist gestures in diverse acoustic environments (Shi et al., 2024).
1. Hardware Architecture and Sensing Modality
SonarWatch repurposes commodity smartwatch hardware—specifically, a standard left-mounted speaker (driven at 48 kHz), two microphones (one bezel-mounted, one under-screen), and a central Hi221 9-axis IMU (sampled at 200 Hz). The system emits a continuous cycle of ultrasonic chirps (16.5 kHz up to 20 kHz and back), partitioned into blocks of 4,096 samples (≈85 ms per “chunk”), at a rate of ~11.7 Hz (Shi et al., 2024). Ultrasonic reflections from the wrist, hand, or proximate objects are transduced by the microphones, digitized (using, e.g., a TASCAM DR-05X as a 48 kHz sound card), and fed, along with IMU data over Bluetooth, to the computing unit (typically a Raspberry Pi 4B or MacBook Pro for experimental setups).
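The transmit side can be sketched in a few lines. The following is an illustrative reconstruction, assuming a linear up-then-down sweep filling one 4,096-sample chunk; window shaping and any inter-chirp gaps in the actual system are not specified here:

```python
import numpy as np
from scipy.signal import chirp

FS = 48_000                  # speaker / sound-card sample rate (Hz)
CHUNK = 4_096                # samples per chunk (~85 ms at 48 kHz)
F_LO, F_HI = 16_500, 20_000  # sweep band (Hz)

def updown_chirp(n: int = CHUNK, fs: int = FS) -> np.ndarray:
    """One chunk: linear sweep F_LO -> F_HI over the first half,
    then F_HI -> F_LO over the second half."""
    half = n // 2
    t = np.arange(half) / fs
    dur = half / fs
    up = chirp(t, f0=F_LO, f1=F_HI, t1=dur, method="linear")
    down = chirp(t, f0=F_HI, f1=F_LO, t1=dur, method="linear")
    return np.concatenate([up, down])

tx = updown_chirp()
print(len(tx) / FS)  # ≈ 0.085 s per chunk, i.e. the ~11.7 Hz rate above
```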
The system diagram is as follows:
- [Speaker driver] → Ultrasonic chirp → [Propagates in air, interacts with hand/wrist/objects] → [Mic preamps] → [Sound card ADC] → [Computing unit]
- [IMU] → [Bluetooth] → [Computing unit]
The wide acoustic beam (θ ≈ λ/D ≈ 2 rad, where λ ≈ 2 cm at 17 kHz and D ≈ 1 cm speaker aperture) provides robust field coverage of the wrist region.
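A quick numerical check of the quoted figures (speed of sound c ≈ 343 m/s assumed):

```python
c = 343.0        # speed of sound in air (m/s), assumed
f = 17_000.0     # representative in-band frequency (Hz)
D = 0.01         # speaker aperture (m)

wavelength = c / f       # ≈ 0.020 m, i.e. ~2 cm
theta = wavelength / D   # ≈ 2.0 rad (~115 degrees), a very wide beam
print(f"lambda ≈ {wavelength:.4f} m, theta ≈ {theta:.2f} rad")
```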
2. Signal Processing and Feature Extraction
The SonarWatch pipeline performs several pre-processing and feature-derivation steps on both the audio and inertial input streams (a consolidated code sketch closes this section):
- Band-pass filtering: Received signals are restricted to the 16.5–20 kHz chirp band, $y(t) = (h * x)(t)$, where the frequency response $H(f)$ is unity in-band and zero otherwise.
- Short-Time Fourier Transform (STFT): $X[m,k] = \sum_{n=0}^{N-1} y[n + mH]\, w[n]\, e^{-j 2\pi k n / N}$, with $N = 4096$ samples and hop $H = N/2$ (50% overlap). The sum of spectral energy in the 16.5–20 kHz band forms a key feature.
- Matched filtering/cross-correlation: $r[k] = \sum_{n} y[n]\, s[n-k]$, correlating the received signal $y$ with the transmitted chirp template $s$ to amplify gesture-specific echo signatures.
- Windowing:
- For static gestures, sliding 300 ms windows (stepping 150 ms); for dynamic gestures, 300 ms windows centered on detected IMU acceleration peaks.
- IMU feature extraction: Statistical moments (mean, variance, skewness, kurtosis, max) of acceleration and angular rate (all axes) and derived Euler angles are computed for each window, yielding a compact motion descriptor.
The achieved range resolution, $\Delta r = c/(2B) \approx 343 / (2 \times 3500) \approx 0.049$ m for the 3.5 kHz sweep bandwidth, is sufficient for gesture discrimination but not detailed pose estimation.
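A consolidated sketch of the steps above, using common SciPy primitives; the filter order, window function, and exact feature layout are assumptions rather than the authors' specification:

```python
import numpy as np
from scipy import signal, stats

FS = 48_000                  # audio sample rate (Hz)
BAND = (16_500, 20_000)      # chirp pass band (Hz)

def bandpass(x: np.ndarray, fs: int = FS) -> np.ndarray:
    """Zero-phase Butterworth band-pass restricted to the chirp band."""
    sos = signal.butter(6, BAND, btype="bandpass", fs=fs, output="sos")
    return signal.sosfiltfilt(sos, x)

def inband_energy(x: np.ndarray, fs: int = FS) -> np.ndarray:
    """Per-frame spectral energy in the 16.5-20 kHz band via the STFT
    (N = 4096, 50% overlap, matching the formula above)."""
    f, t, Z = signal.stft(x, fs=fs, nperseg=4096, noverlap=2048)
    mask = (f >= BAND[0]) & (f <= BAND[1])
    return np.abs(Z[mask]).sum(axis=0)

def matched_filter(x: np.ndarray, template: np.ndarray) -> np.ndarray:
    """Cross-correlate with the transmitted chirp to sharpen echoes."""
    return signal.correlate(x, template, mode="same")

def imu_moments(w: np.ndarray) -> np.ndarray:
    """Per-axis statistical moments over one 300 ms window
    (rows = samples, columns = accel/gyro/Euler channels)."""
    return np.concatenate([
        w.mean(axis=0), w.var(axis=0),
        stats.skew(w, axis=0), stats.kurtosis(w, axis=0),
        w.max(axis=0),
    ])

def fuse(mic_feats: list[np.ndarray], imu_feats: np.ndarray) -> np.ndarray:
    """Concatenate both microphones' audio features with IMU moments."""
    return np.concatenate([*mic_feats, imu_feats])
```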
3. Sensor Fusion and Classification Pipeline
Sensor fusion is achieved by concatenating spectral and statistical audio features from both microphones with IMU-derived descriptors for each 300 ms analysis window.
- Model: LightGBM (gradient-boosted decision trees, leaf-wise growth) for multi-class gesture classification, optimized for the logarithmic loss $\mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C} y_{i,c} \log \hat{p}_{i,c}$, where $y_{i,c}$ is the ground-truth indicator and $\hat{p}_{i,c}$ the predicted probability for class $c$.
- Training: 12 participants (balanced by gender), ten-fold within-participant cross-validation, yielding robust model fit across inter- and intra-participant variability.
Performance metrics include accuracy, precision, recall, and F1 score; confusion matrices highlight both dominant gesture classes and misclassification statistics.
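A minimal training sketch under the stated settings (LightGBM, multi-class log loss, ten-fold cross-validation); the hyperparameters are placeholders, and the within-participant grouping of folds is omitted for brevity:

```python
import lightgbm as lgb
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score

def train_gesture_model(X: np.ndarray, y: np.ndarray, n_classes: int):
    """Ten-fold CV with a leaf-wise GBDT over fused feature windows."""
    params = {
        "objective": "multiclass",
        "num_class": n_classes,
        "metric": "multi_logloss",   # the logarithmic loss above
        "num_leaves": 31,            # placeholder hyperparameters
        "learning_rate": 0.05,
    }
    scores, boosters = [], []
    folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    for tr, te in folds.split(X, y):
        booster = lgb.train(params, lgb.Dataset(X[tr], label=y[tr]),
                            num_boost_round=200)
        pred = booster.predict(X[te]).argmax(axis=1)
        scores.append(accuracy_score(y[te], pred))
        boosters.append(booster)
    return boosters, float(np.mean(scores))
```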
4. Experimental Results and Recognition Performance
Comprehensive evaluation covered 15 explicit gestures, categorized as static (e.g., Wrist-Up, Block Left, Cover Screen), dynamic (e.g., Click Mic, Pinch sides), indirect/fine motor (Pinch, Rotate, Bend), and object/body interactions (e.g., Phone→Watch).
Recognition rates:
| Gesture Group | Recognition Accuracy (%) |
|---|---|
| 12-way direct gestures + no-op | 93.7 |
| Opposite-side hand gestures (6 classes) | 95.76 |
| Indirect same-side fine-motor (5 classes) | 97.6 |
| Body/object interactions (3 classes) | 99.1 |
| Noise environments (17–65 dB SPL) | 92.6–93.7 |
No statistically significant accuracy degradation was observed across environments (lab, outdoors, mall, restaurant; repeated-measures ANOVA). In real-time user testing (8 new users, 10 repetitions per gesture in 30 dB white noise), initial recognition reached 90.5%, with the majority of false activations removable by window-consensus post-filtering (sketched below).
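The window-consensus post-filter can be realized as a majority vote over the last few window predictions; a sketch, with the consensus length k as an assumption:

```python
from collections import Counter, deque

def consensus_filter(predictions, k: int = 3, noop: str = "no-op"):
    """Emit a gesture only when it wins a majority of the last k windows;
    otherwise emit no-op, suppressing one-off spurious activations."""
    recent = deque(maxlen=k)
    for label in predictions:
        recent.append(label)
        winner, count = Counter(recent).most_common(1)[0]
        yield winner if count > k // 2 else noop

# Example: a single spurious "pinch" among no-ops is filtered out.
stream = ["no-op", "pinch", "no-op", "no-op"]
print(list(consensus_filter(stream)))  # all "no-op"
```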
5. Power Consumption and Real-World Deployability
Empirical measurements on production Mi watches indicate SonarWatch drains ≈1.6% of battery over 40 minutes of continuous use (≈0.04%/min), comparable to 24h heart rate monitoring (≈1.5%) and far below continuous music playback (≈6.5% for 40 min), confirming feasibility for always-on use on daily-wear devices. Component power breakdown is additive over idle, sensor, IMU, and BLE loads.
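The per-minute figures follow directly from the reported drain; a quick check using only the numbers above:

```python
def pct_per_min(drain_pct: float, minutes: float) -> float:
    """Average battery drain rate over a measurement interval."""
    return drain_pct / minutes

sonarwatch = pct_per_min(1.6, 40)  # ≈ 0.04 %/min, as reported
music = pct_per_min(6.5, 40)       # ≈ 0.16 %/min, ~4x SonarWatch
```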
6. Limitations, Confounds, and Prospective Enhancements
- Hardware dependence: Current design is constrained to devices exposing accessible speaker and dual mic hardware; only left-wrist, screen-up configurations were validated.
- Gesture taxonomy: Limited to discrete, explicit gestures. Continuous/proportional input (e.g., 2D cursor via Doppler+Tilt) or daily activity inference was not implemented.
- Audio configuration: Single chirp scheme (linear up-down sweep); alternative waveform families (e.g., OFDM, Zadoff-Chu) may improve spatial resolution.
- Sensor topology: Two-mic configuration enabled side-discrimination; multi-mic arrays could furnish hand/pose localization. Sensor placement remains a modifiable parameter with future devices.
- Cross-platform: Only watch-side scenarios were examined, though the approach generalizes to cross-device interactive regimes (e.g., phone–watch coordination).
- Noise robustness: No significant performance loss up to 65 dB SPL; a plausible implication is that SonarWatch would be viable in most consumer or urban settings.
- Extensibility: The feature/decision pipeline admits integration of further modalities (e.g., accessibility, XR) and learning frameworks, and supports multi-watch cooperation.
7. Significance and Future Research Directions
SonarWatch demonstrates that leveraging standard smartwatch transducers and IMU data with data-driven fusion and machine learning enables accurate, low-power, real-time recognition of a broad set of wrist-centric gestures. The approach bypasses the constraints of capacitive or vision-based interaction in compact form factors, without requiring hardware modification. Potential research trajectories include: hybrid chirp and coded excitation for improved spatial separability, continuous control schemes, integration into multi-device and accessibility frameworks, and large-scale, real-world deployment studies to quantify adaptability across populations and device models (Shi et al., 2024).