Earable Acoustic Sensing

Updated 27 December 2025
  • Earable acoustic sensing is a domain leveraging integrated in-ear microphones, speakers, and inertial sensors to capture acoustic, physiological, and environmental signals.
  • It employs both air- and bone-conduction techniques with advanced signal processing and on-device machine learning to enable robust, non-invasive measurements and authentication.
  • Applications range from health monitoring and cognitive load inference to speech enhancement, achieving high precision in tasks like heart-rate detection and gesture recognition.

Earable acoustic sensing is a research-intensive domain concerned with exploiting microphone, speaker, and inertial subsystems integrated in in-ear wearable devices (“earables”) for continuous, non-invasive sensing of physiological, behavioral, acoustic, and environmental signals. The field encompasses a wide variety of methodologies, including air- and bone-conduction acoustic sensing, advanced signal processing, and on-device inference with deep machine learning models. Earable acoustic sensing supports applications in human-computer interaction, health monitoring, cognitive state inference, authentication, and beyond, offering fine-grained measurements in robust, mobile form factors.

1. Physical and Acoustic Sensing Principles

Earable acoustic sensing leverages both air-conducted and bone-conducted transmission paths to capture signals of interest (Hu et al., 6 Jun 2025, He et al., 2 Dec 2025). Air-conduction microphones (MEMS or electret) capture conventional audio signals—speech, environmental sound, or purposely emitted probe tones—transmitted through the ear canal. Bone- and body-conducted sensing, by contrast, captures vibrations propagating through the user’s skull, teeth, or jaw, often amplified by the occlusion effect (boosted low-frequency energy when the ear canal is sealed) (Ma et al., 2021, Wang et al., 2022).

Mathematically, the occluded ear canal is modeled as an acoustic cavity with input impedance $Z_{\mathrm{can}}(\omega) = -jZ_0\cot(kL)$, where $Z_0$ is the characteristic impedance, $k$ the wavenumber, and $L$ the effective canal length. The transfer function from bone-conducted input to in-canal pressure is:

$$H_{\mathrm{occl}}(\omega) = \frac{Z_{\mathrm{can}}(\omega)}{Z_{\mathrm{can}}(\omega)+Z_{\mathrm{src}}(\omega)}$$

where $Z_{\mathrm{src}}$ is the source impedance. In practice, the occlusion effect yields up to 40 dB SNR boost in the sub-1 kHz band (Ma et al., 2021).
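To make the cavity model concrete, a minimal numerical sketch is shown below. It evaluates $|H_{\mathrm{occl}}(\omega)|$ for an assumed rigid-walled canal and a purely resistive source impedance; the canal length, cross-sectional area, and impedance ratio are illustrative assumptions, not parameters from the cited papers.

```python
import numpy as np

# Illustrative constants (assumed, not taken from the cited papers)
c = 343.0          # speed of sound in air, m/s
rho = 1.21         # air density, kg/m^3
L_canal = 0.025    # occluded canal length, m (~2.5 cm, assumed)
area = 4.4e-5      # canal cross-sectional area, m^2 (assumed)
Z0 = rho * c / area          # characteristic acoustic impedance of the canal
Z_src = 5.0 * Z0             # purely resistive source impedance (assumed ratio)

f = np.linspace(50, 4000, 1000)          # frequency axis, Hz
k = 2 * np.pi * f / c                    # wavenumber
Z_can = -1j * Z0 / np.tan(k * L_canal)   # closed-tube impedance, -j*Z0*cot(kL)

H_occl = Z_can / (Z_can + Z_src)         # bone-conducted input -> in-canal pressure
gain_db = 20 * np.log10(np.abs(H_occl))

# |H_occl| is largest where |Z_can| dominates Z_src, i.e. at low frequencies
print(f"gain at {f[0]:.0f} Hz: {gain_db[0]:.1f} dB, "
      f"at {f[-1]:.0f} Hz: {gain_db[-1]:.1f} dB")
```

Under these assumptions the sub-100 Hz band passes nearly unattenuated while higher frequencies are suppressed by tens of dB, which is the qualitative low-frequency emphasis attributed to the occlusion effect.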

Specialized probes, such as stimulus-frequency otoacoustic emissions (SFOAEs), are injected through the speaker and measured via the in-ear mic to assess cochlear responsiveness as a proxy for auditory or cognitive state (Wei et al., 20 Dec 2025). Bone conduction is also exploited using microelectromechanical (MEMS) accelerometers to directly measure vocal tract vibrations during speech, providing robust, ambient-noise-insensitive sensing (He et al., 2 Dec 2025).
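As a simplified illustration of this probe-and-listen pattern (not the exact SFOAE measurement protocol of the cited work), the sketch below generates a pure probe tone for playback through the in-ear speaker and isolates the narrowband microphone response around the probe frequency; the sample rate, probe frequency, and bandwidth are assumed.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

fs = 48_000            # sample rate, Hz (assumed)
f_probe = 1_000.0      # probe tone frequency, Hz (assumed)
duration = 1.0         # seconds

# Probe tone to be emitted by the in-ear speaker
t = np.arange(int(fs * duration)) / fs
probe = 0.1 * np.sin(2 * np.pi * f_probe * t)

def isolate_probe_band(mic, fs, f0, half_bw=50.0):
    """Band-pass the in-ear mic recording around the probe frequency."""
    sos = butter(4, [f0 - half_bw, f0 + half_bw],
                 btype="bandpass", fs=fs, output="sos")
    return sosfiltfilt(sos, mic)

# Placeholder for a real recording: probe plus broadband background
mic = probe + 0.01 * np.random.randn(len(probe))
narrowband = isolate_probe_band(mic, fs, f_probe)

# RMS level in the probe band as a crude responsiveness proxy
rms_db = 20 * np.log10(np.sqrt(np.mean(narrowband**2)) + 1e-12)
print(f"probe-band level: {rms_db:.1f} dBFS")
```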

2. Hardware Architectures and Sensor Integration

State-of-the-art earable systems typically employ a combination of inward-facing and outward-facing MEMS microphones (24-bit ADC, dynamic range ≥100 dB SPL), inertial measurement units (accelerometer/gyro), bone-conduction sensors, and, in some platforms, optical and thermal sensors (Montanari et al., 7 Oct 2024, Hu et al., 6 Jun 2025, He et al., 2 Dec 2025). Inward mics are positioned within the ear canal, flush with the casing, to ensure stable coupling and maximal sensitivity to occlusion-boosted signals and local physiological events (e.g., heart sounds, OAEs).

A representative architecture, as in OmniBuds, consists of:

| Sensor Type | Placement | Primary Roles |
| --- | --- | --- |
| 2× Outward Mics | Shell exterior | Beamforming, ANC, environmental sensing |
| 1× Inward Mic | Facing ear canal | In-ear/occlusion signals, ANC reference |
| IMU (Acc/Gyro) | Near concha/shell | Bone/vocal vibration, gesture tracking |

Audio channels are typically sampled at 16–48 kHz and vibration channels at ≥1.6 kHz; dedicated DSP cores and CNN accelerators allow on-device feature extraction and inference with minimal latency and power overhead (Montanari et al., 7 Oct 2024).
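As a compact summary of the figures above, the sketch below captures a hypothetical channel configuration as a Python dataclass; all field names and default values are illustrative rather than a specification of any cited platform.

```python
from dataclasses import dataclass

@dataclass
class EarableSensorConfig:
    """Hypothetical earable channel configuration mirroring the figures above."""
    outward_mics: int = 2           # shell-exterior microphones
    inward_mics: int = 1            # ear-canal-facing microphone
    imu_axes: int = 6               # accelerometer + gyroscope
    audio_rate_hz: int = 48_000     # audio channels: 16-48 kHz typical
    vibration_rate_hz: int = 1_600  # IMU/vibration channels: >=1.6 kHz
    adc_bits: int = 24              # consistent with >=100 dB SPL dynamic range
    on_device_ml: bool = True       # DSP core / CNN accelerator available

print(EarableSensorConfig())
```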

3. Signal Processing and Feature Extraction Pipelines

Signal acquisition pipelines perform application-specific filtering, modulation, and feature extraction:

  • Preprocessing: Band-pass or low-pass filtering (20–500 Hz for phonocardiogram; <50 Hz for motion; 0–20 kHz for general audio). Adaptive filters and artifact rejection are used to compensate for noise, occlusion variation, and motion artifacts (Hu et al., 6 Jun 2025, Montanari et al., 7 Oct 2024).
  • Feature Extraction: Domain-specific features such as log-mel spectrograms, MFCCs, pitch, spectral entropy, GCC-PHAT (spatial localization), wavelet coefficients, and envelope-based metrics are computed from short-time Fourier transforms or filtered time-domain signals (Montanari et al., 7 Oct 2024, Lee et al., 2023).
  • Occlusion and Bone Path Separation: Dedicated filtering isolates bone-conducted low-frequency bands. In OESense, an inward mic and sealing eartip are used with a <50 Hz filter to extract motion-induced envelope signatures (Ma et al., 2021); a minimal version of this filter-and-envelope pipeline is sketched after this list.
  • Acoustic Probing: Embedded probe tones or chirps are synchronized with task audio for OAE capture. Band-stop and band-pass filtering isolates probe frequencies from broadband signals, enabling the extraction of cochlear emissions as in cognitive load inference (Wei et al., 20 Dec 2025).
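The sketch below composes the preprocessing, envelope, and peak-picking steps referenced above into a single pipeline; the cutoff frequency, refractory period, and threshold are assumed values, not the parameters used by OESense or the other cited systems.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert, find_peaks

def motion_envelope_events(inward_mic, fs, cutoff_hz=50.0,
                           min_interval_s=0.3, height_factor=2.0):
    """Low-pass filter, Hilbert-envelope, and peak-pick an inward-mic stream.

    All parameters are illustrative assumptions.
    """
    # 1. Isolate the occlusion-boosted, body-conducted low-frequency band
    sos = butter(4, cutoff_hz, btype="lowpass", fs=fs, output="sos")
    low = sosfiltfilt(sos, inward_mic)

    # 2. Envelope via the analytic signal
    envelope = np.abs(hilbert(low))

    # 3. Peak picking with a refractory period and an adaptive threshold
    peaks, _ = find_peaks(
        envelope,
        distance=int(min_interval_s * fs),
        height=height_factor * np.median(envelope),
    )
    return peaks

# Usage with a synthetic stand-in for a real recording
fs = 1_600                                # vibration-channel rate, Hz (assumed)
t = np.arange(0, 10, 1 / fs)
impulses = (np.sin(2 * np.pi * 2 * t) > 0.995).astype(float)  # ~2 Hz "steps"
signal = impulses + 0.05 * np.random.randn(len(t))
print("detected events:", len(motion_envelope_events(signal, fs)))
```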

4. Machine Learning and Inference Architectures

Multiple inference architectures are employed, from lightweight classical classifiers to hybrid deep learning models:

  • Classical Approaches: Hilbert-envelope peak detectors for steps and gestures (Ma et al., 2021), SVM/logistic regression for activity recognition, contrastive learning for context-aware scenario recognition (He et al., 3 Apr 2025).
  • Convolutional Neural Networks: CNNs for acoustic event detection, keyword spotting, and heart-sound segmentation, typically operating on mel-spectrograms or wavelet features; a minimal example is sketched after this list. Parameters are quantized and inference is managed by on-device ML engines, yielding sub-50 ms end-to-end latency (Montanari et al., 7 Oct 2024).
  • Hybrid Modal Fusion: Multi-branch encoder-decoder models fuse audio and vibration signals via residual CNNs and dual-path RNNs; auxiliary decoders prevent modality collapse (VibOmni) (He et al., 2 Dec 2025).
  • Adaptive and Continual Learning: On-device SNR estimators guide dynamic inference depth or skip conditions, with continual self-supervised learning to accommodate evolving noise environments and user profiles (He et al., 2 Dec 2025).
  • LLM-Based Zero-Shot Reasoning: Structured features (scenario, event, motion) are input to LLMs for zero-shot or few-shot human activity recognition, fusing contextual and spatial streams (He et al., 3 Apr 2025).
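As a minimal illustration of the CNN branch (not the architecture of any cited system), the sketch below defines a small PyTorch classifier over log-mel spectrogram patches; the input shape, layer sizes, and class count are assumptions.

```python
import torch
import torch.nn as nn

class MelSpecCNN(nn.Module):
    """Tiny CNN over log-mel patches, e.g. (batch, 1, 64 mel bins, 96 frames)."""
    def __init__(self, n_classes: int = 4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.BatchNorm2d(16), nn.ReLU(),
            nn.MaxPool2d(2),                       # 64x96 -> 32x48
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.MaxPool2d(2),                       # 32x48 -> 16x24
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),               # global average pooling
        )
        self.classifier = nn.Linear(64, n_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x).flatten(1))

# Usage: one batch of log-mel patches (shape and values are illustrative)
model = MelSpecCNN(n_classes=4).eval()
dummy = torch.randn(8, 1, 64, 96)          # (batch, channel, mel bins, frames)
with torch.no_grad():
    logits = model(dummy)
print(logits.shape)                        # torch.Size([8, 4])
# For on-device deployment the weights would typically be quantized (e.g., int8)
# and exported to the earbud's ML engine; that step is omitted here.
```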

5. Representative Applications and Performance Benchmarks

Earable acoustic sensing is deployed for a range of user- and context-facing applications:

  • Physiological Monitoring: Phonocardiogram-based heart rate (MAE: 1.8 BPM), blood pressure estimation via systolic time intervals (RMSE: 7.2 mmHg), spirometry and respiration detection (F1: 0.90), OAE-based hearing screening (sens: 100%, spec: 89.7%) (Chan et al., 2022, Montanari et al., 7 Oct 2024, Hu et al., 6 Jun 2025).
  • Cognitive Load and Augmented Cognition: In-ear SFOAE measurement enables real-time inference of cognitive load, with cross-validated prediction accuracy of 75–80% and a correlation of $R \approx 0.8$–$0.9$ between acoustic energy metrics and task demand (Wei et al., 20 Dec 2025).
  • Interaction and Activity Recognition: Step counting (recall: 99.3%), gesture recognition (recall: 97.0%), user authentication via acoustic toothprints (accuracy: 92.9% with 1 gesture), and detection of music-induced reactions (macro F1: 0.90) (Ma et al., 2021, Wang et al., 2022, Lee et al., 2023).
  • Speech Enhancement and Noise Robustness: Bone-conduction-informed speech enhancement yields up to 21% improvement in PESQ and 40% WER reduction in noisy environments (He et al., 2 Dec 2025).
  • Health and Clinical Use: At-home screening for hearing loss via low-cost, open-source earbuds using OAE detection with performance comparable to $8,000 commercial devices (Chan et al., 2022).

6. System Limitations and Open Research Challenges

Despite notable advances, several technical hurdles persist (Hu et al., 6 Jun 2025):

  • Seal and Fit Variation: Signal quality in PCG/PPG/occlusion-based systems can be strongly affected by minute fit variations and acoustic leakage. Real-time leakage estimation and adaptive calibration remain active research problems.
  • Motion and Environment Artifacts: Walking, talking, and ambient noise continue to present challenges for artifact rejection in body-conduction and air-acoustic channels. Resilient algorithms and multi-modal fusion are required for robust real-world use (Hu et al., 6 Jun 2025, He et al., 2 Dec 2025).
  • On-Device Resource Constraints: DSP and ML models are often power-, memory-, and compute-limited in consumer earbuds. Efficient quantized models on dedicated accelerators, adaptive execution, and offloading strategies are critical (Montanari et al., 7 Oct 2024).
  • Generalization Across Hardware: Most studies are confined to custom prototypes; systematic performance studies and toolkits that bridge academic, open-source (e.g., OmniBuds, OpenEarable), and commodity hardware are required for ecosystem convergence.
  • Privacy and Security: Continuous audio capture and processing raise privacy concerns, particularly when leveraging sensitive health or behavioral markers. End-to-end encryption, on-device processing, and policy-compliant design are under exploration.

7. Outlook and Emerging Directions

Hardware trends point toward native support for broadband streaming, symmetric bi-ear deployment, and integration of diverse sensor modalities (bone conduction, optical, barometric, thermal). Further advances are anticipated in standardized multimodal fusion frameworks, cross-device collaboration, and multi-task foundation models for physiological and activity sensing (Hu et al., 6 Jun 2025).

Applications are expected to expand to include ear-centered health biomarker extraction (e.g., vascular stiffness, early pulmonary pathology), adaptive hearing augmentation, and implicit continuous authentication. Usability frontiers include extended battery life, comfort for long-term wear, and privacy-preserving on-device intelligence suitable for widespread public adoption.

