MutterMeter: Dual Bioacoustic Detection
- MutterMeter names two related systems: one automates self-talk detection from earable audio, the other performs heart rate and murmur detection from phonocardiogram signals using deep learning.
- Both employ multi-stage pipelines that combine acoustic feature engineering with neural embeddings; the self-talk system adds a hierarchical, confidence-gated early-exit design to limit processing cost.
- Empirical evaluations show improved Macro-F₁ scores and reduced latency, underscoring the approach's potential for scalable, in situ bioacoustic monitoring.
MutterMeter refers to two distinct but technically related systems—one for automatic self-talk detection via wearable audio (“earables”) and another for state-of-the-art heart rate and murmur detection from phonocardiogram (PCG) signals using deep learning architectures. Each MutterMeter system leverages multi-stage pipelines that integrate acoustic feature engineering, neural embeddings, and hierarchical classification or multi-task architectures to address application-specific challenges involving audible murmurs, acoustic ambiguity, and real-world usability constraints.
1. Self-Talk Detection via Earables
MutterMeter, as described by Lee et al., designates a mobile system for automatic self-talk detection from real-world audio captured by in-ear microphones (earables) (Lee et al., 10 Nov 2025). Self-talk—defined as momentary, self-directed, often incomplete speech used for emotion regulation and cognitive processing—remains challenging to observe in situ due to its sporadic occurrence, low amplitude, and syntactic idiosyncrasies.
1.1 Technical Challenges
Self-talk detection departs from standard speech understanding along several axes:
- Acoustic Diversity: Self-talk varies from nearly inaudible murmurs to emotionally charged exclamations.
- Linguistic Incompleteness: Utterances exhibit grammatical fragmentation, repetition, and omission of standard constituents.
- Temporal Irregularity: Unlike dialogue, self-talk has no structured turn-taking or predictable timing, often occurring in clusters under emotional stress.
Conventional approaches relying on self-reports or post-hoc human annotation are error-prone and unscalable, motivating an objective, real-time measurement solution.
2. System Architecture and Algorithmic Workflow
MutterMeter adopts a three-stage, hierarchical classification pipeline, employing confidence-gated early exits to reduce computational overhead, minimize latency, and adaptively fuse acoustic and linguistic representations (Lee et al., 10 Nov 2025).
2.1 Pipeline Stages
Preprocessing: Continuous 22 050 Hz audio streams are segmented by detecting vocal events (RMS threshold > –20 dB). Utterances longer than 300 ms with inter-frame gaps less than 800 ms are merged, and a rolling context cache (max 30 s) is maintained for context-aware transcription.
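A minimal sketch of this segmentation step, assuming frame-level RMS from librosa and interpreting the thresholds as a gap-merge rule (< 800 ms) plus a minimum utterance duration (> 300 ms); frame and hop sizes are illustrative, and the 30 s context cache is omitted.

```python
import numpy as np
import librosa

SR = 22_050              # sampling rate (source)
RMS_THRESH_DB = -20.0    # vocal-event threshold (source)
MIN_UTT_MS = 300         # minimum utterance duration (source)
MERGE_GAP_MS = 800       # merge segments separated by smaller gaps (source)

def segment_utterances(audio: np.ndarray, frame_ms: int = 25, hop_ms: int = 10):
    """Detect vocal frames by RMS energy and merge them into utterances."""
    frame = int(SR * frame_ms / 1000)
    hop = int(SR * hop_ms / 1000)
    rms = librosa.feature.rms(y=audio, frame_length=frame, hop_length=hop)[0]
    voiced = librosa.amplitude_to_db(rms, ref=1.0) > RMS_THRESH_DB

    # Collect (start, end) sample ranges of contiguous voiced runs.
    segments, start = [], None
    for i, v in enumerate(voiced):
        if v and start is None:
            start = i * hop
        elif not v and start is not None:
            segments.append((start, i * hop))
            start = None
    if start is not None:
        segments.append((start, len(audio)))

    # Merge runs separated by gaps shorter than MERGE_GAP_MS.
    merged = []
    for s, e in segments:
        if merged and (s - merged[-1][1]) < MERGE_GAP_MS * SR // 1000:
            merged[-1] = (merged[-1][0], e)
        else:
            merged.append((s, e))

    # Keep only utterances longer than MIN_UTT_MS.
    min_len = MIN_UTT_MS * SR // 1000
    return [(s, e) for s, e in merged if (e - s) >= min_len]
```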
Acoustic Stage: Each utterance is embedded via an 80-channel log-mel spectrogram front end (STFT) and encoded with a fine-tuned Whisper-base Transformer to yield high-dimensional features. The Locality-Aware Embedding Adaptation (LAEA) module smooths embeddings of utterances proximate in time (ΔT ≤ 4 s) by adaptively weighting each embedding against its temporal neighbours.
Classification proceeds via a three-layer MLP + softmax.
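The paper's exact adaptive-weighting formula is not reproduced in this summary; the sketch below stands in for LAEA with a generic convex combination of each embedding and the mean of its temporal neighbours, where `weight_fn` is a placeholder for the learned weighting.

```python
import numpy as np

NEIGHBOR_WINDOW_S = 4.0  # utterances within ΔT ≤ 4 s count as neighbours (source)

def laea_smooth(embeddings, timestamps, weight_fn):
    """Mix each utterance embedding with the mean of its temporal neighbours.

    `weight_fn(embedding, neighbour_mean) -> alpha in [0, 1]` is a placeholder
    for the paper's learned adaptive weighting.
    """
    embeddings = np.asarray(embeddings, dtype=np.float32)
    timestamps = np.asarray(timestamps, dtype=np.float64)
    smoothed = embeddings.copy()
    for i in range(len(embeddings)):
        near = np.abs(timestamps - timestamps[i]) <= NEIGHBOR_WINDOW_S
        near[i] = False
        if not near.any():
            continue  # isolated utterance: keep the original embedding
        neighbour_mean = embeddings[near].mean(axis=0)
        alpha = weight_fn(embeddings[i], neighbour_mean)
        smoothed[i] = alpha * embeddings[i] + (1.0 - alpha) * neighbour_mean
    return smoothed
```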
Linguistic Stage: Triggered if the acoustic prediction lacks sufficient confidence (margin below the 0.92 acoustic threshold) or for low-confidence 'positive' predictions. Utterances aggregated from the context cache are transcribed with Whisper-large ASR to reduce WER (0.60 vs. 0.82 for single-utterance transcription), then embedded through a fine-tuned Korean BERT encoder and classified with a separate three-layer MLP.
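A rough sketch of this stage under stated assumptions: the open-source `whisper` package for ASR and `klue/bert-base` as a stand-in Korean BERT checkpoint; the paper's fine-tuned encoder and downstream MLP weights are not specified here.

```python
import torch
import whisper
from transformers import AutoModel, AutoTokenizer

asr = whisper.load_model("large")                            # Whisper-large ASR
tokenizer = AutoTokenizer.from_pretrained("klue/bert-base")  # stand-in Korean BERT
encoder = AutoModel.from_pretrained("klue/bert-base")

def linguistic_embedding(context_wav_path: str) -> torch.Tensor:
    """Transcribe the cached context window, then embed it with [CLS] pooling."""
    text = asr.transcribe(context_wav_path, language="ko")["text"]
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state
    # The [CLS] embedding would feed the stage's three-layer MLP classifier.
    return hidden[:, 0]
```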
Fusion Stage: If the linguistic margin is also low (below the 0.80 threshold), the acoustic and linguistic embeddings are projected to a common space and merged with gated fusion.
Classification is performed over the fused representation with a five-layer MLP.
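A minimal PyTorch sketch of one common gated-fusion formulation (a sigmoid gate over the concatenated projections); the paper's exact gating equation is not reproduced in this summary, so dimensions and nonlinearities are assumptions.

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Project acoustic and linguistic embeddings, then mix them with a learned gate."""

    def __init__(self, acoustic_dim: int, linguistic_dim: int, fused_dim: int = 256):
        super().__init__()
        self.proj_a = nn.Linear(acoustic_dim, fused_dim)
        self.proj_l = nn.Linear(linguistic_dim, fused_dim)
        self.gate = nn.Linear(2 * fused_dim, fused_dim)

    def forward(self, e_a: torch.Tensor, e_l: torch.Tensor) -> torch.Tensor:
        a = torch.tanh(self.proj_a(e_a))
        l = torch.tanh(self.proj_l(e_l))
        g = torch.sigmoid(self.gate(torch.cat([a, l], dim=-1)))  # element-wise gate
        return g * a + (1.0 - g) * l  # fused vector, passed to the five-layer MLP
```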
2.2 Confidence-Gated Early-Exit Logic
The system employs a confidence gating strategy to reduce unnecessary computation:
```
(y_a, m_a) = Acoustic(x)
if (y_a in {neg, other}) and m_a >= 0.92:
    return y_a
else:
    (y_l, m_l) = Linguistic(x)
    if (y_l == neg) and m_l >= 0.80:
        return y_l
    else:
        return Fusion(x)
```
Thresholds (0.92 for the acoustic exit, 0.80 for the linguistic exit) are derived empirically from the least-margin distributions.
3. Dataset, Training, and Empirical Evaluation
3.1 Data Collection and Annotation
- 25 recreational/amateur tennis players (22 male, 3 female, mean experience 2.4 y)
- In situ: Three matches each, open-ear wireless earphones, synchronized to video
- 31.1 hours of audio in total; 8 900 self-talk utterances (3.9 h), comprising negative (26%), positive (16%), and other (58%) self-talk with mean durations of 2.0 s, 1.6 s, and 1.4 s, respectively
- Labels: Assigned through manual triage using timestamped audio-video alignment by two raters
3.2 Model Training
- Acoustic: Whisper-base, fine-tuned (batch 32, AdamW, LR 2e–5, early stopping)
- LAEA: MLP, Adam, LR 1e–4
- Linguistic: Whisper-large ASR + fine-tuned Korean BERT (batch 64, LR 2e–5)
- Fusion: batch 64, Adam, LR 1e–4
- Validation: Leave-One-Subject-Out (LOSO) cross-validation (a minimal evaluation loop is sketched after this list)
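An illustrative LOSO evaluation loop using scikit-learn's `LeaveOneGroupOut`; `train_model` and `evaluate_macro_f1` are hypothetical placeholders for the pipeline's actual training and scoring code.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

def loso_macro_f1(features, labels, subject_ids, train_model, evaluate_macro_f1):
    """Train on all subjects but one, test on the held-out subject, average Macro-F1."""
    scores = []
    splitter = LeaveOneGroupOut()
    for train_idx, test_idx in splitter.split(features, labels, groups=subject_ids):
        model = train_model(features[train_idx], labels[train_idx])
        scores.append(evaluate_macro_f1(model, features[test_idx], labels[test_idx]))
    return float(np.mean(scores))
```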
3.3 Performance Metrics and Baselines
Overall Macro-F₁ (test, LOSO):

| Model | Macro-F₁ |
|---|---|
| Acoustic only | 0.80 |
| Linguistic only | 0.70 |
| Fusion only | 0.82 |
| MutterMeter | 0.84 |
- By class: Negative: F₁=0.84, Positive: F₁=0.77, Other: F₁=0.91
- Baseline comparison: LLM- and SER-based methods (Emotion2Vec, TweetNLP, Gemini) all scored substantially below MutterMeter (≤0.69)
- Ablations: LAEA and adaptive fusion yield +5–10% F₁ improvement on positive self-talk and overall
3.4 System Latency
- Preprocessing: 21 ms per 100 ms audio
- Acoustic: 2 015 ms per utterance (majority exit)
- Linguistic: 4 298 ms (minority invoked)
- Fusion: 0.8 ms
- Hierarchical early exits: Reduce average utterance latency from 6 335 ms to 3 713 ms (–41%); a back-of-the-envelope reconstruction of this figure follows the list
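A back-of-the-envelope reconstruction of the reported average: the per-stage times come from the list above, fusion cost is lumped with the linguistic stage (it is negligible at 0.8 ms), and the share of utterances that proceed past the acoustic stage is back-solved here rather than taken from the paper.

```python
# Per-stage latencies reported above (ms per utterance).
T_ACOUSTIC, T_LINGUISTIC, T_FUSION = 2015.0, 4298.0, 0.8
T_FULL = 6335.0        # average without early exits
T_WITH_EXITS = 3713.0  # reported average with early exits

# Residual overhead (preprocessing etc.) implied by the full-pipeline figure.
t_overhead = T_FULL - (T_ACOUSTIC + T_LINGUISTIC + T_FUSION)

# Fraction of utterances that must proceed past the acoustic stage to match
# the reported average with early exits.
p_continue = (T_WITH_EXITS - t_overhead - T_ACOUSTIC) / (T_LINGUISTIC + T_FUSION)
print(f"implied share reaching the linguistic stage ≈ {p_continue:.0%}")  # ≈ 39%
```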
4. Limitations and Areas for Improvement
- Noise Sensitivity: Errors driven by low SNR or overlapping background audio; ambiguous utterances without strong linguistic or prosodic cues may confound classification.
- ASR Reliability: Short or mumbled utterances are transcription-challenged, reducing the effectiveness of the linguistic stage.
- Computation: Acoustic stage incurs ~2 s per utterance latency on-device; continuous operation may impact battery life. ASR offload reduces local computation but introduces privacy and connectivity dependencies.
Potential enhancements identified include deployment of pruned or lightweight ASR on-device, user adaptation via speaker-specific fine-tuning, integration of non-audio physiological signals (accelerometer, PPG), support for silent self-talk via bone conduction or EOG, and generalization to non-sports domains (Lee et al., 10 Nov 2025).
5. Heart Rate and Murmur Detection from PCG: Relation to MutterMeter
A second line of research led by Nie et al. references “MutterMeter” as a framework for combined heart rate estimation and murmur detection in digital stethoscope use cases (Nie et al., 25 Jul 2024). Here, high-fidelity heart sounds (PCG, resampled to 16 kHz, filtered at 2 kHz) are segmented using annotated S1/S2 boundaries. Features extracted per 5 s snippet include Mel spectrogram (40 bands), MFCC (40 coeffs), power spectral density (PSD, Welch), and RMS energy.
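A minimal sketch of the per-snippet feature extraction, assuming librosa and SciPy defaults for frame sizes; resizing and stacking the four feature maps into the 4-channel input image is omitted.

```python
import numpy as np
import librosa
from scipy.signal import welch

SR = 16_000      # PCG resampled to 16 kHz (source)
SNIPPET_S = 5    # 5-second snippets (source)

def pcg_features(snippet: np.ndarray) -> dict:
    """Compute the four feature types used as input channels for one snippet."""
    assert len(snippet) == SR * SNIPPET_S, "expects a 5 s snippet"
    mel = librosa.feature.melspectrogram(y=snippet, sr=SR, n_mels=40)
    mfcc = librosa.feature.mfcc(y=snippet, sr=SR, n_mfcc=40)
    _, psd = welch(snippet, fs=SR)                  # Welch power spectral density
    rms = librosa.feature.rms(y=snippet)[0]         # frame-level RMS energy
    return {"mel": librosa.power_to_db(mel), "mfcc": mfcc, "psd": psd, "rms": rms}
```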
5.1 2D-CNN and Multi-Task Learning
- Input: Four-channel feature “images” (Mel∥MFCC∥PSD∥RMS, shape 4×time×freq)
- CNN Architecture: Five convolutional layers (3×3, ReLU, max-pool), followed by a flattening layer and two parallel heads: one for heart rate classification (141-class, 40–180 bpm), one for murmur detection (binary, sigmoid); a sketch follows this list
- Training: Weighted cross-entropy for HR, weighted BCE for murmur, Adam optimizer (initial LR=1e–3, batch 16, 100 epochs, learning-rate schedulers triggered by val MAE < 2 bpm)
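A PyTorch sketch of the two-head 2D-CNN described above; channel widths are assumptions, and global average pooling stands in for the paper's flattening layer to keep the sketch input-size agnostic.

```python
import torch
import torch.nn as nn

class PCGMultiTaskCNN(nn.Module):
    """Five conv blocks shared by a 141-class heart-rate head and a binary murmur head."""

    def __init__(self, in_channels: int = 4, hr_classes: int = 141):
        super().__init__()
        widths = [16, 32, 64, 128, 128]            # assumed channel widths
        blocks, prev = [], in_channels
        for w in widths:                           # conv 3x3 -> ReLU -> max-pool, five times
            blocks += [nn.Conv2d(prev, w, kernel_size=3, padding=1),
                       nn.ReLU(inplace=True),
                       nn.MaxPool2d(2)]
            prev = w
        self.backbone = nn.Sequential(*blocks)
        self.pool = nn.AdaptiveAvgPool2d(1)        # stands in for the flattening layer
        self.hr_head = nn.Linear(prev, hr_classes) # 40-180 bpm in 1-bpm bins
        self.murmur_head = nn.Linear(prev, 1)      # binary murmur output

    def forward(self, x: torch.Tensor):
        z = torch.flatten(self.pool(self.backbone(x)), start_dim=1)
        return self.hr_head(z), torch.sigmoid(self.murmur_head(z))
```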
5.2 Performance and Benchmarks
| Model | HR MAE (bpm) | Murmur ACC (%) |
|---|---|---|
| PSD-only baseline | 8.28 | n/a |
| TCNN-LSTM baseline | 1.63 | n/a |
| 2D-CNN (single-task) | 1.312 | n/a |
| 2D-CNN-MTL (multi-task) | 1.636 | 97.5 |
Heart-rate estimation meets the AAMI accuracy requirement (error within 10% or 5 bpm), and murmur detection reaches 97.5% accuracy. Using all four feature types together outperforms any subset. Notable limitations include the lack of explicit noise reduction, degraded accuracy at extreme heart rates, the need for 5 s windows (shortening to 3 s increases MAE to ~3.3 bpm), and the treatment of heart rate as classification rather than regression.
6. Broader Significance and Application Domains
The MutterMeter systems collectively establish end-to-end pipelines for automatic physiological and psychological state measurement—either through murmured self-talk or acoustic cardiology. Both approaches emphasize:
- The need for robust acoustic feature engineering coupled with neural representation learning.
- The value of hierarchical or multi-task classification to mediate ambiguous, context-dependent audio.
- Practical trade-offs (on-device versus offloaded computation, latency, and battery demand) that condition real-world feasibility.
A plausible implication is that architectural motifs, such as hierarchical gating and multi-modal fusion, are likely extensible to other in situ bio-acoustic monitoring domains where events are of ambiguous or irregular structure. Additionally, the use of sliding window segmentation, adaptive fusion, and context caches emerges as a recurrent theme in the technical design of both MutterMeter variants.
7. References
- Lee et al. (2025). Enabling Automatic Self-Talk Detection via Earables.
- Nie et al. (2024). Model-driven Heart Rate Estimation and Heart Murmur Detection based on Phonocardiogram.