
MutterMeter: Dual Bioacoustic Detection

Updated 17 November 2025
  • MutterMeter is a dual-purpose system that automates self-talk detection with earable audio and heart murmur detection from phonocardiogram signals using deep learning.
  • It employs a multi-stage, hierarchical pipeline combining acoustic feature engineering, neural embeddings, and confidence-based early exits to optimize processing.
  • Empirical evaluations show improved Macro-F₁ scores and reduced latency, underscoring its potential for scalable, in situ bioacoustic monitoring.

MutterMeter refers to two distinct but technically related systems: one for automatic self-talk detection via wearable audio (“earables”), and another for heart rate and murmur detection from phonocardiogram (PCG) signals using deep learning architectures. Each MutterMeter system leverages multi-stage pipelines that integrate acoustic feature engineering, neural embeddings, and hierarchical classification or multi-task architectures to address application-specific challenges such as low-amplitude or murmured audio, acoustic ambiguity, and real-world usability constraints.

1. Self-Talk Detection via Earables

MutterMeter, as described by Lee et al., designates a mobile system for automatic self-talk detection from real-world audio captured by in-ear microphones (earables) (Lee et al., 10 Nov 2025). Self-talk—defined as momentary, self-directed, often incomplete speech used for emotion regulation and cognitive processing—remains challenging to observe in situ due to its sporadic occurrence, low amplitude, and syntactic idiosyncrasies.

1.1 Technical Challenges

Self-talk detection departs from standard speech understanding along several axes:

  • Acoustic Diversity: Self-talk varies from nearly inaudible murmurs to emotionally charged exclamations.
  • Linguistic Incompleteness: Utterances exhibit grammatical fragmentation, repetition, and omission of standard constituents.
  • Temporal Irregularity: Unlike dialogue, self-talk has no structured turn-taking or predictable timing, often occurring in clusters under emotional stress.

Conventional approaches relying on self-reports or post-hoc human annotation are error-prone and unscalable, motivating an objective, real-time measurement solution.

2. System Architecture and Algorithmic Workflow

MutterMeter adopts a three-stage, hierarchical classification pipeline, employing confidence-gated early exits to reduce computational overhead, minimize latency, and adaptively fuse acoustic and linguistic logits (Lee et al., 10 Nov 2025).

2.1 Pipeline Stages

Preprocessing: Continuous 22 050 Hz audio streams are segmented by detecting vocal events (RMS threshold > –20 dB). Utterances longer than 300 ms with inter-frame gaps less than 800 ms are merged, and a rolling context cache (max 30 s) is maintained for context-aware transcription.
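
As a concrete illustration, the sketch below implements this segmentation heuristic in plain NumPy. The frame and hop lengths and the dBFS reference are illustrative assumptions; only the −20 dB RMS gate, the 300 ms minimum duration, and the 800 ms merge gap come from the description above.

import numpy as np

SR = 22_050              # sampling rate (Hz)
FRAME, HOP = 1024, 512   # analysis frame and hop lengths (samples), assumed

def rms_db(frame: np.ndarray) -> float:
    """Frame RMS expressed in dB relative to full scale."""
    rms = np.sqrt(np.mean(frame ** 2) + 1e-12)
    return 20.0 * np.log10(rms)

def segment_utterances(audio: np.ndarray,
                       gate_db: float = -20.0,
                       min_len_s: float = 0.3,
                       merge_gap_s: float = 0.8) -> list[tuple[float, float]]:
    """Gate frames on RMS, merge events separated by short gaps, and keep
    events longer than the minimum duration. Returns (start, end) times in seconds."""
    voiced = [(i / SR, (i + FRAME) / SR)
              for i in range(0, len(audio) - FRAME, HOP)
              if rms_db(audio[i:i + FRAME]) > gate_db]
    merged: list[list[float]] = []
    for start, end in voiced:
        if merged and start - merged[-1][1] < merge_gap_s:
            merged[-1][1] = end      # bridge the short silence
        else:
            merged.append([start, end])
    return [(s, e) for s, e in merged if e - s >= min_len_s]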

Acoustic Stage: Each utterance is embedded via an 80-channel log-mel spectrogram front end (STFT), encoded using a fine-tuned Whisper-base Transformer to yield high-dimensional features. The Locality-Aware Embedding Adaptation (LAEA) module smooths embeddings for utterances proximate in time (ΔT ≤ 4 s) using adaptive weighting:

$\alpha = \mathrm{MLP}([e_{\text{curr}}; e_{\text{prev}}]), \qquad e_{\text{adapted}} = \alpha\, e_{\text{curr}} + (1-\alpha)\, e_{\text{prev}}$

Classification proceeds via a three-layer MLP + softmax.
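
A minimal PyTorch sketch of the LAEA weighting and the classification head is shown below. The 512-dimensional embeddings, hidden widths, and the sigmoid used to keep $\alpha$ in $[0,1]$ are assumptions; only the adaptive weighting rule follows the equation above.

import torch
import torch.nn as nn

class LAEA(nn.Module):
    """Locality-Aware Embedding Adaptation: blend the current embedding with the
    previous one when the two utterances are close in time (ΔT ≤ 4 s)."""
    def __init__(self, dim: int = 512):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(),
                                 nn.Linear(dim, 1), nn.Sigmoid())

    def forward(self, e_curr, e_prev, delta_t: float):
        if e_prev is None or delta_t > 4.0:
            return e_curr                          # no temporal neighbour: pass through
        alpha = self.mlp(torch.cat([e_curr, e_prev], dim=-1))
        return alpha * e_curr + (1 - alpha) * e_prev

# Three-layer MLP + softmax over {negative, positive, other}
acoustic_head = nn.Sequential(nn.Linear(512, 256), nn.ReLU(),
                              nn.Linear(256, 64), nn.ReLU(),
                              nn.Linear(64, 3))

e_adapted = LAEA()(torch.randn(1, 512), torch.randn(1, 512), delta_t=2.0)
probs = torch.softmax(acoustic_head(e_adapted), dim=-1)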

Linguistic Stage: Triggered if the acoustic prediction lacks sufficient confidence (margin $< \tau_a$) or if the acoustic stage predicts the ‘positive’ class (cf. the early-exit logic in Section 2.2). Utterances aggregated from the context cache are transcribed with Whisper-large ASR to reduce WER (0.60 vs. 0.82 for single-utterance transcription), then embedded with a fine-tuned Korean BERT encoder and classified by a separate three-layer MLP.
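
A hedged sketch of this stage using Hugging Face transformers appears below. The exact checkpoints (a Whisper-large variant and klue/bert-base as the Korean BERT) are assumptions, and the audio is assumed to be resampled to 16 kHz before transcription; only the transcribe-then-embed-then-classify structure follows the text.

import numpy as np
import torch
from transformers import AutoModel, AutoTokenizer, pipeline

asr = pipeline("automatic-speech-recognition", model="openai/whisper-large-v2")  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained("klue/bert-base")                       # assumed Korean BERT
encoder = AutoModel.from_pretrained("klue/bert-base")

def linguistic_logits(context_audio: np.ndarray, classifier: torch.nn.Module) -> torch.Tensor:
    """Transcribe the aggregated context-cache audio, embed the transcript with
    the BERT encoder, and score it with the three-layer MLP head."""
    text = asr({"raw": context_audio, "sampling_rate": 16_000})["text"]
    tokens = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        embedding = encoder(**tokens).last_hidden_state[:, 0]   # [CLS] token embedding
    return classifier(embedding)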

Fusion Stage: If the linguistic margin is also low ($< \tau_l$), acoustic and linguistic embeddings are projected and merged with gated fusion:

$g = \sigma(W_g [e_a; e_t] + b), \qquad z = g \odot e_a + (1-g) \odot e_t$

Classification is performed over the fused representation with a five-layer MLP.
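
A minimal PyTorch sketch of the fusion stage follows; the projection width and MLP layer sizes are illustrative assumptions, only the gated-fusion rule and the five-layer classifier follow the description above.

import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    def __init__(self, dim_a: int = 512, dim_t: int = 768, dim: int = 256, n_classes: int = 3):
        super().__init__()
        self.proj_a = nn.Linear(dim_a, dim)          # project acoustic embedding e_a
        self.proj_t = nn.Linear(dim_t, dim)          # project linguistic embedding e_t
        self.gate = nn.Linear(2 * dim, dim)          # W_g [e_a; e_t] + b
        self.head = nn.Sequential(                   # five-layer MLP classifier
            nn.Linear(dim, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, 32), nn.ReLU(),
            nn.Linear(32, n_classes))

    def forward(self, e_a, e_t):
        a, t = self.proj_a(e_a), self.proj_t(e_t)
        g = torch.sigmoid(self.gate(torch.cat([a, t], dim=-1)))
        z = g * a + (1 - g) * t                      # elementwise gated fusion
        return self.head(z)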

2.2 Confidence-Gated Early-Exit Logic

The system employs a confidence gating strategy to reduce unnecessary computation:

def classify(x):
    # Stage 1: acoustic classifier returns a label and a confidence margin.
    y_a, m_a = Acoustic(x)
    if y_a in {"neg", "other"} and m_a >= 0.92:   # tau_a: exit early on confident neg/other
        return y_a
    # Stage 2: linguistic classifier over the ASR transcript of the context cache.
    y_l, m_l = Linguistic(x)
    if y_l == "neg" and m_l >= 0.80:              # tau_l: exit if text evidence is confident
        return y_l
    # Stage 3: gated fusion of acoustic and linguistic embeddings.
    return Fusion(x)

Thresholds ($\tau_a = 0.92$, $\tau_l = 0.80$) are derived empirically from the least-margin distributions.
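
One common reading of the confidence margin is the gap between the two largest softmax probabilities; the sketch below shows how such margins could be computed and how an exit threshold could be read off the empirical margin distribution. The margin definition and the percentile rule are assumptions, not the paper's stated procedure.

import numpy as np

def margins(probs: np.ndarray) -> np.ndarray:
    """Top-1 minus top-2 softmax probability for each utterance (rows of `probs`)."""
    top2 = np.sort(probs, axis=-1)[:, -2:]
    return top2[:, 1] - top2[:, 0]

def pick_threshold(margins_of_errors: np.ndarray, percentile: float = 95.0) -> float:
    """Pick a threshold so that most misclassified utterances fall below it
    and are escalated to the next stage instead of exiting early."""
    return float(np.percentile(margins_of_errors, percentile))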

3. Dataset, Training, and Empirical Evaluation

3.1 Data Collection and Annotation

  • 25 recreational/amateur tennis players (22 male, 3 female, mean experience 2.4 y)
  • In situ: Three matches each, open-ear wireless earphones, synchronized to video
  • 31.1 hours of audio in total; 8 900 utterances (3.9 h), subdivided into negative (26%), positive (16%), and other (58%) self-talk, with mean durations of 2.0 s, 1.6 s, and 1.4 s, respectively
  • Labels: Assigned through manual triage using timestamped audio-video alignment by two raters

3.2 Model Training

  • Acoustic: Whisper-base, fine-tuned (batch 32, AdamW, LR 2e–5, early stopping)
  • LAEA: MLP, Adam, LR 1e–4
  • Linguistic: Whisper-large ASR + fine-tuned Korean BERT (batch 64, LR 2e–5)
  • Fusion: batch 64, Adam, LR 1e–4
  • Validation: Leave-One-Subject-Out (LOSO) cross-validation (sketched below)
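
For reference, leave-one-subject-out evaluation can be expressed with scikit-learn's LeaveOneGroupOut, as sketched below; `train_fn` and `eval_fn` are hypothetical placeholders for the per-fold training and Macro-F₁ scoring routines.

import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

def loso_scores(features, labels, subject_ids, train_fn, eval_fn):
    """Hold out one subject per fold, train on the rest, and collect per-subject scores."""
    scores = []
    for train_idx, test_idx in LeaveOneGroupOut().split(features, labels, groups=subject_ids):
        model = train_fn(features[train_idx], labels[train_idx])
        scores.append(eval_fn(model, features[test_idx], labels[test_idx]))
    return float(np.mean(scores)), scores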

3.3 Performance Metrics and Baselines

Overall Macro-F₁ (test, LOSO):

| Model           | Macro-F₁ |
|-----------------|----------|
| Acoustic only   | 0.80     |
| Linguistic only | 0.70     |
| Fusion only     | 0.82     |
| MutterMeter     | 0.84     |

  • By class: Negative: F₁=0.84, Positive: F₁=0.77, Other: F₁=0.91
  • Baseline comparison: LLM- and SER-based methods (Emotion2Vec, TweetNLP, Gemini) all scored substantially below MutterMeter (≤0.69)
  • Ablations: LAEA and adaptive fusion yield +5–10% F₁ improvement on positive self-talk and overall

3.4 System Latency

  • Preprocessing: 21 ms per 100 ms audio
  • Acoustic: 2 015 ms per utterance (most utterances exit at this stage)
  • Linguistic: 4 298 ms (invoked for a minority of utterances)
  • Fusion: 0.8 ms
  • Hierarchical early exits: Reduce average utterance latency from 6 335 ms to 3 713 ms (–41%)

4. Limitations and Areas for Improvement

  • Noise Sensitivity: Errors driven by low SNR or overlapping background audio; ambiguous utterances without strong linguistic or prosodic cues may confound classification.
  • ASR Reliability: Short or mumbled utterances are transcription-challenged, reducing the effectiveness of the linguistic stage.
  • Computation: Acoustic stage incurs ~2 s per utterance latency on-device; continuous operation may impact battery life. ASR offload reduces local computation but introduces privacy and connectivity dependencies.

Potential enhancements identified include deployment of pruned or lightweight ASR on-device, user adaptation via speaker-specific fine-tuning, integration of non-audio physiological signals (accelerometer, PPG), support for silent self-talk via bone conduction or EOG, and generalization to non-sports domains (Lee et al., 10 Nov 2025).

5. Heart Rate and Murmur Detection from PCG: Relation to MutterMeter

A second line of research led by Nie et al. references “MutterMeter” as a framework for combined heart rate estimation and murmur detection in digital stethoscope use cases (Nie et al., 25 Jul 2024). Here, high-fidelity heart sounds (PCG, resampled to 16 kHz, filtered at 2 kHz) are segmented using annotated S1/S2 boundaries. Features extracted per 5 s snippet include Mel spectrogram (40 bands), MFCC (40 coeffs), power spectral density (PSD, Welch), and RMS energy.
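
A hedged sketch of this per-snippet feature extraction with librosa and SciPy is given below; the band counts follow the text, while hop sizes, the Welch segment length, and the 2 kHz Mel cutoff are assumptions.

import numpy as np
import librosa
from scipy.signal import welch

SR = 16_000   # PCG resampled to 16 kHz

def pcg_features(snippet: np.ndarray):
    """Extract the four per-snippet features for a 5 s PCG window."""
    mel = librosa.feature.melspectrogram(y=snippet, sr=SR, n_mels=40, fmax=2_000)
    mfcc = librosa.feature.mfcc(y=snippet, sr=SR, n_mfcc=40)
    _, psd = welch(snippet, fs=SR, nperseg=1024)   # Welch power spectral density
    rms = librosa.feature.rms(y=snippet)           # frame-level RMS energy
    return mel, mfcc, psd, rms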

5.1 2D-CNN and Multi-Task Learning

  • Input: Four-channel feature “images” (Mel∥MFCC∥PSD∥RMS, shape 4×time×freq)
  • CNN Architecture: Five convolutional layers (3×3 kernels, ReLU, max-pooling), followed by a flattening layer and two parallel heads: one for heart rate classification (141 classes, 40–180 bpm) and one for murmur detection (binary, sigmoid); see the sketch after this list
  • Training: Weighted cross-entropy for HR, weighted BCE for murmur, Adam optimizer (initial LR=1e–3, batch 16, 100 epochs, learning-rate schedulers triggered by val MAE < 2 bpm)
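
The sketch below is one plausible PyTorch rendering of this architecture; the channel widths and the adaptive pooling before the flattening layer are assumptions made to keep the example self-contained, while the five 3×3 convolution/max-pool blocks and the two task heads follow the description above.

import torch
import torch.nn as nn

class PCGMultiTaskCNN(nn.Module):
    def __init__(self, n_hr_classes: int = 141):
        super().__init__()
        chans = [4, 16, 32, 64, 64, 64]            # input: 4 feature channels (assumed widths)
        blocks = []
        for c_in, c_out in zip(chans[:-1], chans[1:]):
            blocks += [nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
                       nn.ReLU(),
                       nn.MaxPool2d(2)]
        self.backbone = nn.Sequential(*blocks, nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.hr_head = nn.Linear(chans[-1], n_hr_classes)   # 40–180 bpm, 141 classes
        self.murmur_head = nn.Linear(chans[-1], 1)          # binary murmur logit (sigmoid at loss)

    def forward(self, x):                                   # x: (batch, 4, time, freq)
        h = self.backbone(x)
        return self.hr_head(h), self.murmur_head(h)

# Joint objective: weighted cross-entropy on the HR logits plus weighted BCE on the murmur logit.
model = PCGMultiTaskCNN()
hr_logits, murmur_logit = model(torch.randn(2, 4, 128, 64))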

5.2 Performance and Benchmarks

| Model                    | HR MAE (bpm) | Murmur ACC (%) |
|--------------------------|--------------|----------------|
| PSD-only baseline        | 8.28         | –              |
| TCNN-LSTM baseline       | 1.63         | –              |
| 2D-CNN (single-task)     | 1.312        | –              |
| 2D-CNN-MTL (multi-task)  | 1.636        | 97.5           |

Heart rate estimation meets the AAMI accuracy requirement (error ≤10% or ≤5 bpm), and murmur detection reaches 97.5% accuracy. Using all four features together outperforms any subset. Notable limitations include the lack of explicit noise reduction, degraded performance at extreme heart rates, dependence on 5 s windows (shortening to 3 s raises MAE to roughly 3.3 bpm), and the treatment of heart rate as classification rather than regression.

6. Broader Significance and Application Domains

The MutterMeter systems collectively establish end-to-end pipelines for automatic physiological and psychological state measurement—either through murmured self-talk or acoustic cardiology. Both approaches emphasize:

  • The need for robust acoustic feature engineering coupled with neural representation learning.
  • The value of hierarchical or multi-task classification to mediate ambiguous, context-dependent audio.
  • Practical trade-offs (on-device versus offloaded computation, latency, and battery demand) that condition real-world feasibility.

A plausible implication is that architectural motifs, such as hierarchical gating and multi-modal fusion, are likely extensible to other in situ bio-acoustic monitoring domains where events are of ambiguous or irregular structure. Additionally, the use of sliding window segmentation, adaptive fusion, and context caches emerges as a recurrent theme in the technical design of both MutterMeter variants.

7. References

  • Lee et al., 10 Nov 2025: self-talk detection from earable audio (Sections 1–4).
  • Nie et al., 25 Jul 2024: heart rate and murmur detection from phonocardiogram signals (Section 5).