MutterMeter: Dual Bioacoustic Detection
- MutterMeter names two related systems: one automates self-talk detection from earable audio, the other performs heart rate and murmur detection from phonocardiogram signals using deep learning.
- Both employ multi-stage pipelines that combine acoustic feature engineering with neural embeddings; the self-talk system adds a hierarchical, confidence-gated early-exit design to limit processing cost.
- Empirical evaluations show improved Macro-F₁ scores and reduced latency, underscoring the approach's potential for scalable, in situ bioacoustic monitoring.
MutterMeter refers to two distinct but technically related systems—one for automatic self-talk detection via wearable audio (“earables”) and another for state-of-the-art heart rate and murmur detection from phonocardiogram (PCG) signals using deep learning architectures. Each MutterMeter system leverages multi-stage pipelines that integrate acoustic feature engineering, neural embeddings, and hierarchical classification or multi-task architectures to address application-specific challenges involving audible murmurs, acoustic ambiguity, and real-world usability constraints.
1. Self-Talk Detection via Earables
MutterMeter, as described by Lee et al., designates a mobile system for automatic self-talk detection from real-world audio captured by in-ear microphones (earables) (Lee et al., 10 Nov 2025). Self-talk—defined as momentary, self-directed, often incomplete speech used for emotion regulation and cognitive processing—remains challenging to observe in situ due to its sporadic occurrence, low amplitude, and syntactic idiosyncrasies.
1.1 Technical Challenges
Self-talk detection departs from standard speech understanding along several axes:
- Acoustic Diversity: Self-talk varies from nearly inaudible murmurs to emotionally charged exclamations.
- Linguistic Incompleteness: Utterances exhibit grammatical fragmentation, repetition, and omission of standard constituents.
- Temporal Irregularity: Unlike dialogue, self-talk has no structured turn-taking or predictable timing, often occurring in clusters under emotional stress.
Conventional approaches relying on self-reports or post-hoc human annotation are error-prone and unscalable, motivating an objective, real-time measurement solution.
2. System Architecture and Algorithmic Workflow
MutterMeter adopts a three-stage, hierarchical classification pipeline, employing confidence-gated early exits to reduce computational overhead, minimize latency, and adaptively fuse acoustic and linguistic representations (Lee et al., 10 Nov 2025).
2.1 Pipeline Stages
Preprocessing: Continuous 22 050 Hz audio streams are segmented by detecting vocal events (RMS threshold > –20 dB). Utterances longer than 300 ms with inter-frame gaps less than 800 ms are merged, and a rolling context cache (max 30 s) is maintained for context-aware transcription.
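A minimal sketch of this segmentation step, assuming frame-level RMS from librosa and interpreting the thresholds as a gap-merge rule (< 800 ms) plus a minimum utterance duration (> 300 ms); frame and hop sizes are illustrative, and the 30 s context cache is omitted.

```python
import numpy as np
import librosa

SR = 22_050              # sampling rate (source)
RMS_THRESH_DB = -20.0    # vocal-event threshold (source)
MIN_UTT_MS = 300         # minimum utterance duration (source)
MERGE_GAP_MS = 800       # merge segments separated by smaller gaps (source)

def segment_utterances(audio: np.ndarray, frame_ms: int = 25, hop_ms: int = 10):
    """Detect vocal frames by RMS energy and merge them into utterances."""
    frame = int(SR * frame_ms / 1000)
    hop = int(SR * hop_ms / 1000)
    rms = librosa.feature.rms(y=audio, frame_length=frame, hop_length=hop)[0]
    voiced = librosa.amplitude_to_db(rms, ref=1.0) > RMS_THRESH_DB

    # Collect (start, end) sample ranges of contiguous voiced runs.
    segments, start = [], None
    for i, v in enumerate(voiced):
        if v and start is None:
            start = i * hop
        elif not v and start is not None:
            segments.append((start, i * hop))
            start = None
    if start is not None:
        segments.append((start, len(audio)))

    # Merge runs separated by gaps shorter than MERGE_GAP_MS.
    merged = []
    for s, e in segments:
        if merged and (s - merged[-1][1]) < MERGE_GAP_MS * SR // 1000:
            merged[-1] = (merged[-1][0], e)
        else:
            merged.append((s, e))

    # Keep only utterances longer than MIN_UTT_MS.
    min_len = MIN_UTT_MS * SR // 1000
    return [(s, e) for s, e in merged if (e - s) >= min_len]
```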
Acoustic Stage: Each utterance is embedded via an 80-channel log-mel spectrogram front end (STFT) and encoded with a fine-tuned Whisper-base Transformer to yield high-dimensional features. The Locality-Aware Embedding Adaptation (LAEA) module smooths embeddings of utterances proximate in time (ΔT ≤ 4 s) by adaptively weighting each embedding against its temporal neighbours.
Classification proceeds via a three-layer MLP + softmax.
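The paper's exact adaptive-weighting formula is not reproduced in this summary; the sketch below stands in for LAEA with a generic convex combination of each embedding and the mean of its temporal neighbours, where `weight_fn` is a placeholder for the learned weighting.

```python
import numpy as np

NEIGHBOR_WINDOW_S = 4.0  # utterances within ΔT ≤ 4 s count as neighbours (source)

def laea_smooth(embeddings, timestamps, weight_fn):
    """Mix each utterance embedding with the mean of its temporal neighbours.

    `weight_fn(embedding, neighbour_mean) -> alpha in [0, 1]` is a placeholder
    for the paper's learned adaptive weighting.
    """
    embeddings = np.asarray(embeddings, dtype=np.float32)
    timestamps = np.asarray(timestamps, dtype=np.float64)
    smoothed = embeddings.copy()
    for i in range(len(embeddings)):
        near = np.abs(timestamps - timestamps[i]) <= NEIGHBOR_WINDOW_S
        near[i] = False
        if not near.any():
            continue  # isolated utterance: keep the original embedding
        neighbour_mean = embeddings[near].mean(axis=0)
        alpha = weight_fn(embeddings[i], neighbour_mean)
        smoothed[i] = alpha * embeddings[i] + (1.0 - alpha) * neighbour_mean
    return smoothed
```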
Linguistic Stage: Triggered if the acoustic prediction lacks sufficient confidence (margin below the 0.92 acoustic threshold) or for low-confidence 'positive' predictions. Utterances aggregated from the context cache are transcribed with Whisper-large ASR to reduce WER (0.60 vs. 0.82 for single-utterance transcription), then embedded through a fine-tuned Korean BERT encoder and classified with a separate three-layer MLP.
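A rough sketch of this stage under stated assumptions: the open-source `whisper` package for ASR and `klue/bert-base` as a stand-in Korean BERT checkpoint; the paper's fine-tuned encoder and downstream MLP weights are not specified here.

```python
import torch
import whisper
from transformers import AutoModel, AutoTokenizer

asr = whisper.load_model("large")                            # Whisper-large ASR
tokenizer = AutoTokenizer.from_pretrained("klue/bert-base")  # stand-in Korean BERT
encoder = AutoModel.from_pretrained("klue/bert-base")

def linguistic_embedding(context_wav_path: str) -> torch.Tensor:
    """Transcribe the cached context window, then embed it with [CLS] pooling."""
    text = asr.transcribe(context_wav_path, language="ko")["text"]
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state
    # The [CLS] embedding would feed the stage's three-layer MLP classifier.
    return hidden[:, 0]
```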
Fusion Stage: If the linguistic margin is also low (below the 0.80 threshold), the acoustic and linguistic embeddings are projected to a common space and merged with gated fusion.
Classification is performed over the fused representation with a five-layer MLP.
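A minimal PyTorch sketch of one common gated-fusion formulation (a sigmoid gate over the concatenated projections); the paper's exact gating equation is not reproduced in this summary, so dimensions and nonlinearities are assumptions.

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Project acoustic and linguistic embeddings, then mix them with a learned gate."""

    def __init__(self, acoustic_dim: int, linguistic_dim: int, fused_dim: int = 256):
        super().__init__()
        self.proj_a = nn.Linear(acoustic_dim, fused_dim)
        self.proj_l = nn.Linear(linguistic_dim, fused_dim)
        self.gate = nn.Linear(2 * fused_dim, fused_dim)

    def forward(self, e_a: torch.Tensor, e_l: torch.Tensor) -> torch.Tensor:
        a = torch.tanh(self.proj_a(e_a))
        l = torch.tanh(self.proj_l(e_l))
        g = torch.sigmoid(self.gate(torch.cat([a, l], dim=-1)))  # element-wise gate
        return g * a + (1.0 - g) * l  # fused vector, passed to the five-layer MLP
```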
2.2 Confidence-Gated Early-Exit Logic
The system employs a confidence gating strategy to reduce unnecessary computation:
```
(y_a, m_a) = Acoustic(x)
if (y_a in {neg, other}) and m_a >= 0.92:
    return y_a
else:
    (y_l, m_l) = Linguistic(x)
    if (y_l == neg) and m_l >= 0.80:
        return y_l
    else:
        return Fusion(x)
```
Thresholds (0.92 for the acoustic exit, 0.80 for the linguistic exit) are derived empirically from the least-margin distributions.
3. Dataset, Training, and Empirical Evaluation
3.1 Data Collection and Annotation
- 25 recreational/amateur tennis players (22 male, 3 female, mean experience 2.4 y)
- In situ: Three matches each, open-ear wireless earphones, synchronized to video
- 31.1 hours of audio in total; 8 900 self-talk utterances (3.9 h), comprising negative (26%), positive (16%), and other (58%) self-talk with mean durations of 2.0 s, 1.6 s, and 1.4 s, respectively
- Labels: Assigned through manual triage using timestamped audio-video alignment by two raters
3.2 Model Training
- Acoustic: Whisper-base, fine-tuned (batch 32, AdamW, LR 2e–5, early stopping)
- LAEA: MLP, Adam, LR 1e–4
- Linguistic: Whisper-large ASR + fine-tuned Korean BERT (batch 64, LR 2e–5)
- Fusion: batch 64, Adam, LR 1e–4
- Validation: Leave-One-Subject-Out (LOSO) cross-validation (a minimal evaluation loop is sketched after this list)
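An illustrative LOSO evaluation loop using scikit-learn's `LeaveOneGroupOut`; `train_model` and `evaluate_macro_f1` are hypothetical placeholders for the pipeline's actual training and scoring code.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

def loso_macro_f1(features, labels, subject_ids, train_model, evaluate_macro_f1):
    """Train on all subjects but one, test on the held-out subject, average Macro-F1."""
    scores = []
    splitter = LeaveOneGroupOut()
    for train_idx, test_idx in splitter.split(features, labels, groups=subject_ids):
        model = train_model(features[train_idx], labels[train_idx])
        scores.append(evaluate_macro_f1(model, features[test_idx], labels[test_idx]))
    return float(np.mean(scores))
```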
3.3 Performance Metrics and Baselines
Overall Macro-F₁ (test, LOSO):

| Model | Macro-F₁ |
|---|---|
| Acoustic only | 0.80 |
| Linguistic only | 0.70 |
| Fusion only | 0.82 |
| MutterMeter | 0.84 |
- By class: Negative: F₁=0.84, Positive: F₁=0.77, Other: F₁=0.91
- Baseline comparison: LLM- and SER-based methods (Emotion2Vec, TweetNLP, Gemini) all scored substantially below MutterMeter (≤0.69)
- Ablations: LAEA and adaptive fusion yield +5–10% F₁ improvement on positive self-talk and overall
3.4 System Latency
- Preprocessing: 21 ms per 100 ms audio
- Acoustic: 2 015 ms per utterance (majority exit)
- Linguistic: 4 298 ms (minority invoked)
- Fusion: 0.8 ms
- Hierarchical early exits: Reduce average utterance latency from 6 335 ms to 3 713 ms (–41%); a back-of-the-envelope reconstruction of this figure follows the list
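A back-of-the-envelope reconstruction of the reported average: the per-stage times come from the list above, fusion cost is lumped with the linguistic stage (it is negligible at 0.8 ms), and the share of utterances that proceed past the acoustic stage is back-solved here rather than taken from the paper.

```python
# Per-stage latencies reported above (ms per utterance).
T_ACOUSTIC, T_LINGUISTIC, T_FUSION = 2015.0, 4298.0, 0.8
T_FULL = 6335.0        # average without early exits
T_WITH_EXITS = 3713.0  # reported average with early exits

# Residual overhead (preprocessing etc.) implied by the full-pipeline figure.
t_overhead = T_FULL - (T_ACOUSTIC + T_LINGUISTIC + T_FUSION)

# Fraction of utterances that must proceed past the acoustic stage to match
# the reported average with early exits.
p_continue = (T_WITH_EXITS - t_overhead - T_ACOUSTIC) / (T_LINGUISTIC + T_FUSION)
print(f"implied share reaching the linguistic stage ≈ {p_continue:.0%}")  # ≈ 39%
```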
4. Limitations and Areas for Improvement
- Noise Sensitivity: Errors driven by low SNR or overlapping background audio; ambiguous utterances without strong linguistic or prosodic cues may confound classification.
- ASR Reliability: Short or mumbled utterances are transcription-challenged, reducing the effectiveness of the linguistic stage.
- Computation: Acoustic stage incurs ~2 s per utterance latency on-device; continuous operation may impact battery life. ASR offload reduces local computation but introduces privacy and connectivity dependencies.
Potential enhancements identified include deployment of pruned or lightweight ASR on-device, user adaptation via speaker-specific fine-tuning, integration of non-audio physiological signals (accelerometer, PPG), support for silent self-talk via bone conduction or EOG, and generalization to non-sports domains (Lee et al., 10 Nov 2025).
5. Heart Rate and Murmur Detection from PCG: Relation to MutterMeter
A second line of research led by Nie et al. references “MutterMeter” as a framework for combined heart rate estimation and murmur detection in digital stethoscope use cases (Nie et al., 25 Jul 2024). Here, high-fidelity heart sounds (PCG, resampled to 16 kHz, filtered at 2 kHz) are segmented using annotated S1/S2 boundaries. Features extracted per 5 s snippet include Mel spectrogram (40 bands), MFCC (40 coeffs), power spectral density (PSD, Welch), and RMS energy.
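A minimal sketch of the per-snippet feature extraction, assuming librosa and SciPy defaults for frame sizes; resizing and stacking the four feature maps into the 4-channel input image is omitted.

```python
import numpy as np
import librosa
from scipy.signal import welch

SR = 16_000      # PCG resampled to 16 kHz (source)
SNIPPET_S = 5    # 5-second snippets (source)

def pcg_features(snippet: np.ndarray) -> dict:
    """Compute the four feature types used as input channels for one snippet."""
    assert len(snippet) == SR * SNIPPET_S, "expects a 5 s snippet"
    mel = librosa.feature.melspectrogram(y=snippet, sr=SR, n_mels=40)
    mfcc = librosa.feature.mfcc(y=snippet, sr=SR, n_mfcc=40)
    _, psd = welch(snippet, fs=SR)                  # Welch power spectral density
    rms = librosa.feature.rms(y=snippet)[0]         # frame-level RMS energy
    return {"mel": librosa.power_to_db(mel), "mfcc": mfcc, "psd": psd, "rms": rms}
```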
5.1 2D-CNN and Multi-Task Learning
- Input: Four-channel feature “images” (Mel∥MFCC∥PSD∥RMS, shape 4×time×freq)
- CNN Architecture: Five convolutional layers (3×3, ReLU, max-pool), followed by a flattening layer and two parallel heads: one for heart rate classification (141-class, 40–180 bpm), one for murmur detection (binary, sigmoid); a sketch follows this list
- Training: Weighted cross-entropy for HR, weighted BCE for murmur, Adam optimizer (initial LR=1e–3, batch 16, 100 epochs, learning-rate schedulers triggered by val MAE < 2 bpm)
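A PyTorch sketch of the two-head 2D-CNN described above; channel widths are assumptions, and global average pooling stands in for the paper's flattening layer to keep the sketch input-size agnostic.

```python
import torch
import torch.nn as nn

class PCGMultiTaskCNN(nn.Module):
    """Five conv blocks shared by a 141-class heart-rate head and a binary murmur head."""

    def __init__(self, in_channels: int = 4, hr_classes: int = 141):
        super().__init__()
        widths = [16, 32, 64, 128, 128]            # assumed channel widths
        blocks, prev = [], in_channels
        for w in widths:                           # conv 3x3 -> ReLU -> max-pool, five times
            blocks += [nn.Conv2d(prev, w, kernel_size=3, padding=1),
                       nn.ReLU(inplace=True),
                       nn.MaxPool2d(2)]
            prev = w
        self.backbone = nn.Sequential(*blocks)
        self.pool = nn.AdaptiveAvgPool2d(1)        # stands in for the flattening layer
        self.hr_head = nn.Linear(prev, hr_classes) # 40-180 bpm in 1-bpm bins
        self.murmur_head = nn.Linear(prev, 1)      # binary murmur output

    def forward(self, x: torch.Tensor):
        z = torch.flatten(self.pool(self.backbone(x)), start_dim=1)
        return self.hr_head(z), torch.sigmoid(self.murmur_head(z))
```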
5.2 Performance and Benchmarks
| Model | HR MAE (bpm) | Murmur ACC (%) |
|---|---|---|
| PSD-only baseline | 8.28 | n/a |
| TCNN-LSTM baseline | 1.63 | n/a |
| 2D-CNN (single-task) | 1.312 | n/a |
| 2D-CNN-MTL (multi-task) | 1.636 | 97.5 |
Heart-rate estimation meets the AAMI accuracy requirement (error within 10% or 5 bpm), and murmur detection reaches 97.5% accuracy. Using all four feature types together outperforms any subset. Notable limitations include the lack of explicit noise reduction, degraded accuracy at extreme heart rates, the need for 5 s windows (shortening to 3 s increases MAE to ~3.3 bpm), and the treatment of heart rate as classification rather than regression.
6. Broader Significance and Application Domains
The MutterMeter systems collectively establish end-to-end pipelines for automatic physiological and psychological state measurement—either through murmured self-talk or acoustic cardiology. Both approaches emphasize:
- The need for robust acoustic feature engineering coupled with neural representation learning.
- The value of hierarchical or multi-task classification to mediate ambiguous, context-dependent audio.
- Practical trade-offs (on-device versus offloaded computation, latency, and battery demand) that condition real-world feasibility.
A plausible implication is that architectural motifs, such as hierarchical gating and multi-modal fusion, are likely extensible to other in situ bio-acoustic monitoring domains where events are of ambiguous or irregular structure. Additionally, the use of sliding window segmentation, adaptive fusion, and context caches emerges as a recurrent theme in the technical design of both MutterMeter variants.
7. References
- Lee et al. (2025). Enabling Automatic Self-Talk Detection via Earables.
- Nie et al. (2024). Model-driven Heart Rate Estimation and Heart Murmur Detection based on Phonocardiogram.