Confidence-Based Filtering for Speech Enhancement
- Confidence-based filtering is a non-intrusive method that leverages internal model confidence measures from token outputs to assess speech quality.
- It computes aggregated token-level log probabilities to identify subtle hallucinations such as phoneme omissions, speaker inconsistencies, and content corruption.
- The approach enhances downstream TTS performance by curating high-fidelity speech datasets while balancing the trade-off between quality and data volume.
Confidence-based filtering is a non-intrusive error-mitigation paradigm that uses internal confidence measures derived from generative speech enhancement (GSE) models (specifically those operating in a discrete tokenization regime) to identify and remove outputs likely to exhibit hallucination errors. This approach addresses a major challenge in curating large-scale speech corpora enhanced by GSE models: traditional non-intrusive metrics such as estimated mean opinion scores (MOS) or automatic speech recognition (ASR) confidence often fail to detect the subtle content-level or speaker-consistency errors inherent to generative modeling of speech (Yamauchi et al., 18 Jan 2026).
1. Background and Motivation
Generative speech enhancement models, especially those based on neural discrete tokenization (e.g., DAC, RVQ-based audio codecs), can regenerate high-quality “clean” speech from noisy input via a learned conditional distribution over token sequences (Yamauchi et al., 18 Jan 2026). However, these models may hallucinate: omitting phonemes, inserting spurious content, or drifting speaker identity. Such errors are correlated with low semantic or acoustic faithfulness but may still yield acoustically plausible outputs, passing standard non-intrusive perceptual metrics.
Confidence-based filtering directly targets this failure of conventional filtering (e.g., UTMOS, DNSMOS, ASR token confidence, CTC score), seeking internally valid metrics that capture model uncertainty or self-assessment at the utterance level (Yamauchi et al., 18 Jan 2026).
2. Methodological Principles
The core methodology is to extract and aggregate token-level log-probabilities from the generative model during inference:
For a GSE model $p_\theta$ conditioned on input features $x$ (derived from the noisy speech), generating a discrete token sequence $y_{1:T}$:
- Token-level confidence (first RVQ layer): $\ell_t = \log p_\theta(y_t \mid y_{<t}, x)$
- Utterance-level confidence: $\bar{\ell} = \frac{1}{T} \sum_{t=1}^{T} \ell_t$
A low $\bar{\ell}$ flags utterances where the model is uncertain about its own output, indicating potential content or speaker hallucination.
Filtering proceeds by ranking utterances according to $\bar{\ell}$ and discarding those below a threshold (or outside a given acceptance rate), yielding a curated subset presumed to have higher semantic and acoustic fidelity (Yamauchi et al., 18 Jan 2026).
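The aggregation and ranking described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes per-token log-probabilities have already been collected from the GSE model's first RVQ layer during generation.

```python
import numpy as np

def utterance_confidence(token_logprobs):
    """Aggregate token-level log-probabilities into a single
    utterance-level confidence score (mean log-probability)."""
    return float(np.mean(token_logprobs))

def filter_by_acceptance_rate(utterances, confidences, accept_rate=0.8):
    """Keep the top `accept_rate` fraction of utterances by confidence."""
    order = np.argsort(confidences)[::-1]          # highest confidence first
    n_keep = max(1, round(len(order) * accept_rate))
    keep = order[:n_keep]
    return [utterances[i] for i in sorted(keep)]   # preserve corpus order

# Toy example: three utterances with synthetic per-token log-probs.
logps = [[-0.1, -0.2, -0.15], [-2.3, -1.9, -2.5], [-0.3, -0.4, -0.2]]
scores = [utterance_confidence(lp) for lp in logps]
kept = filter_by_acceptance_rate(["utt0", "utt1", "utt2"], scores, accept_rate=0.66)
```

Here the low-confidence second utterance (a stand-in for a hallucinated output) is discarded, while the other two are retained.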
3. Hallucination Detection and Failure Modes of Baseline Metrics
Discrete token–based GSE models can err in three primary ways:
- Phoneme omission: missing critical phonetic segments, reducing intelligibility.
- Speaker inconsistency: timbral or identity drift, especially over long utterances or low-SNR regions.
- Content corruption: spurious non-speech artifacts or wrong phoneme insertions.
Traditional metrics (e.g., UTMOS, DNSMOS, Whisper ASR token probability, CTC score) are agnostic to reference signals and rate acoustic naturalness or broad intelligibility. They are empirically demonstrated to assign high scores even to utterances with severe underlying hallucinations (e.g., missing phonemes or wrong speaker) (Yamauchi et al., 18 Jan 2026). Thus, reliance on these metrics results in poor filtering of problematic outputs when curating datasets for downstream applications like TTS.
4. Quantitative Effectiveness and Correlation Analysis
The Genhancer framework (Yamauchi et al., 18 Jan 2026) provides a comprehensive experimental validation:
- Correlation analysis on EARS-WHAM:
- Genhancer confidence achieves the strongest Spearman rank correlations with both PESQ (signal fidelity) and SpeechBERTScore (linguistic fidelity), outperforming DNSMOS-, UTMOS-, ASR-token-, and CTC-based baselines.
- Filtering efficacy:
- As the acceptance rate is decreased (more aggressive filtering), average intrusive metrics (e.g., ESTOI, SI-SDR, PESQ, SpeechBERTScore, LPS, speaker similarity) improve, and WER falls, much more steeply under Genhancer confidence-based selection than under competing methods.
- Downstream TTS task:
- Training TTS on datasets filtered by the utterance-level confidence rather than DNSMOS/UTMOS yields higher naturalness (UTMOS up to 3.80 at an 80% acceptance rate) and lower WER (down to 18.14%) on the synthesized test set (Yamauchi et al., 18 Jan 2026).
- Aggressive filtering (10% retained) leads to diminishing returns due to data scarcity, suggesting a quality–quantity trade-off.
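The kind of rank-correlation analysis reported above can be reproduced in miniature. The sketch below uses entirely synthetic data (a stand-in "quality" score and a noisy monotone confidence proxy, not the paper's measurements) and a numpy-only Spearman implementation via double `argsort` ranking.

```python
import numpy as np

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks.
    (Valid for tie-free data, which holds for continuous samples.)"""
    rx = np.argsort(np.argsort(x))
    ry = np.argsort(np.argsort(y))
    return float(np.corrcoef(rx, ry)[0, 1])

rng = np.random.default_rng(0)
# Synthetic "true quality" (stand-in for an intrusive metric like PESQ)
quality = rng.uniform(1.0, 4.5, size=200)
# A confidence proxy that tracks quality monotonically, plus noise
confidence = -1.0 / quality + 0.05 * rng.normal(size=200)
rho = spearman(confidence, quality)
```

A confidence measure that correlates this strongly with an intrusive reference metric is exactly what makes reference-free filtering viable.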
5. Operational Workflow
A prototypical confidence-based filtering pipeline comprises:
- Enhance the entire corpus with a discrete-token GSE model (e.g., Genhancer).
- For each utterance, compute the utterance-level confidence from the model’s generation-time token log-probabilities.
- Rank or threshold outputs by this confidence, discarding low-confidence utterances.
- Assemble the filtered set for downstream training or evaluation (e.g., TTS model training).
No clean reference is required at any stage, and the method is applicable even in in-the-wild settings (Yamauchi et al., 18 Jan 2026).
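The four-step workflow above can be sketched end to end. The `gse_model.enhance` interface below is hypothetical (an assumed API returning an enhanced waveform plus its token log-probabilities); the stub model exists only to make the sketch runnable.

```python
import numpy as np

def curate(noisy_paths, gse_model, accept_rate=0.8):
    """Enhance a corpus, score each output by mean token log-probability,
    and keep the top `accept_rate` fraction for downstream training."""
    scored = []
    for path in noisy_paths:
        wav, token_logprobs = gse_model.enhance(path)   # hypothetical API
        scored.append((float(np.mean(token_logprobs)), path, wav))
    scored.sort(key=lambda t: t[0], reverse=True)       # best first
    n_keep = max(1, round(len(scored) * accept_rate))
    return [(path, wav) for _, path, wav in scored[:n_keep]]

class StubGSE:
    """Toy stand-in for a discrete-token GSE model."""
    _logps = {"a.wav": [-0.1, -0.2], "b.wav": [-3.0, -2.5], "c.wav": [-0.5, -0.4]}
    def enhance(self, path):
        return np.zeros(16000), self._logps[path]  # (waveform, log-probs)

kept = curate(["a.wav", "b.wav", "c.wav"], StubGSE(), accept_rate=0.67)
```

Note that nothing in the loop touches a clean reference signal; the score comes entirely from the model's own generation-time statistics.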
6. Comparison with Related Paradigms and Generalization
Confidence-based filtering differs fundamentally from post-hoc ASR evaluation or black-box MOS estimation by leveraging the generative model’s own uncertainty signals, which are highly predictive of both non-intrusive (DNSMOS, UTMOS) and intrusive (PESQ, ESTOI, SpeechBERTScore, LPS, WER, speaker similarity) metrics (Yamauchi et al., 18 Jan 2026). It fills a critical gap by identifying hallucinations that are invisible to both non-intrusive scoring and ASR-based filtering.
Although the current paradigm is instantiated in discrete token–based models, the concept is extendable to continuous-latent GSE frameworks. For instance, one may adapt token confidence estimation to likelihoods or “logit-confidence” in diffusion or flow-based GSE (Yamauchi et al., 18 Jan 2026).
7. Practical Implications, Limitations, and Future Extensions
Applying confidence-based filtering to generatively enhanced speech data yields demonstrable benefits in downstream synthesis tasks, providing both higher subjective naturalness and objective intelligibility, all without requiring reference clean speech (Yamauchi et al., 18 Jan 2026).
Potential limitations include:
- The restriction to discrete-token GSE models, limiting direct applicability to continuous-latent methods (though extension is plausible).
- The need to carefully tune the filtering threshold to balance data quality and quantity—over-filtering reduces training data volume and may hurt robustness.
- Assumption that token log-probability is a reliable indicator in all generative architectures; this may require revalidation if new model designs emerge.
Future work may involve:
- Extending the core confidence estimation concept to diffusion or flow-matching GSE (e.g., using negative log-likelihood or logit-margin as a substitute).
- Incorporating adaptive thresholding based on downstream metric feedback.
- Applying confidence-based filtering in more diverse and truly low-resource settings, including multilingual and cross-domain scenarios (Yamauchi et al., 18 Jan 2026).
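The adaptive-thresholding idea listed above can be sketched as a simple sweep: try several acceptance rates and keep the one whose retained subset scores best on a downstream signal. The `downstream_metric` callback is a hypothetical stand-in (e.g., validation naturalness of a TTS model trained on the retained utterances), and the peaked toy metric below merely mimics the quality-quantity trade-off.

```python
import numpy as np

def pick_acceptance_rate(confidences, downstream_metric,
                         rates=(0.1, 0.3, 0.5, 0.8, 1.0)):
    """Sweep candidate acceptance rates; return the rate whose retained
    subset maximizes `downstream_metric(kept_indices)` (hypothetical
    callback, e.g. downstream TTS validation score)."""
    order = np.argsort(confidences)[::-1]  # highest confidence first
    best_rate, best_score = None, -np.inf
    for r in rates:
        keep = order[: max(1, round(len(order) * r))]
        score = downstream_metric(keep)
        if score > best_score:
            best_rate, best_score = r, score
    return best_rate, best_score

# Toy trade-off: the metric peaks when exactly half the corpus is kept,
# mimicking over-filtering (too little data) vs. under-filtering (noisy data).
confs = list(range(10))
best_rate, best_score = pick_acceptance_rate(
    confs, lambda keep: -abs(len(keep) - 5))
```

In practice each candidate rate is expensive to evaluate (it implies a downstream training run), so a coarse grid like the one above is a realistic starting point.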
References:
- "Confidence-based Filtering for Speech Dataset Curation with Generative Speech Enhancement Using Discrete Tokens" (Yamauchi et al., 18 Jan 2026)