Confidence-Based Filtering for Speech Enhancement
- Confidence-based filtering is a non-intrusive method that leverages internal model confidence measures from token outputs to assess speech quality.
- It computes aggregated token-level log probabilities to identify subtle hallucinations such as phoneme omissions, speaker inconsistencies, and content corruption.
- The approach enhances downstream TTS performance by curating high-fidelity speech datasets while balancing the trade-off between quality and data volume.
Confidence-based filtering is a non-intrusive error-mitigation paradigm that uses internal confidence measures derived from generative speech enhancement (GSE) models (specifically those operating in a discrete tokenization regime) to identify and remove outputs likely to exhibit hallucination errors. This approach addresses a major challenge in curating large-scale speech corpora enhanced by GSE models: traditional non-intrusive metrics such as estimated mean opinion scores (MOS) or automatic speech recognition (ASR) confidence often fail to detect the subtle content-level or speaker-consistency errors inherent to generative modeling of speech (Yamauchi et al., 18 Jan 2026).
1. Background and Motivation
Generative speech enhancement models, especially those based on neural discrete tokenization (e.g., DAC, RVQ-based audio codecs), can regenerate high-quality “clean” speech from noisy input via a learned conditional distribution over token sequences (Yamauchi et al., 18 Jan 2026). However, these models may hallucinate: omitting phonemes, inserting spurious content, or drifting speaker identity. Such errors are correlated with low semantic or acoustic faithfulness but may still yield acoustically plausible outputs, passing standard non-intrusive perceptual metrics.
Confidence-based filtering directly targets this failure of conventional filtering (e.g., UTMOS, DNSMOS, ASR token confidence, CTC score), seeking internally valid metrics that capture model uncertainty or self-assessment at the utterance level (Yamauchi et al., 18 Jan 2026).
2. Methodological Principles
The core methodology is to extract and aggregate token-level log-probabilities from the generative model during inference:
For a GSE model $p_\theta$ conditioned on input features $x$ (derived from the noisy speech), generating a discrete token sequence $y_{1:T}$:
- Token-level confidence (first RVQ layer): $\ell_t = \log p_\theta(y_t \mid y_{<t}, x)$
- Utterance-level confidence: $\bar{\ell} = \frac{1}{T} \sum_{t=1}^{T} \ell_t$
A low $\bar{\ell}$ flags utterances where the model is uncertain about its own output, indicating potential content or speaker hallucination.
Filtering proceeds by ranking utterances according to $\bar{\ell}$ and discarding those below a threshold (or outside a given acceptance rate), yielding a curated subset presumed to have higher semantic and acoustic fidelity (Yamauchi et al., 18 Jan 2026).
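The aggregation and ranking described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes per-token log-probabilities have already been collected from the GSE model's first RVQ layer during generation.

```python
import numpy as np

def utterance_confidence(token_logprobs):
    """Aggregate token-level log-probabilities into a single
    utterance-level confidence score (mean log-probability)."""
    return float(np.mean(token_logprobs))

def filter_by_acceptance_rate(utterances, confidences, accept_rate=0.8):
    """Keep the top `accept_rate` fraction of utterances by confidence."""
    order = np.argsort(confidences)[::-1]          # highest confidence first
    n_keep = max(1, round(len(order) * accept_rate))
    keep = order[:n_keep]
    return [utterances[i] for i in sorted(keep)]   # preserve corpus order

# Toy example: three utterances with synthetic per-token log-probs.
logps = [[-0.1, -0.2, -0.15], [-2.3, -1.9, -2.5], [-0.3, -0.4, -0.2]]
scores = [utterance_confidence(lp) for lp in logps]
kept = filter_by_acceptance_rate(["utt0", "utt1", "utt2"], scores, accept_rate=0.66)
```

Here the low-confidence second utterance (a stand-in for a hallucinated output) is discarded, while the other two are retained.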
3. Hallucination Detection and Failure Modes of Baseline Metrics
Discrete token–based GSE models can err in three primary ways:
- Phoneme omission: missing critical phonetic segments, reducing intelligibility.
- Speaker inconsistency: timbral or identity drift, especially over long utterances or low-SNR regions.
- Content corruption: spurious non-speech artifacts or wrong phoneme insertions.
Traditional metrics (e.g., UTMOS, DNSMOS, Whisper ASR token probability, CTC score) are agnostic to reference signals and rate acoustic naturalness or broad intelligibility. They are empirically demonstrated to assign high scores even to utterances with severe underlying hallucinations (e.g., missing phonemes or wrong speaker) (Yamauchi et al., 18 Jan 2026). Thus, reliance on these metrics results in poor filtering of problematic outputs when curating datasets for downstream applications like TTS.
4. Quantitative Effectiveness and Correlation Analysis
The Genhancer framework (Yamauchi et al., 18 Jan 2026) provides a comprehensive experimental validation:
- Correlation analysis on EARS-WHAM:
- Genhancer confidence achieves the strongest Spearman rank correlations with both PESQ (signal fidelity) and SpeechBERTScore (linguistic fidelity), outperforming DNSMOS-, UTMOS-, ASR-token-, and CTC-based baselines.
- Filtering efficacy:
- As the acceptance rate is decreased (more aggressive filtering), average intrusive metrics (e.g., ESTOI, SI-SDR, PESQ, SpeechBERTScore, LPS, speaker similarity) improve, and WER falls, much more steeply under Genhancer confidence-based selection than under competing methods.
- Downstream TTS task:
- Training TTS on datasets filtered by the utterance-level confidence rather than DNSMOS/UTMOS yields higher naturalness (UTMOS up to 3.80 at an 80% acceptance rate) and lower WER (down to 18.14%) on the synthesized test set (Yamauchi et al., 18 Jan 2026).
- Aggressive filtering (10% retained) leads to diminishing returns due to data scarcity, suggesting a quality–quantity trade-off.
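The kind of rank-correlation analysis reported above can be reproduced in miniature. The sketch below uses entirely synthetic data (a stand-in "quality" score and a noisy monotone confidence proxy, not the paper's measurements) and a numpy-only Spearman implementation via double `argsort` ranking.

```python
import numpy as np

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks.
    (Valid for tie-free data, which holds for continuous samples.)"""
    rx = np.argsort(np.argsort(x))
    ry = np.argsort(np.argsort(y))
    return float(np.corrcoef(rx, ry)[0, 1])

rng = np.random.default_rng(0)
# Synthetic "true quality" (stand-in for an intrusive metric like PESQ)
quality = rng.uniform(1.0, 4.5, size=200)
# A confidence proxy that tracks quality monotonically, plus noise
confidence = -1.0 / quality + 0.05 * rng.normal(size=200)
rho = spearman(confidence, quality)
```

A confidence measure that correlates this strongly with an intrusive reference metric is exactly what makes reference-free filtering viable.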
5. Operational Workflow
A prototypical confidence-based filtering pipeline comprises:
- Enhance the entire corpus with a discrete-token GSE model (e.g., Genhancer).
- For each utterance, compute the utterance-level confidence from the model’s generation-time token log-probabilities.
- Rank or threshold outputs by this confidence, discarding low-confidence utterances.
- Assemble the filtered set for downstream training or evaluation (e.g., TTS model training).
No clean reference is required at any stage, and the method is applicable even in in-the-wild settings (Yamauchi et al., 18 Jan 2026).
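The four-step workflow above can be sketched end to end. The `gse_model.enhance` interface below is hypothetical (an assumed API returning an enhanced waveform plus its token log-probabilities); the stub model exists only to make the sketch runnable.

```python
import numpy as np

def curate(noisy_paths, gse_model, accept_rate=0.8):
    """Enhance a corpus, score each output by mean token log-probability,
    and keep the top `accept_rate` fraction for downstream training."""
    scored = []
    for path in noisy_paths:
        wav, token_logprobs = gse_model.enhance(path)   # hypothetical API
        scored.append((float(np.mean(token_logprobs)), path, wav))
    scored.sort(key=lambda t: t[0], reverse=True)       # best first
    n_keep = max(1, round(len(scored) * accept_rate))
    return [(path, wav) for _, path, wav in scored[:n_keep]]

class StubGSE:
    """Toy stand-in for a discrete-token GSE model."""
    _logps = {"a.wav": [-0.1, -0.2], "b.wav": [-3.0, -2.5], "c.wav": [-0.5, -0.4]}
    def enhance(self, path):
        return np.zeros(16000), self._logps[path]  # (waveform, log-probs)

kept = curate(["a.wav", "b.wav", "c.wav"], StubGSE(), accept_rate=0.67)
```

Note that nothing in the loop touches a clean reference signal; the score comes entirely from the model's own generation-time statistics.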
6. Comparison with Related Paradigms and Generalization
Confidence-based filtering differs fundamentally from post-hoc ASR evaluation or black-box MOS estimation by leveraging the generative model’s own uncertainty signals, which are highly predictive of both non-intrusive (DNSMOS, UTMOS) and intrusive (PESQ, ESTOI, SpeechBERTScore, LPS, WER, speaker similarity) metrics (Yamauchi et al., 18 Jan 2026). It fills a critical gap by identifying hallucinations that are invisible to both non-intrusive scoring and ASR-based filtering.
Although the current paradigm is instantiated in discrete token–based models, the concept is extendable to continuous-latent GSE frameworks. For instance, one may adapt token confidence estimation to likelihoods or “logit-confidence” in diffusion or flow-based GSE (Yamauchi et al., 18 Jan 2026).
7. Practical Implications, Limitations, and Future Extensions
Applying confidence-based filtering to generatively enhanced speech data yields demonstrable benefits in downstream synthesis tasks, providing both higher subjective naturalness and objective intelligibility, all without requiring reference clean speech (Yamauchi et al., 18 Jan 2026).
Potential limitations include:
- The restriction to discrete-token GSE models, limiting direct applicability to continuous-latent methods (though extension is plausible).
- The need to carefully tune the filtering threshold to balance data quality and quantity—over-filtering reduces training data volume and may hurt robustness.
- Assumption that token log-probability is a reliable indicator in all generative architectures; this may require revalidation if new model designs emerge.
Future work may involve:
- Extending the core confidence estimation concept to diffusion or flow-matching GSE (e.g., using negative log-likelihood or logit-margin as a substitute).
- Incorporating adaptive thresholding based on downstream metric feedback.
- Applying confidence-based filtering in more diverse and truly low-resource settings, including multilingual and cross-domain scenarios (Yamauchi et al., 18 Jan 2026).
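The adaptive-thresholding idea listed above can be sketched as a simple sweep: try several acceptance rates and keep the one whose retained subset scores best on a downstream signal. The `downstream_metric` callback is a hypothetical stand-in (e.g., validation naturalness of a TTS model trained on the retained utterances), and the peaked toy metric below merely mimics the quality-quantity trade-off.

```python
import numpy as np

def pick_acceptance_rate(confidences, downstream_metric,
                         rates=(0.1, 0.3, 0.5, 0.8, 1.0)):
    """Sweep candidate acceptance rates; return the rate whose retained
    subset maximizes `downstream_metric(kept_indices)` (hypothetical
    callback, e.g. downstream TTS validation score)."""
    order = np.argsort(confidences)[::-1]  # highest confidence first
    best_rate, best_score = None, -np.inf
    for r in rates:
        keep = order[: max(1, round(len(order) * r))]
        score = downstream_metric(keep)
        if score > best_score:
            best_rate, best_score = r, score
    return best_rate, best_score

# Toy trade-off: the metric peaks when exactly half the corpus is kept,
# mimicking over-filtering (too little data) vs. under-filtering (noisy data).
confs = list(range(10))
best_rate, best_score = pick_acceptance_rate(
    confs, lambda keep: -abs(len(keep) - 5))
```

In practice each candidate rate is expensive to evaluate (it implies a downstream training run), so a coarse grid like the one above is a realistic starting point.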
References:
- "Confidence-based Filtering for Speech Dataset Curation with Generative Speech Enhancement Using Discrete Tokens" (Yamauchi et al., 18 Jan 2026)