Do EER/WER successes predict downstream speech generation performance?

Determine whether speaker anonymization systems that achieve strong performance according to equal error rate (EER) and word error rate (WER) metrics also excel when their anonymized speech is used as training data for downstream speech generation tasks.

Background

Speaker anonymization research, particularly within the VoicePrivacy Challenge framework, primarily evaluates systems with two metrics: equal error rate (EER) for privacy and word error rate (WER) for utility. EER is measured against an automatic speaker verification (ASV) system, where a higher value means the anonymized speaker is harder to re-identify. WER is measured against an automatic speech recognition (ASR) system, where a lower value means the linguistic content is better preserved.
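To make the two metrics concrete, here is a minimal sketch of how each could be computed from raw evaluation outputs. It is not the VoicePrivacy Challenge evaluation code: the function names `word_error_rate` and `equal_error_rate`, the whitespace tokenization, and the brute-force threshold sweep are all assumptions chosen for illustration.

```python
from typing import Sequence


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words.

    Assumes simple whitespace tokenization (an illustrative choice).
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Levenshtein distance over words, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def equal_error_rate(
    target_scores: Sequence[float], nontarget_scores: Sequence[float]
) -> float:
    """Approximate EER by sweeping every observed score as a threshold.

    target_scores: ASV scores for same-speaker trials.
    nontarget_scores: ASV scores for different-speaker trials.
    A trial is accepted when its score is >= the threshold.
    """
    if not target_scores or not nontarget_scores:
        raise ValueError("both score lists must be non-empty")

    best_gap = float("inf")
    eer = 0.0
    for threshold in sorted(set(target_scores) | set(nontarget_scores)):
        # False acceptance: a different-speaker trial is accepted.
        far = sum(s >= threshold for s in nontarget_scores) / len(nontarget_scores)
        # False rejection: a same-speaker trial is rejected.
        frr = sum(s < threshold for s in target_scores) / len(target_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap = gap
            eer = (far + frr) / 2
    return eer


if __name__ == "__main__":
    # One substitution and one deletion against six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
    # Toy scores in which one trial of each type overlaps the other class.
    print(equal_error_rate([0.9, 0.8, 0.7, 0.4], [0.1, 0.2, 0.3, 0.6]))
```

With the toy inputs above, the WER is 2/6 and the EER is 0.25. Both functions score an anonymization system only against an ASV or ASR model; neither looks at how a generative model trained on the anonymized speech behaves, which is the gap the problem statement points to.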

However, whether strong performance on these metrics translates to effective training data for generative speech models (e.g., text-to-speech) has not been established. Understanding this relationship is crucial for safely leveraging anonymized datasets to train large-scale speech generation systems without compromising privacy.

References

On the other hand, evaluating these SA systems in the context of speech generation model training has not yet been investigated, and it is unknown whether an SA system that performs well in terms of EER and WER can also excel in the downstream speech generation task.

— Multi-speaker Text-to-speech Training with Speaker Anonymized Data (2405.11767 - Huang et al., 2024) in Section 1 (Introduction)
