- The paper introduces the NaturalVoices dataset, a corpus of over 3,800 hours of spontaneous and emotional speech intended for voice conversion (VC) research.
- It presents an automatic processing pipeline that combines deep learning models for diarization, automatic speech recognition (ASR), speaker recognition, and emotion detection.
- A VC model trained on NaturalVoices reaches higher speaker similarity than one trained on VCTK (93.65% in the seen-to-seen setting) and receives high MOS ratings, supporting the dataset’s usefulness for realistic VC applications.
Overview of "Towards Naturalistic Voice Conversion: NaturalVoices Dataset with an Automatic Processing Pipeline"
The authors of "Towards Naturalistic Voice Conversion: NaturalVoices Dataset with an Automatic Processing Pipeline" introduce a new dataset, NaturalVoices, and an accompanying automatic processing pipeline, both aimed at making voice conversion (VC) output more natural. The work addresses a gap in existing VC datasets, which consist mostly of read or acted speech and fail to capture the spontaneity and diversity of real-life conversations.
Key Contributions
- NaturalVoices Dataset:
  - The dataset comprises over 3,800 hours of spontaneous, expressive, and emotional speech extracted from the raw podcast recordings behind the MSP-Podcast dataset.
  - It captures rich emotional expressions, nonverbal vocal cues, and various background sounds, providing a realistic foundation for developing VC models.
- Automatic Data-Sourcing Pipeline:
  - The authors propose a pipeline that applies state-of-the-art deep learning models to several speech tasks: diarization, ASR, speaker recognition, and speech emotion recognition.
  - The pipeline extracts and annotates transcripts, speaker details, signal-to-noise ratio (SNR), emotion attributes, and a broad range of sound events. These annotations make it straightforward to filter the data for different applications.
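The annotation-based filtering described above can be illustrated with a minimal Python sketch. The `Segment` record, its field names, and the example values are hypothetical and chosen only for illustration; they are not the dataset's actual schema or contents.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Set, Tuple


@dataclass
class Segment:
    # Hypothetical per-segment record; field names are assumptions,
    # not the real NaturalVoices metadata format.
    segment_id: str
    speaker_id: str
    transcript: str
    snr_db: float
    emotion: str
    sound_events: Tuple[str, ...] = ()


def filter_segments(
    segments: Iterable[Segment],
    min_snr_db: Optional[float] = None,
    emotions: Optional[Set[str]] = None,
    exclude_events: Optional[Set[str]] = None,
) -> List[Segment]:
    """Keep segments that satisfy every filter that was supplied."""
    selected = []
    for seg in segments:
        if min_snr_db is not None and seg.snr_db < min_snr_db:
            continue  # too noisy for this use case
        if emotions is not None and seg.emotion not in emotions:
            continue  # emotion label not requested
        if exclude_events and any(e in exclude_events for e in seg.sound_events):
            continue  # contains an unwanted sound event
        selected.append(seg)
    return selected


# Made-up example records, for illustration only.
segments = [
    Segment("seg_001", "spk_a", "that was amazing", 24.5, "happy"),
    Segment("seg_002", "spk_b", "i am not sure", 8.0, "neutral", ("music",)),
    Segment("seg_003", "spk_a", "stop doing that", 19.0, "angry", ("laughter",)),
]

clean_expressive = filter_segments(
    segments, min_snr_db=15.0, emotions={"happy", "angry"}
)
print([s.segment_id for s in clean_expressive])
```

Each filter is optional, so the same function can select clean speech for training a VC model or deliberately noisy speech for robustness tests.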
Analysis and Evaluations
Comparison with Existing Datasets
- VCTK and ESD Datasets: A comparative analysis shows that NaturalVoices is significantly larger and more varied than either dataset. Unlike VCTK and ESD, NaturalVoices contains spontaneous speech and is annotated with additional metadata such as emotion categories and SNR levels.
Emotion Distribution and SNR Analysis
- The dataset covers a broad range of the emotional attribute space (arousal, dominance, valence) and has a higher prevalence of natural emotional expressions, as represented in Figures 1 and 2.
- SNR distribution analysis reveals NaturalVoices' capability to capture a variety of recording conditions, from challenging noisy environments to high-quality clear speech, making it a versatile dataset for developing robust VC models.
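As a reminder of what an SNR value expresses, the following Python sketch computes SNR in decibels as 10·log10(P_speech / P_noise) and groups values into coarse bands. It assumes separate speech and noise samples are available, which is a simplification: the summary does not say which SNR estimator the pipeline uses, and the band thresholds below are illustrative assumptions rather than values from the paper.

```python
import math
from typing import Sequence


def mean_power(samples: Sequence[float]) -> float:
    """Average power of a signal: the mean of its squared samples."""
    return sum(v * v for v in samples) / len(samples)


def snr_db(speech: Sequence[float], noise: Sequence[float]) -> float:
    """SNR in dB, defined as 10 * log10(P_speech / P_noise)."""
    p_speech = mean_power(speech)
    p_noise = mean_power(noise)
    if p_noise == 0:
        return math.inf  # no measurable noise
    if p_speech == 0:
        return -math.inf  # no measurable speech
    return 10.0 * math.log10(p_speech / p_noise)


def snr_band(value_db: float) -> str:
    """Map an SNR value to a coarse band (thresholds are assumptions)."""
    if value_db < 10.0:
        return "noisy"
    if value_db < 20.0:
        return "moderate"
    return "clean"


# A speech signal with ten times the amplitude of the noise
# has 100 times the power, i.e. an SNR of about 20 dB.
print(round(snr_db([1.0, -1.0], [0.1, -0.1]), 2))
print(snr_band(5.0), snr_band(15.0), snr_band(25.0))
```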
Experimental Results
VC experiments were conducted using the TriAAN-VC model, trained on both the NaturalVoices and VCTK datasets for comparison.
- Objective Evaluations:
  - Speaker Verification (SV): The model trained on NaturalVoices achieved higher speaker similarity scores (93.65% seen-to-seen) than the model trained on VCTK.
  - Word Error Rate (WER) and Character Error Rate (CER): Results on these metrics were comparable across the two training sets, indicating that training on NaturalVoices preserves the intelligibility of the speech content.
- Subjective Evaluations:
  - Mean Opinion Score (MOS): Listeners rated speech generated by the NaturalVoices-trained model highly on quality and intelligibility, supporting the practicality of the dataset for speech synthesis and VC tasks.
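The objective metrics above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's evaluation code: WER and CER are computed as Levenshtein edit distance divided by reference length (with spaces counted as characters for CER), and the SV score is assumed to be the fraction of converted utterances whose speaker embedding has cosine similarity to the target speaker's embedding at or above a threshold. The embeddings and the threshold in the example are made up.

```python
import math
from typing import List, Sequence, Tuple


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance between two sequences (row-by-row DP)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate; spaces count as characters (an assumption)."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def sv_acceptance_rate(
    pairs: List[Tuple[Sequence[float], Sequence[float]]], threshold: float
) -> float:
    """Fraction of (converted, target) embedding pairs accepted as same speaker."""
    accepted = sum(1 for conv, tgt in pairs
                   if cosine_similarity(conv, tgt) >= threshold)
    return accepted / len(pairs)


print(wer("the cat sat on the mat", "the cat sit on mat"))  # 2 errors / 6 words
print(sv_acceptance_rate([([1.0, 0.0], [1.0, 0.0]),
                          ([1.0, 0.0], [0.0, 1.0])], threshold=0.5))
```

In practice the hypothesis transcripts would come from an ASR system run on the converted speech, and the embeddings from a pretrained speaker verification model.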
Practical Implications and Future Directions
NaturalVoices opens new avenues for:
- Expressive Speech Synthesis: By providing a richly annotated, naturalistic speech dataset, it supports advanced expressive and emotional VC models.
- Noise Robustness: The dataset’s varied SNR levels promote the development of VC systems capable of handling diverse acoustic environments, enhancing real-world applicability.
- Modeling Spontaneous Speech: The spontaneous nature of the dataset aids in improving models for spontaneous speech generation and understanding.
Future work could explore extending the dataset to further enrich its emotional expressions and conversational dynamics, alongside leveraging the automatic processing pipeline for continuous dataset augmentation and refinement.
Conclusion
The NaturalVoices dataset and its data-sourcing pipeline advance VC research by providing a large-scale, spontaneous, and richly annotated speech corpus. The reported experiments suggest that training on this corpus can yield VC systems that produce natural, intelligible, and expressive speech, which makes NaturalVoices a promising resource for future speech synthesis and voice conversion work.