SpeechSplit2.0: Unsupervised Speech Disentanglement for Voice Conversion
The paper presents SpeechSplit2.0, an advancement of the original SpeechSplit model that targets unsupervised speech disentanglement for voice conversion. Speech disentanglement decomposes a complex speech signal into interpretable components such as content, rhythm, pitch, and timbre. These components matter not only for voice conversion but also for automatic speech recognition, speech synthesis, and emotion analysis.
SpeechSplit and Its Limitations
SpeechSplit used multiple autoencoders to disentangle speech into content, rhythm, pitch, and timbre, enabling aspect-specific voice conversion. Its main limitation was a dependence on careful tuning of the autoencoder bottlenecks: different datasets required different bottleneck configurations, which hurt scalability and adaptability. This tuning was labor-intensive and time-consuming, reducing the system's robustness and efficiency.
Introduction of SpeechSplit2.0
To address these limitations, the paper proposes SpeechSplit2.0, which applies signal processing techniques to the encoder inputs to control what information reaches each autoencoder. This eliminates the need for bottleneck tuning while sustaining disentanglement performance: by constraining the encoder inputs with pre-processing rather than architecture-specific bottlenecks, SpeechSplit2.0 adapts to different datasets while achieving comparable conversion quality.
Methodology
The proposed methodology involves:
- Content Encoder Input: Pitch information is removed with a pitch smoother that operates on the F0 contour extracted by the WORLD vocoder, averaging F0 to flatten pitch dynamics, while Vocal Tract Length Perturbation (VTLP) warps the frequency axis to perturb timbre. The result is then randomly resampled to corrupt rhythm information, so that only content survives for the content encoder (a preprocessing sketch follows this list).
- Rhythm Encoder Input: A spectral envelope obtained by cepstral liftering of the real cepstrum supplies rhythm cues while stripping away content, pitch, and timbre detail (see the liftering sketch below).
- Pitch Encoder Input: The pitch encoder keeps the minimal input transformation of the original model: the F0 contour is normalized and randomly resampled, with no additional processing required (see the normalization and resampling sketch below).
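A minimal sketch of the content-encoder preprocessing, assuming the pyworld bindings to the WORLD vocoder and numpy. The per-voiced-run F0 averaging and the simple linear frequency warp are illustrative choices; `smooth_pitch` and `vtlp` are hypothetical helper names, not the authors' code.

```python
import numpy as np
import pyworld as pw  # WORLD vocoder bindings

def smooth_pitch(wav: np.ndarray, sr: int) -> np.ndarray:
    """Flatten pitch dynamics by replacing F0 in each voiced run with its mean."""
    # WORLD expects a float64 mono waveform.
    f0, t = pw.dio(wav, sr)             # coarse F0 contour (0 = unvoiced)
    f0 = pw.stonemask(wav, f0, t, sr)   # refined F0
    smoothed = f0.copy()
    voiced = f0 > 0
    # Split frame indices into runs of constant voicing; average within voiced runs.
    boundaries = np.flatnonzero(np.diff(voiced.astype(int))) + 1
    for run in np.split(np.arange(len(f0)), boundaries):
        if voiced[run[0]]:
            smoothed[run] = f0[run].mean()
    return smoothed

def vtlp(spec_env: np.ndarray, alpha: float) -> np.ndarray:
    """Perturb timbre by warping the frequency axis of a (frames x bins)
    spectral envelope by factor alpha. A single linear warp is used here;
    VTLP proper uses piecewise-linear maps that preserve the band edges."""
    n_bins = spec_env.shape[1]
    src = np.arange(n_bins)
    warped = np.clip(src / alpha, 0, n_bins - 1)
    return np.stack([np.interp(warped, src, frame) for frame in spec_env])
```

Random resampling, the final step for the content-encoder input, is sketched together with the pitch-encoder input below.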
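For the rhythm-encoder input, a minimal sketch of spectral-envelope extraction by liftering the real cepstrum, assuming numpy and a magnitude spectrogram `mag` of shape (frames, n_fft // 2 + 1); the lifter cutoff `n_lift` is an illustrative hyperparameter, not the paper's value.

```python
import numpy as np

def lifter_envelope(mag: np.ndarray, n_lift: int = 30) -> np.ndarray:
    """Keep only low-quefrency cepstral coefficients, discarding pitch harmonics
    so that the smooth spectral envelope remains."""
    log_mag = np.log(np.maximum(mag, 1e-8))
    cepstrum = np.fft.irfft(log_mag, axis=1)   # real cepstrum per frame
    lifter = np.zeros(cepstrum.shape[1])
    lifter[:n_lift] = 1.0
    lifter[-n_lift + 1:] = 1.0                 # keep the symmetric low-quefrency part
    envelope_log = np.fft.rfft(cepstrum * lifter, axis=1).real
    return np.exp(envelope_log)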
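And a minimal sketch of the pitch-encoder input pipeline: per-utterance normalization of voiced log-F0, followed by segment-wise random resampling to corrupt rhythm. Assumes numpy; the segment-length and stretch ranges are illustrative defaults, not values from the paper.

```python
import numpy as np

def normalize_f0(f0: np.ndarray) -> np.ndarray:
    """Normalize log-F0 over voiced frames to zero mean and unit variance."""
    voiced = f0 > 0
    log_f0 = np.zeros_like(f0)
    log_f0[voiced] = np.log(f0[voiced])
    mean, std = log_f0[voiced].mean(), log_f0[voiced].std() + 1e-8
    log_f0[voiced] = (log_f0[voiced] - mean) / std
    return log_f0

def random_resample(x: np.ndarray, rng: np.random.Generator,
                    seg_len=(19, 32), stretch=(0.5, 1.5)) -> np.ndarray:
    """Split x into random-length segments and linearly stretch each one,
    corrupting rhythm while preserving the local contour shape."""
    out, i = [], 0
    while i < len(x):
        n = int(rng.integers(*seg_len))
        seg = x[i:i + n]
        new_len = max(1, int(round(len(seg) * rng.uniform(*stretch))))
        # Linear interpolation onto the new segment length.
        out.append(np.interp(np.linspace(0, len(seg) - 1, new_len),
                             np.arange(len(seg)), seg))
        i += n
    return np.concatenate(out)
```

The same segment-wise resampling can be applied to the content-encoder input after pitch smoothing and VTLP.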
Evaluation and Results
The authors performed thorough evaluations comparing SpeechSplit2.0 with the original SpeechSplit model in configurations with both wide and narrow bottlenecks. Conversion ability was assessed subjectively through perception tests on Amazon Mechanical Turk and objectively with character error rate (CER) measurements. The results showed:
- Subjective Performance: SpeechSplit2.0 matched the conversion performance of carefully tuned SpeechSplit models on all three aspects (pitch, rhythm, and timbre), regardless of bottleneck width, highlighting its robustness.
- Objective Performance: CER measurements revealed a trade-off: while SpeechSplit2.0 converted each aspect successfully, naturalness and intelligibility were slightly degraded by signal-processing artifacts. Wider bottlenecks yielded better intelligibility, suggesting flexibility for applications that demand higher precision (a minimal CER sketch follows this list).
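For concreteness, a minimal sketch of the CER metric itself, assuming the converted utterances have already been transcribed by an ASR system (the transcription step, which the paper's pipeline performs first, is outside this sketch).

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character-level Levenshtein distance divided by reference length."""
    r, h = list(reference), list(hypothesis)
    # Dynamic-programming edit-distance table.
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(r)][len(h)] / max(len(r), 1)

# Higher CER indicates lower intelligibility of the converted speech.
print(cer("call stella", "cal stela"))  # 2 edits / 11 chars ~ 0.18
```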
Conclusion and Future Work
SpeechSplit2.0 advances unsupervised voice conversion by replacing architecture-tuned bottlenecks with front-end signal processing, making the system robust to differences across datasets and contributing toward generalized voice conversion. The authors suggest refining the pitch smoothing strategy and exploring bottleneck mechanisms that preserve content while improving naturalness and intelligibility. Extending disentanglement to further speech aspects, such as emotion and accent, could broaden its applicability.