S2VC: A Framework for Any-to-Any Voice Conversion with Self-Supervised Pretrained Representations
The paper introduces S2VC, a novel framework for any-to-any voice conversion (VC) that integrates self-supervised learning (SSL) representations into the conversion pipeline. Any-to-any VC is a challenging task: the timbre of a source speaker must be converted to that of a target speaker without parallel data, for speakers either seen or unseen during training. Previous methods such as AutoVC, AdaIN-VC, and FragmentVC focus on disentangling content and speaker information through dedicated encoders and pretrained features, but each has its own limitations, particularly in unseen-speaker scenarios.
Methodology and Framework
The proposed S2VC framework builds on the fragment-based, cross-attention design of FragmentVC, but employs SSL features for both the source and the target inputs. The architecture consists of a source encoder, a target encoder, cross-attention modules, and a decoder. SSL representations such as Autoregressive Predictive Coding (APC), Contrastive Predictive Coding (CPC), and wav2vec 2.0 serve as the feature extractors. These pretrained models are favored because they capture nuanced phonetic and speaker-dependent characteristics from large-scale unlabeled speech corpora, removing the need for extensive manual annotation.
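The sketch below illustrates this pipeline in PyTorch: source and target SSL features pass through separate encoders, a cross-attention block lets source frames attend to target frames, and a decoder predicts mel-spectrogram frames for a vocoder. Layer sizes, the use of simple Conv1d encoders, and a single attention block are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of an S2VC-style model; dimensions and layers are assumptions.
import torch
import torch.nn as nn


class S2VCSketch(nn.Module):
    def __init__(self, ssl_dim=256, d_model=512, n_mels=80):
        super().__init__()
        # Source/target encoders map SSL features (e.g. CPC) into a shared space.
        self.source_encoder = nn.Conv1d(ssl_dim, d_model, kernel_size=3, padding=1)
        self.target_encoder = nn.Conv1d(ssl_dim, d_model, kernel_size=3, padding=1)
        # Instance normalization on the source stream suppresses speaker statistics.
        self.instance_norm = nn.InstanceNorm1d(d_model)
        # Cross-attention: source frames query target frames for timbre information.
        self.cross_attention = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)
        # Decoder predicts mel-spectrogram frames for a separately trained vocoder.
        self.decoder = nn.Sequential(
            nn.Linear(d_model, d_model), nn.ReLU(), nn.Linear(d_model, n_mels)
        )

    def forward(self, src_feats, tgt_feats):
        # src_feats, tgt_feats: (batch, time, ssl_dim) extracted SSL representations.
        src = self.instance_norm(self.source_encoder(src_feats.transpose(1, 2))).transpose(1, 2)
        tgt = self.target_encoder(tgt_feats.transpose(1, 2)).transpose(1, 2)
        fused, attn = self.cross_attention(query=src, key=tgt, value=tgt)
        return self.decoder(fused), attn


# Usage with random tensors standing in for extracted SSL features.
model = S2VCSketch()
src = torch.randn(1, 200, 256)   # source utterance features
tgt = torch.randn(1, 400, 256)   # target speaker features
mel, attn = model(src, tgt)
print(mel.shape)                 # torch.Size([1, 200, 80])
```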
Experimental Setup and Performance
The authors conducted experiments on the CSTR VCTK corpus and evaluated performance both when speakers were seen during training (s2s) and when they were unseen (u2u). Objective evaluation used MOSNet predictions for audio quality and a speaker verification (SV) system for speaker similarity. Notably, using CPC as both source and target features yielded the best results on both metrics, outperforming traditional representations such as PPG and even wav2vec 2.0.
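For intuition on the SV-based metric, the sketch below compares speaker embeddings of a converted and a target utterance with cosine similarity against an acceptance threshold. The embedding extractor and the threshold value are placeholders for illustration, not the paper's specific SV system or operating point.

```python
# Hedged sketch of a speaker-similarity check via cosine similarity of
# speaker-verification embeddings; the embeddings and threshold are placeholders.
import torch
import torch.nn.functional as F


def speaker_accept(converted_emb: torch.Tensor,
                   target_emb: torch.Tensor,
                   threshold: float = 0.68) -> bool:
    """Return True if the converted utterance is accepted as the target speaker.

    The threshold would normally be tuned on held-out data for a fixed
    false-acceptance rate; 0.68 is an arbitrary illustrative value.
    """
    similarity = F.cosine_similarity(converted_emb, target_emb, dim=-1)
    return bool(similarity.item() > threshold)


# Usage with random embeddings standing in for real SV encoder outputs.
converted_emb = torch.randn(256)
target_emb = torch.randn(256)
print(speaker_accept(converted_emb, target_emb))
```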
Subjective evaluations corroborate these findings, with the CPC+CPC configuration achieving higher perceived quality and speaker similarity than competing models. Ablation studies highlight the roles of self-attention pooling, the attention information bottleneck, and instance normalization, showing that together they refine the cross-attention alignment and yield more robust feature mappings.
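Of these components, self-attention pooling is the simplest to picture: frame-level target features are collapsed into a single weighted summary using learned attention scores. The following minimal sketch, with illustrative dimensions, shows the mechanism.

```python
# Minimal sketch of self-attention pooling over target-encoder outputs.
import torch
import torch.nn as nn


class SelfAttentionPooling(nn.Module):
    def __init__(self, d_model: int = 512):
        super().__init__()
        # A single learned scoring layer produces one scalar weight per frame.
        self.score = nn.Linear(d_model, 1)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, time, d_model) frame-level features.
        weights = torch.softmax(self.score(feats), dim=1)   # (batch, time, 1)
        return (weights * feats).sum(dim=1)                 # (batch, d_model)


pool = SelfAttentionPooling()
pooled = pool(torch.randn(2, 400, 512))
print(pooled.shape)  # torch.Size([2, 512])
```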
Implications and Future Directions
The findings mark a significant step forward in the use of SSL features for any-to-any voice conversion, especially with unseen speakers, a hallmark challenge for earlier models. From both practical and theoretical standpoints, S2VC points toward more adaptive, robust, and efficient voice conversion strategies. Given its modular and flexible architecture, the framework lends itself to the integration of other SSL features for improved VC performance.
As self-supervised learning techniques continue to evolve, future research may combine multiple SSL representations for richer, more comprehensive feature abstraction, potentially addressing the remaining challenges of unseen-speaker adaptation and varying acoustic conditions in VC tasks.