- The paper introduces a training-free approach, Factorized MKL-VC, which leverages factorized optimal transport to overcome limitations of kNN-VC, especially with short reference audio.
- It partitions high-dimensional WavLM embeddings into K-dimensional groups and applies Monge-Kantorovich Linear maps to efficiently align source and reference distributions.
- Experimental results show improved content preservation and speaker similarity in both intra- and cross-lingual settings, making it especially effective for low-resource applications.
This paper (2506.09709) introduces Factorized MKL-VC, a training-free approach for any-to-any voice conversion (VC) that modifies the kNN-VC pipeline. The goal of any-to-any VC is to alter the voice identity of a source speaker to match a reference speaker while preserving the original linguistic content. While kNN-VC [baas23_interspeech] achieved state-of-the-art results using k-nearest neighbors on WavLM embeddings with long reference audio (minutes), its performance degrades significantly with shorter references (under 1 minute) and can struggle with cross-lingual conversion by introducing language-specific pronunciation artifacts from the reference.
Factorized MKL-VC addresses these limitations by replacing the kNN regression step with a conversion based on optimal transport (OT) theory, specifically using a factorized version of the Monge-Kantorovich Linear (MKL) map. The method follows a standard "encoder-converter-vocoder" pipeline, using the pre-trained WavLM-Large model [chen2022wavlm] as the encoder and HiFi-GAN [kong2020hifi] as the vocoder. The conversion logic operates on the latent embeddings produced by WavLM.
The core observation underpinning MKL-VC is the structure of WavLM embeddings. The authors found that while the embeddings are high-dimensional (1024 dimensions), the variability across dimensions is non-uniform: only a subset of dimensions exhibits significant standard deviation across time, and these dominate distance metrics such as L2 or cosine similarity. The remaining dimensions, despite having low variance, are crucial for reconstruction quality and cannot simply be discarded. This non-uniform variance makes a straightforward application of standard optimal transport methods suboptimal, as the transport plan would largely ignore the low-variance dimensions.
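This variance structure is easy to inspect directly. The sketch below uses synthetic embeddings with non-uniform per-dimension scales in place of real WavLM features (the array shapes, scales, and variable names are illustrative, not from the paper's code):

```python
import numpy as np

# Hypothetical stand-in for WavLM embeddings: T frames x N dimensions.
# In the paper, per-dimension std is computed over a large corpus
# (e.g., LibriSpeech train-clean-100); random data illustrates the idea.
rng = np.random.default_rng(0)
N = 8  # WavLM-Large actually has N = 1024
scales = np.array([5.0, 0.1, 3.0, 0.05, 1.0, 0.2, 4.0, 0.01])
embeddings = rng.normal(size=(1000, N)) * scales

# Per-dimension standard deviation across time, and a descending sort order:
# the first few dimensions carry most of the variance.
stds = embeddings.std(axis=0)
order = np.argsort(stds)[::-1]

print(order)  # high-variance dimensions come first
```

Sorting by this order is what lets the factorized method group dimensions of comparable variance into the same K-dimensional block.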
To overcome this, Factorized MKL-VC employs a novel factorized optimal transport strategy. The process involves:
- Sorting the WavLM embedding dimensions based on their standard deviation computed over a large dataset (e.g., LibriSpeech train-clean-100).
- Partitioning the sorted dimensions into N/K groups, each containing K dimensions, where N=1024 is the total dimension and K is the MKL dimension parameter.
- Assuming that the distribution of embedding values within each K-dimensional group can be approximated by a multivariate Gaussian distribution (an assumption supported by Wasserstein distance analysis in the paper).
- For each group, computing the mean and covariance matrix for both the source and reference embedding segments.
- Applying the Monge-Kantorovich Linear (MKL) map, which provides the closed-form optimal transport solution between two Gaussian distributions, independently within each K-dimensional group. The MKL map $T^{(i)}$ for the i-th group transforms a source embedding sub-vector $x^{(i)}$ to a target-like sub-vector $T^{(i)}(x^{(i)})$ using the means ($\mu_1^{(i)}, \mu_2^{(i)}$) and covariance matrices ($\Sigma_1^{(i)}, \Sigma_2^{(i)}$) of the source and reference distributions for that group:

$$T^{(i)}(x^{(i)}) = \mu_2^{(i)} + (\Sigma_1^{(i)})^{-1/2}\left((\Sigma_1^{(i)})^{1/2}\,\Sigma_2^{(i)}\,(\Sigma_1^{(i)})^{1/2}\right)^{1/2}(\Sigma_1^{(i)})^{-1/2}\,\bigl(x^{(i)} - \mu_1^{(i)}\bigr)$$
- Concatenating the transformed sub-vectors from all groups to form the final converted embedding vector.
The factorized map T is thus the direct product of these K-dimensional MKL maps:
$$T(x) = \left[T^{(1)}(x^{(1)}),\,\ldots,\,T^{(N/K)}(x^{(N/K)})\right]$$
This factorized approach allows the optimal transport to operate effectively across all dimensions by handling them in smaller, more tractable groups, preserving information from both high- and low-variance components. The process requires only computing the mean and covariance of the source and reference embeddings, sorting dimensions (a one-time pre-computation), and applying the MKL map; no training is involved.
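The per-group MKL map can be sketched in a few lines of NumPy. The function names (`_sqrtm_psd`, `mkl_map`) and the demo Gaussians below are illustrative choices, not the paper's implementation:

```python
import numpy as np

def _sqrtm_psd(M):
    """Principal square root of a symmetric PSD matrix via eigendecomposition."""
    w, V = np.linalg.eigh(M)
    return (V * np.sqrt(np.clip(w, 0.0, None))) @ V.T

def mkl_map(x, mu1, sigma1, mu2, sigma2):
    """Closed-form OT (MKL) map between two Gaussians, applied row-wise
    to x of shape (T, K)."""
    s1_half = _sqrtm_psd(sigma1)
    s1_inv_half = np.linalg.inv(s1_half)
    # A = Sigma1^{-1/2} (Sigma1^{1/2} Sigma2 Sigma1^{1/2})^{1/2} Sigma1^{-1/2}
    A = s1_inv_half @ _sqrtm_psd(s1_half @ sigma2 @ s1_half) @ s1_inv_half
    return mu2 + (x - mu1) @ A.T

# Sanity check: samples from N(mu1, Sigma1) should land on N(mu2, Sigma2).
rng = np.random.default_rng(1)
mu1, mu2 = np.zeros(2), np.array([1.0, -1.0])
sigma1 = np.array([[2.0, 0.3], [0.3, 1.0]])
sigma2 = np.array([[0.5, -0.1], [-0.1, 1.5]])
x = rng.multivariate_normal(mu1, sigma1, size=50_000)
y = mkl_map(x, mu1, sigma1, mu2, sigma2)
```

One can verify algebraically that the pushforward covariance $A \Sigma_1 A^\top$ equals $\Sigma_2$, which is why the map transports one Gaussian exactly onto the other.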
For practical implementation, you would:
- Load the pre-trained WavLM-Large encoder and HiFi-GAN vocoder.
- Compute the standard deviations of WavLM embedding dimensions across time on a large dataset (this needs to be done once). Sort dimensions based on these standard deviations.
- For a given source audio and a reference audio (as short as 5-10 seconds):
- Encode both using WavLM to get sequences of embeddings.
- Compute the mean vector and covariance matrix for the source embeddings and the reference embeddings.
- Partition the mean vectors and covariance matrices according to the pre-computed sorted dimensions and the chosen block size K.
- For each K-dimensional block:
- Compute the matrix square roots and inverses required for the MKL formula.
- Construct the MKL transformation matrix and offset.
- Apply the computed MKL transformation to each source embedding vector block-wise.
- Concatenate the transformed blocks to get the converted embedding sequence.
- Decode the converted embedding sequence using HiFi-GAN to produce the output waveform.
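The statistics-and-transform steps above (everything between encoding and vocoding) can be sketched end to end with NumPy, using random arrays in place of actual WavLM embedding sequences. The function name `factorized_mkl_convert` and the ridge term `eps` (added to keep covariances invertible for very short references) are my additions, not from the paper:

```python
import numpy as np

def _sqrtm_psd(M):
    """Principal square root of a symmetric PSD matrix."""
    w, V = np.linalg.eigh(M)
    return (V * np.sqrt(np.clip(w, 0.0, None))) @ V.T

def factorized_mkl_convert(src, ref, order, K, eps=1e-5):
    """Block-wise MKL conversion of source embeddings toward the reference
    distribution. src: (T_src, N), ref: (T_ref, N); `order` is the
    precomputed variance-sorted dimension index; K is the block size."""
    N = src.shape[1]
    out = np.empty_like(src)
    src_s, ref_s = src[:, order], ref[:, order]
    for start in range(0, N, K):
        blk = slice(start, start + K)
        xs, xr = src_s[:, blk], ref_s[:, blk]
        mu1, mu2 = xs.mean(axis=0), xr.mean(axis=0)
        # Small ridge keeps the K x K covariances invertible.
        s1 = np.cov(xs.T) + eps * np.eye(K)
        s2 = np.cov(xr.T) + eps * np.eye(K)
        s1h = _sqrtm_psd(s1)
        s1ih = np.linalg.inv(s1h)
        A = s1ih @ _sqrtm_psd(s1h @ s2 @ s1h) @ s1ih
        # Write the transformed block back into the original dimension order.
        out[:, order[blk]] = mu2 + (xs - mu1) @ A.T
    return out

# Demo with random stand-ins for source/reference embedding sequences.
rng = np.random.default_rng(2)
N, K = 8, 2
src = rng.normal(loc=0.0, scale=1.0, size=(400, N))
ref = rng.normal(loc=2.0, scale=0.5, size=(300, N))
# In practice `order` comes from a large corpus, not the source clip itself.
order = np.argsort(src.std(axis=0))[::-1]
converted = factorized_mkl_convert(src, ref, order, K)
```

Since each block's map sends the source sample mean exactly to the reference sample mean, the converted sequence matches the reference statistics by construction; in the real pipeline `converted` would then be passed to HiFi-GAN.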
The parameter K is the main tuning knob for MKL-VC. A smaller K (e.g., K=2) yields a closer Gaussian approximation within each block (smaller Wasserstein distance between the empirical and Gaussian distributions), which tends to improve content preservation (lower WER/CER). Increasing K tends to improve speaker similarity (higher SIM) but can hurt content intelligibility. The paper suggests K=2 or K=8 provides a good balance.
Experimental results on LibriSpeech (intra-lingual) and FLEURS (cross-lingual German-French) datasets demonstrate MKL-VC's effectiveness with short reference audio (5-10s). Objective metrics show MKL-VC (particularly with K=2) achieving a better overall score than kNN-VC and competitive performance with FACodec [ju2024naturalspeech], a state-of-the-art model that requires significant training data or longer reference audio. Crucially, MKL-VC significantly improves content preservation (lower WER/CER) compared to kNN-VC with short references, while maintaining good speaker similarity.
For cross-lingual conversion, MKL-VC also performs comparably to FACodec. Subjective evaluations, in which native speakers of various languages (including low-resource ones such as Sakha) had their speech converted to a Japanese reference, indicate that MKL-VC produces natural-sounding speech without the "robotic voice" effect sometimes associated with other methods. MKL-VC and FACodec were generally preferred over kNN-VC and Diff-VC.
Practical Applications:
- Low-resource language support: As a training-free method that works well cross-lingually with short references, MKL-VC is highly applicable to languages lacking large speech corpora required for training other VC models.
- Dubbing and content localization: Enables efficient voice conversion for dubbing videos or audio content into various languages using limited reference audio from target speakers.
- Personalized Text-to-Speech (TTS): Can potentially be used to adapt a standard TTS voice to a target speaker's voice using just a few seconds of their speech.
- Accessibility tools: Creating synthetic voices for individuals with speech impairments based on recordings of their voice prior to impairment.
Implementation Considerations & Trade-offs:
- Computational Requirements: Encoding and decoding take the most time, but computing statistics and applying the MKL map involve matrix operations (inverses, square roots) on K×K matrices, which is feasible for small K. Pre-computing sorted dimensions requires a large corpus and offline processing.
- Memory Requirements: Modest; only the reference embeddings (needed to compute statistics) and the precomputed sorted dimension indices must be held in memory.
- Parameter Choice (K): Requires empirical tuning based on whether content intelligibility or speaker similarity is prioritized for the specific application.
- Encoder Dependency: The method's effectiveness relies on the specific structure of WavLM embeddings and the assumption of approximate Gaussianity within partitioned dimensions. Applying this to embeddings from different encoders (HuBERT, etc.) would require validating this assumption.
- Latency: As a non-streaming method, it processes the entire source and reference audio segments to compute statistics and apply the map. Real-time or low-latency applications might require buffering.
In summary, Factorized MKL-VC provides a computationally efficient, training-free approach to high-quality voice conversion, particularly robust with short reference audio and effective for cross-lingual scenarios, making it a promising technique for practical applications, especially in low-resource settings.