Synthetic Parallel Persian-English Speech Generation

Updated 23 November 2025
  • The paper introduces a neural pipeline that automatically translates Persian text to conversational English and synthesizes it into natural speech using zero-shot TTS.
  • It applies discrete unit representations via HuBERT quantization to convert continuous audio into scalable, symbolic features, improving model alignment.
  • The synthetic corpus yields a +4.6 ASR-BLEU gain over the strongest baseline, addressing data-scarcity challenges in low-resource Persian–English speech applications.

Synthetic parallel Persian-English speech generation refers to automatic construction of paired spoken utterances in Persian and English that align at the utterance level, supporting data-scarce applications such as direct speech-to-speech translation (S2ST), multilingual TTS, and voice conversion. Recent advances exploit LLMs for translation, zero-shot neural text-to-speech (TTS) for audio synthesis, and discrete unit-based representations to create extensible and scalable pipelines, mitigating the paucity of human-recorded parallel resources for low-resource pairs such as Persian–English. These synthetic corpora enable effective supervised or end-to-end neural training for S2ST, evaluation, and benchmarking under low-resource constraints (Rashidi et al., 16 Nov 2025).

1. Motivation and Challenges in Parallel Speech Generation

Direct speech-to-speech translation (S2ST) models benefit from parallel speech corpora in both source and target languages; however, such data is rarely available for Persian–English. Synthetic generation addresses:

  • Data scarcity: Parallel Persian–English speech corpora are limited (CVSS Fa–En ≈ 20 hours, ≈11k utterances), restricting supervised S2ST model capacity (Rashidi et al., 16 Nov 2025).
  • Utterance-aligned requirements: Direct S2ST bypasses intermediate text, but models require large-scale utterance-level alignment of audio pairs—difficult to obtain via manual collection for low-resource languages.
  • Speaker and domain diversity: Human-annotated data often lacks broad coverage; synthetic pipelines can scale to more speakers and content domains.

Synthetic parallel generation thus targets three core outcomes: increased data scale, accurate utterance-level alignment, and naturalness sufficient for downstream model training.

2. Automated Parallel Speech Generation Pipeline

(Rashidi et al., 16 Nov 2025) details a data pipeline that converts monolingual Persian speech corpora into parallel Persian–English speech pairs through sequential neural translation and synthesis:

  1. Collection and Preprocessing: Starts from cleaned Persian speech utterances and transcriptions (e.g., ∼20k–30k utterances filtered from Common Voice Fa).
  2. Machine Translation Step: Each Persian transcription is translated into spoken-style English using GPT-4o in a zero-shot or few-shot setup. The system prompt enforces fluent, semantically faithful conversational English. No explicit prompt tuning (e.g., temperature, exemplars) is specified, likely relying on default GPT-4o settings.
  3. Text-to-Speech Synthesis: The translated English sentence is synthesized into speech using VoiceCraft, a state-of-the-art zero-shot TTS model with style conditioning and speaker adaptation via reference-encoder modules. The output is described as natural-sounding, though no formal quality metrics (e.g., MOS, MCD) are reported.
  4. Corpus Assembly: Resulting pairs are aligned at the utterance level, producing synthetic English waveforms paired with the original Persian speech.

This workflow scales parallel data availability by approximately sixfold (to ≈120 hours), without manual annotation or human English speech recording. No explicit filtering or evaluation of GPT or TTS outputs is applied beyond initial text quality control.
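
The translate-then-synthesize loop above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the OpenAI Python client for the GPT-4o step, and the `synthesize` stub, its arguments, and the system prompt text are placeholders for the unspecified VoiceCraft interface and prompt.

```python
# Sketch of the translate-then-synthesize loop (illustrative, not the
# authors' code). Assumes the OpenAI Python client for GPT-4o; the
# `synthesize` stub stands in for a zero-shot TTS call such as VoiceCraft.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_PROMPT = (
    "Translate the Persian sentence into fluent, conversational spoken-style "
    "English. Preserve the meaning and return only the translation."
)

def translate_fa_to_en(persian_text: str) -> str:
    """Zero-shot spoken-style Persian-to-English translation with GPT-4o."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": persian_text},
        ],
    )
    return response.choices[0].message.content.strip()

def synthesize(english_text: str, reference_wav: str) -> bytes:
    """Placeholder for the zero-shot TTS step (e.g., VoiceCraft inference)."""
    raise NotImplementedError("plug in the TTS system of choice here")

def build_parallel_pair(persian_wav: str, persian_text: str, reference_wav: str):
    """Return one utterance-aligned (Persian audio, synthetic English audio) pair."""
    english_text = translate_fa_to_en(persian_text)
    english_audio = synthesize(english_text, reference_wav)
    return persian_wav, english_audio
```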

3. Discrete Speech Units and Representation

Downstream S2ST models benefit from representing target English speech as sequences of discrete units rather than waveforms or continuous features. (Rashidi et al., 16 Nov 2025) adopts the following encoding:

  • Feature extraction: Frame-level features are computed from target English audio using pretrained HuBERT (hidden-unit BERT) models.
  • Quantization: k-means clustering is applied to HuBERT embeddings to create a codebook of size $K$ (prior work uses $K \in \{100, 200, 500\}$), mapping each frame to a unit index.
  • Unit sequence: Each English utterance is then encoded as $u = (z_1, z_2, \dots, z_T)$, where $z_t$ is the discrete index at frame $t$.

The clustering objective is:

$$\min_{C \in \mathbb{R}^{K \times d},\; z_i \in \{1,\dots,K\}} \sum_{i=1}^{N} \lVert h_i - C_{z_i} \rVert_2^2$$

where $h_i$ is the HuBERT embedding of frame $i$, $C_j$ is the $j$-th centroid, and $z_i$ is the index assignment. This quantization enables S2ST models to learn a mapping between source features and symbolic target units, facilitating modular waveform decoding.
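
A minimal sketch of this quantization step, assuming frame-level HuBERT embeddings have already been extracted into NumPy arrays; scikit-learn's KMeans stands in for the clustering, and feature extraction, layer choice, and any collapsing of repeated units are omitted.

```python
# Quantize HuBERT frame embeddings into discrete unit sequences.
# Assumes `corpus_feats` is an (N, d) array of frame-level HuBERT embeddings
# sampled from the training corpus and `utt_feats` is the (T, d) matrix for
# one English utterance; feature extraction itself is omitted.
import numpy as np
from sklearn.cluster import KMeans

K = 100  # codebook size; prior work uses 100, 200, or 500

def fit_codebook(corpus_feats: np.ndarray, k: int = K) -> KMeans:
    """Learn k centroids C over corpus-level HuBERT embeddings."""
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit(corpus_feats)

def encode_units(utt_feats: np.ndarray, codebook: KMeans) -> np.ndarray:
    """Map each frame to its nearest centroid: u = (z_1, ..., z_T)."""
    return codebook.predict(utt_feats)

# Example with random stand-in features (HuBERT-base frames are 768-dim):
rng = np.random.default_rng(0)
corpus_feats = rng.standard_normal((5000, 768)).astype(np.float32)
codebook = fit_codebook(corpus_feats)
units = encode_units(corpus_feats[:200], codebook)  # length-200 unit sequence
```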

4. Neural Vocoding from Discrete Units

A neural vocoder reconstructs waveforms from discrete target unit sequences. (Rashidi et al., 16 Nov 2025) employs HiFi-GAN, a generative adversarial network trained on LJSpeech, to synthesize audio from unit-derived embeddings.

  • Network design: HiFi-GAN uses a generator $G$ (upsampling unit sequences to audio) and multi-period, multi-scale discriminators $D$.
  • Losses: The training objective combines adversarial (LSGAN-style), feature-matching, and mel-spectrogram ($L_1$) reconstruction losses; a short sketch of the combination follows below:

$$\mathcal{L}_{\text{vocoder}} = \mathcal{L}_{\text{adv}} + \lambda_{\text{FM}}\,\mathcal{L}_{\text{FM}} + \lambda_{\text{mel}}\,\mathcal{L}_{\text{mel}}$$

with typical $\lambda_{\text{FM}} = 10$ and $\lambda_{\text{mel}} = 45$. The full waveform is synthesized as $\hat{x} = G(u)$. Default HiFi-GAN “V1” hyperparameters are used.
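
The combined generator objective reduces to a weighted sum of the three terms; a minimal PyTorch sketch, assuming the adversarial, feature-matching, and mel losses have already been computed (generator and discriminator code omitted):

```python
# Weighted combination of the HiFi-GAN generator losses quoted above.
# Assumes `loss_adv`, `loss_fm`, and `loss_mel` are precomputed scalar tensors.
import torch

def vocoder_generator_loss(
    loss_adv: torch.Tensor,
    loss_fm: torch.Tensor,
    loss_mel: torch.Tensor,
    lambda_fm: float = 10.0,
    lambda_mel: float = 45.0,
) -> torch.Tensor:
    """L_vocoder = L_adv + lambda_FM * L_FM + lambda_mel * L_mel."""
    return loss_adv + lambda_fm * loss_fm + lambda_mel * loss_mel

# Example with placeholder values:
total = vocoder_generator_loss(torch.tensor(0.9), torch.tensor(1.2), torch.tensor(0.4))
```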

This decoupling enables high-quality, speaker-independent waveform synthesis from the unit representation, supporting rapid scalable corpus generation and downstream model training.

5. Model Training, Objectives, and Evaluation

The synthetic corpus supports direct S2ST training using a conformer-based encoder (initialized from self-supervised pre-training) to map Persian speech to high-level representations, and a causal transformer decoder to predict English discrete units.

  • Training loss: Model training minimizes the negative log-likelihood of the target unit sequence, with optional label smoothing (see the sketch after this list):

$$\mathcal{L}_{\text{S2ST}} = -\sum_{t=1}^{T} \log p_\theta(z_t \mid z_{<t},\, \mathrm{Enc}(x))$$

$$\mathcal{L}_{\text{CE}} = -(1-\varepsilon)\log p_\theta(z_t) - \frac{\varepsilon}{K-1}\sum_{k \neq z_t}\log p_\theta(k)$$

  • Evaluation metric: ASR-BLEU score is used; the BLEU score is calculated as:

$$\mathrm{BLEU} = \mathrm{BP} \cdot \exp\left(\sum_{n=1}^{4} w_n \log p_n\right)$$

where $p_n$ is the modified n-gram precision, $w_n = 1/4$, and $\mathrm{BP}$ is the brevity penalty.
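
A minimal PyTorch sketch of the label-smoothed unit-prediction loss, assuming decoder `logits` of shape (T, K) and integer target units `z` of length T; the smoothing value ε = 0.1 is an assumption, since the paper does not state it.

```python
# Label-smoothed negative log-likelihood over discrete target units.
# Assumes `logits` has shape (T, K) and `z` holds the T target unit indices;
# encoder/decoder architecture and batching are omitted. epsilon = 0.1 is an
# assumed value, not taken from the paper.
import torch
import torch.nn.functional as F

def label_smoothed_nll(logits: torch.Tensor, z: torch.Tensor,
                       epsilon: float = 0.1) -> torch.Tensor:
    log_probs = F.log_softmax(logits, dim=-1)                 # (T, K)
    K = logits.size(-1)
    nll = -log_probs.gather(1, z.unsqueeze(1)).squeeze(1)     # -log p(z_t)
    sum_log_p = log_probs.sum(dim=-1)                         # sum_k log p(k)
    sum_other = sum_log_p + nll                               # sum_{k != z_t} log p(k)
    smooth = -sum_other / (K - 1)
    return ((1.0 - epsilon) * nll + epsilon * smooth).sum()

# Example: T = 5 decoder steps, K = 100 units
logits = torch.randn(5, 100)
z = torch.randint(0, 100, (5,))
loss = label_smoothed_nll(logits, z)
```

ASR-BLEU itself is obtained by transcribing the synthesized English audio with an ASR system and scoring the transcripts against the reference translations; the BLEU component can be computed with a standard toolkit such as sacrebleu.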

Empirically, training on the synthetic corpus raises the proposed model from 4.1 to 17.8 ASR-BLEU on CVSS Fa–En, a +4.6 point gain over the strongest baseline trained on the same data, demonstrating that synthesizing roughly sixfold more parallel data yields substantial accuracy gains and outperforms previous direct baselines such as Translatotron 2 (Rashidi et al., 16 Nov 2025). No auxiliary CTC or alignment loss is used.

6. Unsupervised Polyglot TTS Without Parallel Data

“Unsupervised Polyglot Text-to-Speech” (Nachmani et al., 2019) demonstrates a complementary paradigm for synthetic parallel speech: a single network is trained to transfer speaker identity across multiple languages without parallel speech. Its architecture merges per-language text encoders and speaker encoders into a shared decoder, trained with reconstruction, speaker-preservation, and “polyglot” transfer losses.

  • Loss functions: These include $L_{\mathrm{recon}}$ (reconstruction), a contrastive margin loss, speaker cycle-consistency ($L_{\mathrm{cycle}}$), and a cross-lingual voice embedding transfer loss ($L_{\mathrm{poly}}$); a schematic sketch follows this list:

$$L = L_{\mathrm{recon}} + \lambda_{\mathrm{spk}} L_{\mathrm{spk}} + \lambda_{\mathrm{poly}} L_{\mathrm{poly}}$$

  • Zero-parallel data regime: Training requires only unpaired English and Persian corpora; speaker identity is transferred through embedding alignment losses.
  • Evaluation: Naturalness, speaker similarity (MOS, EER), parallelism, and voice consistency are measured objectively and subjectively.
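
A schematic of the shared-decoder arrangement and combined objective described above; all module choices, dimensions, and loss weights here are illustrative placeholders rather than the authors' architecture.

```python
# Schematic of the polyglot TTS setup: per-language text encoders and a
# speaker encoder feed one shared decoder; the total loss combines
# reconstruction, speaker-preservation, and cross-lingual transfer terms.
# Module choices and weights are illustrative placeholders.
import torch
import torch.nn as nn

class PolyglotTTS(nn.Module):
    def __init__(self, languages=("fa", "en"), text_dim=256, spk_dim=64, n_mels=80):
        super().__init__()
        self.text_encoders = nn.ModuleDict(
            {lang: nn.GRU(text_dim, text_dim, batch_first=True) for lang in languages}
        )
        self.speaker_encoder = nn.GRU(n_mels, spk_dim, batch_first=True)
        self.decoder = nn.GRU(text_dim + spk_dim, n_mels, batch_first=True)  # shared

    def forward(self, lang: str, text_emb: torch.Tensor, ref_mel: torch.Tensor):
        text_h, _ = self.text_encoders[lang](text_emb)               # (B, T, text_dim)
        _, spk_h = self.speaker_encoder(ref_mel)                     # (1, B, spk_dim)
        spk = spk_h[-1].unsqueeze(1).expand(-1, text_h.size(1), -1)  # broadcast speaker
        mel_out, _ = self.decoder(torch.cat([text_h, spk], dim=-1))
        return mel_out                                               # (B, T, n_mels)

def total_loss(l_recon, l_spk, l_poly, lam_spk=1.0, lam_poly=1.0):
    """L = L_recon + lambda_spk * L_spk + lambda_poly * L_poly (weights illustrative)."""
    return l_recon + lam_spk * l_spk + lam_poly * l_poly
```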

This demonstrates a route for generating speaker-consistent Persian–English parallel speech, without requiring any initial utterance pairs, expanding the landscape of synthetic cross-lingual TTS (Nachmani et al., 2019).

7. Impact and Considerations

Synthetic Persian–English parallel speech generation, using neural translation and TTS, significantly amplifies data scale for low-resource S2ST, enabling rigorous model evaluation and improvement. Key empirical findings:

| Model | ASR-BLEU (CVSS only) | ASR-BLEU (CVSS + synthetic) |
|---|---|---|
| Translatotron | 1.4 | 6.9 |
| Speech-to-unit (+ pretraining) | 2.8 | 13.2 |
| Proposed model | 4.1 | 17.8 |

With the synthetic data, the proposed model reaches 17.8 ASR-BLEU, a +4.6 point gain over the strongest direct baseline (13.2) and a +13.7 point gain over CVSS-only training (4.1) (Rashidi et al., 16 Nov 2025). This supports the efficacy of large-scale neural synthetic pipelines in augmenting direct S2ST model training for Persian–English and potentially other low-resource language pairs.

Application domains span direct S2ST, cross-lingual TTS, and multilingual voice conversion. However, there remain open questions regarding long-term quality, diversity, and speaker authenticity in extremely large-scale synthetic corpora. The absence of formal MOS or speaker similarity measurements in the cited work suggests areas for further research, as does the optimization of text-to-speech prompt engineering and speaker reference mechanisms.
