SpeechSynth: Accurate Synthetic Speech Pitch

Updated 2 October 2025
  • SpeechSynth is a synthetic speech dataset that provides perfectly accurate, on-demand pitch curves by eliminating the uncertainties of algorithmic estimation.
  • It employs a LightSpeech neural TTS model with controlled phonetic and tonal synthesis, enabling deterministic, phoneme-level prosodic annotation.
  • The dataset facilitates robust training and benchmarking, supporting high-accuracy models like SwiftF0 through precise, framewise pitch evaluation metrics.

SpeechSynth is a synthetic speech dataset engineered to address critical shortcomings in pitch estimation research by providing perfectly accurate, on-demand ground-truth pitch curves for model development and evaluation (Nieradzik, 25 Aug 2025). The dataset is generated using a phoneme-level text-to-speech (TTS) system and leverages controlled phonetic and tonal specification during synthesis, which enables deterministic and error-free pitch annotation. This capability allows SpeechSynth to bypass the inherent uncertainties and noise in pitch labels typically present in natural speech corpora, which are often derived through algorithmic estimators or noisy laryngograph signals.

1. Dataset Construction and Principles

SpeechSynth employs a LightSpeech neural TTS model, trained on a hybrid corpus comprising 97.48 hours of Mandarin speech from AISHELL-3 and Biaobei. The model is conditioned on a phone inventory of 54 units inclusive of tone markers, supporting fine-grained control over both segmental and suprasegmental properties. Each speech sample in SpeechSynth is synthesized at the phoneme level, and the corresponding pitch contour is exactly specified by the synthesis parameters—this enables the generation of arbitrary phonetic and prosodic scenarios under full supervision.

The construction protocol ensures that each (audio, phonetic sequence) pair is accompanied by its corresponding ground-truth pitch curve by direct synthesis, not estimation or manual annotation. This design eliminates the need for conventional pitch extraction algorithms (such as RAPT or pYIN), whose outputs are prone to voicing errors, octave misclassifications, and tracking artifacts, especially in noisy or expressive speech.
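To make the construction concrete, the following is a minimal sketch of what a SpeechSynth-style sample record might look like; the `synthesize` callable stands in for the LightSpeech TTS step and, like the field names and defaults, is an illustrative assumption rather than the actual SpeechSynth interface.

```python
# Hypothetical sketch of a pitch-labeled sample: the ground-truth contour is the
# contour handed to the synthesizer, never a quantity estimated from the audio.
from dataclasses import dataclass
import numpy as np

@dataclass
class PitchLabeledSample:
    audio: np.ndarray       # synthesized waveform, shape (num_samples,)
    phonemes: list          # phone sequence with tone markers, e.g. ["n", "i3", "h", "ao3"]
    f0_true: np.ndarray     # exact per-frame pitch in Hz (0 for unvoiced frames)
    hop_length: int         # frame hop in samples; frame m covers time m * hop / sample_rate
    sample_rate: int

def make_sample(phonemes, f0_contour, synthesize, sr=16000, hop=256):
    """Build a sample whose label is, by construction, the synthesis input itself."""
    audio = synthesize(phonemes, f0_contour)   # TTS renders exactly this pitch contour
    return PitchLabeledSample(audio, list(phonemes), np.asarray(f0_contour, float), hop, sr)
```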

2. Ground-Truth Pitch Curve Provision

Unlike natural speech datasets, in which pitch curves must be estimated post hoc, SpeechSynth provides the exact pitch trajectory embedded during synthesis. The LightSpeech engine tracks and outputs the intended $f_0$ curve at every frame, ensuring that each sample’s annotation matches the intended phonetic and tonal specification without error. Therefore, the dataset provides continuous pitch contours $f_{\text{true},t}$ for every time frame $t$, enabling framewise evaluation at sample-level precision.

No smoothing or post-processing is required; the pitch annotation represents the true metric used during waveform generation. This property is essential for robust training and benchmarking of monophonic pitch estimation models under both clean and adversarial (noisy) conditions, as any divergence between ground-truth and prediction can be attributed directly to model error.

3. Training and Cross-Domain Evaluation Role

SpeechSynth is integrated into cross-dataset training regimes for models such as SwiftF0, alongside legacy and synthetic corpora including NSynth, PTDB-TUG, MIR-1K, and MDB-STEM-Synth. These combined setups employ 5-fold group cross-validation strategies, where the folds are partitioned according to speaker, instrument, or musical piece identifiers, ensuring robust out-of-group generalization.
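A minimal sketch of such a group-wise split, using scikit-learn's `GroupKFold`; the array shapes and group identifiers below are illustrative placeholders, not the actual experimental setup.

```python
# 5-fold group cross-validation: all samples sharing a group ID (speaker,
# instrument, or musical piece) land in the same fold, so every test fold is
# evaluated strictly out-of-group.
import numpy as np
from sklearn.model_selection import GroupKFold

features = np.random.randn(1000, 128)            # placeholder per-utterance features
targets = np.random.rand(1000)                   # placeholder pitch targets
groups = np.random.randint(0, 50, size=1000)     # speaker / instrument / piece IDs

gkf = GroupKFold(n_splits=5)
for fold, (train_idx, test_idx) in enumerate(gkf.split(features, targets, groups)):
    assert set(groups[train_idx]).isdisjoint(groups[test_idx])   # no group leakage
    print(f"fold {fold}: {len(train_idx)} train / {len(test_idx)} test samples")
```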

For training, SpeechSynth’s perfectly labeled pitch curves augment the pool of noisy and semi-synthetic labels from other datasets, enabling the pitch model to learn both generalizable and precise pitch estimation across a spectrum of real speech, music, and synthetic speech scenarios. For evaluation, a held-out subset of SpeechSynth provides a gold-standard benchmark—differentiating true model performance from the uncertainty introduced by imperfect ground-truth annotation in alternative datasets.
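One plausible way to organize such a regime (an illustrative sketch, not the paper's actual data pipeline) is to pool all corpora for training while holding out a SpeechSynth-only slice as the gold-standard benchmark, since only its labels are exact by construction:

```python
# Placeholder corpus contents; real entries would be (audio, f0 label) pairs.
corpora = {
    "NSynth": [f"nsynth_{i}" for i in range(100)],
    "PTDB-TUG": [f"ptdb_{i}" for i in range(100)],
    "MDB-STEM-Synth": [f"mdb_{i}" for i in range(100)],
    "SpeechSynth": [f"speechsynth_{i}" for i in range(100)],
}

train_pool, gold_eval = [], []
for name, samples in corpora.items():
    if name == "SpeechSynth":
        cut = len(samples) // 5
        gold_eval.extend(samples[:cut])     # held-out subset with exact labels
        train_pool.extend(samples[cut:])
    else:
        train_pool.extend(samples)          # labels here come from estimators, hence noisier
```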

4. Unified Performance Metric Definition

The SwiftF0 evaluation protocol, built on SpeechSynth, reports a unified metric—harmonic mean (HM) of six complementary pitch evaluation measures:

  • Raw Pitch Accuracy (RPA): fraction of correctly predicted voiced frames, with $|\Delta_t| < 50$ cents.
  • Cents Accuracy (CA): fine-grained accuracy within specified cent bins.
  • Voicing Precision (P): precision for voiced/unvoiced frame classification.
  • Voicing Recall (R): recall rate for voiced frame detection.
  • Octave Accuracy (OA): correct octave classification.
  • Gross Error Accuracy (GEA): robustness to large pitch estimation errors.

The unified metric is given by:

$$\text{HM} = \frac{6}{\sum_{i=1}^{6} (1/c_i)}$$

where $c_i$ denotes the individual component scores. For instance, RPA is defined over all frames as:

$$\text{RPA} = \frac{\text{number of frames with } |\Delta_t| < 50 \text{ cents}}{N}$$

with $\Delta_t = 1200 \cdot \log_2(f_{\text{pred},t} / f_{\text{true},t})$ and $N$ the total number of frames.
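A minimal sketch of the framewise error and two of the component scores, assuming a boolean voicing mask and Hz-valued pitch arrays; the example score values at the end are made up for illustration.

```python
import numpy as np

def cents_error(f_pred, f_true):
    """Per-frame deviation Delta_t = 1200 * log2(f_pred / f_true), in cents."""
    return 1200.0 * np.log2(f_pred / f_true)

def raw_pitch_accuracy(f_pred, f_true, voiced, threshold=50.0):
    """Fraction of voiced frames whose deviation stays within `threshold` cents."""
    delta = cents_error(f_pred[voiced], f_true[voiced])
    return float(np.mean(np.abs(delta) < threshold))

def harmonic_mean(scores):
    """HM of the component scores c_i; a weak score on any axis drags the HM down."""
    scores = np.asarray(scores, dtype=float)
    return len(scores) / float(np.sum(1.0 / scores))

# One weak component (0.55) pulls the HM to ~0.84, below the arithmetic mean of 0.88.
print(harmonic_mean([0.95, 0.92, 0.97, 0.90, 0.99, 0.55]))
```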

This multidimensional evaluation ensures that pitch estimators must perform consistently well across spectral, temporal, voicing, and error dimensions, preventing misleading results due to singular metric optimization.

5. Benchmarking and Comparative Performance

Experimental results show SwiftF0, trained in part on SpeechSynth, achieves substantially higher accuracy and efficiency compared to prior neural pitch estimators. Specifically:

  • At 10 dB SNR, SwiftF0 attains HM = 91.80%, exceeding CREPE by more than 12 points.
  • SwiftF0 model size: 95,842 parameters vs. CREPE’s 22 million.
  • CPU inference speed: 132.6 ms per 5 s of audio versus 5,508.3 ms (SwiftF0 is roughly 42× faster; see the check below).
  • Under noisy conditions, the HM degrades by only 2.3 points relative to clean conditions.
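As a quick consistency check, the ~42× figure follows directly from the two reported timings (arithmetic only, no additional measurements):

$$\frac{5508.3\ \text{ms}}{132.6\ \text{ms}} \approx 41.5 \approx 42\times$$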

These results demonstrate that SpeechSynth’s perfect pitch annotation enables both robust training and highly reliable evaluation for pitch models targeting real-world, low-resource, and noisy deployment scenarios.

6. Applications and Significance

The fundamental property of SpeechSynth—on-demand generation of speech with deterministic pitch annotation—enables several advanced research and engineering tasks:

  • Real-time pitch tracking in embedded and mobile systems, leveraging SwiftF0’s speed and compactness.
  • Accurate prosody modeling, voice conversion, and expressive TTS development, facilitated by error-free pitch ground-truth.
  • Rigorously controlled benchmarking for new pitch estimation architectures across both synthetic and natural domains.
  • Transfer learning and adaptation where synthetic data addressing specific phonotactic or prosodic regimes is required.

A plausible implication is that future benchmarking for both classical and neural pitch extraction methods will increasingly rely on synthetic datasets analogous to SpeechSynth, due to the elimination of annotation uncertainty and the ability to systematically explore prosodic variation.

7. Open Resources and Technical Artifacts

Associated resources include a live demo (https://swift-f0.github.io/), source code (https://github.com/lars76/swift-f0), and a pitch benchmark suite (https://github.com/lars76/pitch-benchmark), enabling public reproducibility and extension of both dataset and model. Technical artifacts include the STFT preprocessing routine summarized as:

$$\text{STFT}(x)[m,k] = \sum_{n=0}^{N-1} x[n + mH] \cdot w[n] \cdot e^{-j 2\pi k n / N}$$

for $N = 1024$, $H = 256$, with Hann window $w[n]$. This underpins the framewise pitch annotation and model input processing.
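A minimal NumPy sketch of that framewise transform, using the stated $N = 1024$ and $H = 256$ with a Hann window; the sample rate and the test tone are assumptions for illustration.

```python
import numpy as np

def stft_frames(x, n_fft=1024, hop=256):
    """Complex STFT matrix of shape (num_frames, n_fft // 2 + 1)."""
    window = np.hanning(n_fft)                                    # w[n]
    num_frames = 1 + (len(x) - n_fft) // hop                      # frame index m
    frames = np.stack([x[m * hop : m * hop + n_fft] * window      # x[n + mH] * w[n]
                       for m in range(num_frames)])
    return np.fft.rfft(frames, axis=-1)                           # sum over n of ... e^{-j 2 pi k n / N}

sr = 16000
tone = np.sin(2 * np.pi * 220.0 * np.arange(sr) / sr)             # 1 s sine at 220 Hz (assumed test signal)
spec = stft_frames(tone)
peak_bin = int(np.abs(spec[10]).argmax())
print(f"frame 10 peaks near {peak_bin * sr / 1024:.1f} Hz")       # ~218.8 Hz; bin resolution is 15.6 Hz
```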

In conclusion, SpeechSynth is a targeted synthetic speech and pitch dataset that addresses longstanding bottlenecks in pitch model development. It eliminates annotation noise, enables deterministic benchmarking, and supports scalable, reproducible progress in both academic and practical speech processing applications.
