SpeechSynth: Accurate Synthetic Speech Pitch
- SpeechSynth is a synthetic speech dataset that provides perfectly accurate, on-demand pitch curves by eliminating the uncertainties of algorithmic estimation.
- It employs a LightSpeech neural TTS model with controlled phonetic and tonal synthesis, enabling deterministic, phoneme-level prosodic annotation.
- The dataset facilitates robust training and benchmarking, supporting high-accuracy models like SwiftF0 through precise, framewise pitch evaluation metrics.
SpeechSynth is a synthetic speech dataset engineered to address critical shortcomings in pitch estimation research by providing perfectly accurate, on-demand ground-truth pitch curves for model development and evaluation (Nieradzik, 25 Aug 2025). The dataset is generated using a phoneme-level text-to-speech (TTS) system and leverages controlled phonetic and tonal specification during synthesis, which enables deterministic and error-free pitch annotation. This capability allows SpeechSynth to bypass the inherent uncertainties and noise in pitch labels typically present in natural speech corpora, which are often derived through algorithmic estimators or noisy laryngograph signals.
1. Dataset Construction and Principles
SpeechSynth employs a LightSpeech neural TTS model, trained on a hybrid corpus comprising 97.48 hours of Mandarin speech from AISHELL-3 and Biaobei. The model is conditioned on a phone inventory of 54 units inclusive of tone markers, supporting fine-grained control over both segmental and suprasegmental properties. Each speech sample in SpeechSynth is synthesized at the phoneme level, and the corresponding pitch contour is exactly specified by the synthesis parameters—this enables the generation of arbitrary phonetic and prosodic scenarios under full supervision.
The construction protocol ensures that each (audio, phonetic sequence) pair is accompanied by its corresponding ground-truth pitch curve by direct synthesis, not estimation or manual annotation. This design eliminates the need for conventional pitch extraction algorithms (such as RAPT or pYIN), whose outputs are prone to voicing errors, octave misclassifications, and tracking artifacts, especially in noisy or expressive speech.
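To make this construction principle concrete, the following toy sketch (Python/NumPy; the function name, sample rate, hop size, and sine-wave "vocoder" are illustrative assumptions, not part of the actual LightSpeech pipeline) shows how the framewise pitch curve falls directly out of the synthesis specification rather than being estimated from the waveform:

```python
import numpy as np

SAMPLE_RATE = 22050   # illustrative; not the paper's stated value
HOP_LENGTH = 256      # frame hop in samples, also illustrative

def synthesize_with_ground_truth(phoneme_f0_hz, phoneme_durations_s):
    """Toy stand-in for the phoneme-level synthesis step: the framewise pitch
    curve is built directly from the per-phoneme F0 targets, so the returned
    labels are ground truth by construction rather than estimates."""
    f0_frames = []
    for f0, dur in zip(phoneme_f0_hz, phoneme_durations_s):
        n_frames = int(round(dur * SAMPLE_RATE / HOP_LENGTH))
        f0_frames.extend([f0] * n_frames)
    f0_curve = np.asarray(f0_frames, dtype=np.float64)

    # Render a simple harmonic waveform that follows the specified contour;
    # the neural vocoder of the real pipeline would replace this step.
    f0_per_sample = np.repeat(f0_curve, HOP_LENGTH)
    phase = 2.0 * np.pi * np.cumsum(f0_per_sample) / SAMPLE_RATE
    audio = 0.3 * np.sin(phase)
    return audio, f0_curve

# Example: four phonemes with hypothetical tone-driven F0 targets.
audio, f0 = synthesize_with_ground_truth(
    phoneme_f0_hz=[220.0, 180.0, 160.0, 200.0],
    phoneme_durations_s=[0.12, 0.10, 0.12, 0.14],
)
print(audio.shape, f0.shape)  # waveform samples and exact framewise pitch labels
```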
2. Ground-Truth Pitch Curve Provision
Unlike natural speech datasets, in which pitch curves must be estimated post hoc, SpeechSynth provides the exact pitch trajectory embedded during synthesis. The LightSpeech engine tracks and outputs the intended curve at every frame, ensuring that each sample’s annotation matches the intended phonetic and tonal specification without error. Therefore, the dataset provides continuous pitch contours for every time frame $t$, enabling framewise evaluation at sample-level precision.
No smoothing or post-processing is required; the pitch annotation represents the true metric used during waveform generation. This property is essential for robust training and benchmarking of monophonic pitch estimation models under both clean and adversarial (noisy) conditions, as any divergence between ground-truth and prediction can be attributed directly to model error.
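One practical consequence concerns evaluation under noise. The sketch below (NumPy only; `mix_at_snr` is a hypothetical helper, not from the released code) mixes noise into a synthesized waveform at a target SNR while the pitch labels stay exact, because the target contour is fixed by the synthesis parameters rather than re-estimated from the corrupted audio:

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` so the mixture has the requested SNR relative to `clean`.
    The SpeechSynth pitch labels remain valid for the noisy mixture because the
    target contour is a property of the synthesis, not of the waveform."""
    noise = noise[: len(clean)]
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_clean / (p_noise * 10.0 ** (snr_db / 10.0)))
    return clean + scale * noise

rng = np.random.default_rng(0)
clean = np.sin(2.0 * np.pi * 200.0 * np.arange(22050) / 22050)  # 1 s, 200 Hz tone
noisy_10db = mix_at_snr(clean, rng.standard_normal(22050), snr_db=10.0)
```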
3. Training and Cross-Domain Evaluation Role
SpeechSynth is integrated into cross-dataset training regimes for models such as SwiftF0, alongside legacy and synthetic corpora including NSynth, PTDB-TUG, MIR-1K, and MDB-STEM-Synth. These combined setups employ 5-fold group cross-validation strategies, where the folds are partitioned according to speaker, instrument, or musical piece identifiers, ensuring robust out-of-group generalization.
For training, SpeechSynth’s perfectly labeled pitch curves augment the pool of noisy and semi-synthetic labels from other datasets, enabling the pitch model to learn both generalizable and precise pitch estimation across a spectrum of real speech, music, and synthetic speech scenarios. For evaluation, a held-out subset of SpeechSynth provides a gold-standard benchmark—differentiating true model performance from the uncertainty introduced by imperfect ground-truth annotation in alternative datasets.
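A minimal sketch of such group-partitioned cross-validation, assuming scikit-learn's `GroupKFold` (the paper does not specify its split tooling, and the data here is synthetic placeholder material), could look as follows:

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Each sample carries a group id: speaker for speech corpora, instrument or
# musical piece for the music corpora.
features = np.random.randn(12, 4)     # placeholder inputs
targets = np.random.rand(12)          # placeholder pitch targets
groups = np.array(["spk1"] * 3 + ["spk2"] * 3 + ["inst1"] * 3 + ["song1"] * 3)

gkf = GroupKFold(n_splits=4)  # the actual protocol uses 5 folds; 4 here fits the tiny toy set
for fold, (train_idx, test_idx) in enumerate(gkf.split(features, targets, groups)):
    # No group identifier appears in both partitions, which enforces
    # out-of-speaker / out-of-instrument / out-of-piece generalization.
    assert set(groups[train_idx]).isdisjoint(groups[test_idx])
    print(f"fold {fold}: test groups = {sorted(set(groups[test_idx]))}")
```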
4. Unified Performance Metric Definition
The SwiftF0 evaluation protocol, built on SpeechSynth, reports a unified metric—harmonic mean (HM) of six complementary pitch evaluation measures:
- Raw Pitch Accuracy (RPA): fraction of voiced frames whose predicted pitch falls within a 50-cent tolerance of the ground truth.
- Cents Accuracy (CA): fine-grained accuracy within specified cent bins.
- Voicing Precision (P): precision for voiced/unvoiced frame classification.
- Voicing Recall (R): recall rate for voiced frame detection.
- Octave Accuracy (OA): correct octave classification.
- Gross Error Accuracy (GEA): robustness to large pitch estimation errors.
The unified metric is given by:
$$\mathrm{HM} = \frac{6}{\sum_{i=1}^{6} \frac{1}{s_i}}$$

where $s_i$ denotes the individual component scores. For instance, RPA is defined over the ground-truth voiced frames as:

$$\mathrm{RPA} = \frac{1}{N_v} \sum_{t=1}^{N_v} \mathbf{1}\!\left[\left|1200 \log_2 \frac{\hat{f}_t}{f_t}\right| \le 50\right]$$

with $\hat{f}_t$ and $f_t$ the predicted and ground-truth pitch at frame $t$, and $N_v$ the total number of voiced frames.
This multidimensional evaluation ensures that pitch estimators must perform consistently well across spectral, temporal, voicing, and error dimensions, preventing misleading results due to singular metric optimization.
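The sketch below (NumPy; the 50-cent tolerance follows the conventional RPA definition and the component scores are made up for illustration, not taken from the paper) shows how the harmonic-mean aggregation and one component, RPA, could be computed:

```python
import numpy as np

def raw_pitch_accuracy(f0_pred, f0_true, tol_cents=50.0):
    """RPA: share of ground-truth voiced frames whose predicted pitch lies
    within `tol_cents` of the reference."""
    voiced = f0_true > 0
    pred = np.maximum(f0_pred[voiced], 1e-9)  # guard against log of zero
    dev_cents = 1200.0 * np.abs(np.log2(pred / f0_true[voiced]))
    return float(np.mean(dev_cents <= tol_cents))

def harmonic_mean(scores):
    """Harmonic-mean aggregation of the six component scores
    (RPA, CA, P, R, OA, GEA); one weak dimension drags the whole metric down."""
    s = np.asarray(scores, dtype=np.float64)
    return float(len(s) / np.sum(1.0 / s))

# Toy component scores (not results from the paper).
print(harmonic_mean([0.95, 0.90, 0.97, 0.93, 0.99, 0.96]))
```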
5. Benchmarking and Comparative Performance
Experimental results show SwiftF0, trained in part on SpeechSynth, achieves substantially higher accuracy and efficiency compared to prior neural pitch estimators. Specifically:
- At 10 dB SNR, SwiftF0 attains HM = 91.80%, exceeding CREPE by more than 12 points.
- SwiftF0 model size: 95,842 parameters vs. CREPE’s 22 million.
- CPU inference speed: 132.6 ms per 5 s audio versus 5,508.3 ms (SwiftF0 is ~42× faster).
- Under noise, degradation is only 2.3 points from clean conditions.
These results demonstrate that SpeechSynth’s perfect pitch annotation enables both robust training and highly reliable evaluation for pitch models targeting real-world, low-resource, and noisy deployment scenarios.
6. Applications and Significance
The fundamental property of SpeechSynth—on-demand generation of speech with deterministic pitch annotation—enables several advanced research and engineering tasks:
- Real-time pitch tracking in embedded and mobile systems, leveraging SwiftF0’s speed and compactness.
- Accurate prosody modeling, voice conversion, and expressive TTS development, facilitated by error-free pitch ground-truth.
- Rigorously controlled benchmarking for new pitch estimation architectures across both synthetic and natural domains.
- Transfer learning and adaptation where synthetic data addressing specific phonotactic or prosodic regimes is required.
A plausible implication is that future benchmarking for both classical and neural pitch extraction methods will increasingly rely on synthetic datasets analogous to SpeechSynth, due to the elimination of annotation uncertainty and the ability to systematically explore prosodic variation.
7. Open Resources and Technical Artifacts
Associated resources include a live demo (https://swift-f0.github.io/), source code (https://github.com/lars76/swift-f0), and a pitch benchmark suite (https://github.com/lars76/pitch-benchmark), enabling public reproducibility and extension of both dataset and model. Technical artifacts include the STFT preprocessing routine summarized as:
$$X(k, m) = \sum_{n=0}^{N-1} w(n)\, x(n + mH)\, e^{-j 2\pi k n / N}$$

for $k = 0, \dots, N-1$ and $m = 0, \dots, M-1$, with Hann window $w(n)$, frame length $N$, and hop size $H$. This underpins the framewise pitch annotation and model input processing.
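A compact reference implementation of this framing-and-FFT step, written with NumPy and illustrative frame parameters (the exact FFT size and hop of the released pipeline are not restated here), could look as follows:

```python
import numpy as np

def stft(x, n_fft=1024, hop=256):
    """Hann-windowed frames of length n_fft advanced by `hop` samples,
    transformed with an FFT, mirroring the formula above."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[m * hop : m * hop + n_fft] * window for m in range(n_frames)])
    return np.fft.rfft(frames, axis=-1)  # shape: (n_frames, n_fft // 2 + 1)

# Example: magnitude spectrogram of a 200 Hz tone sampled at 16 kHz.
t = np.arange(16000) / 16000.0
spectrogram = np.abs(stft(np.sin(2.0 * np.pi * 200.0 * t)))
print(spectrogram.shape)
```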
In conclusion, SpeechSynth is a targeted synthetic speech and pitch dataset that addresses longstanding bottlenecks in pitch model development. It eliminates annotation noise, enables deterministic benchmarking, and supports scalable, reproducible progress in both academic and practical speech processing applications.