SpeechSynth: Synthetic Pitch Dataset

Updated 27 August 2025
  • SpeechSynth is a synthetic speech pitch dataset offering precise, time-aligned annotations, essential for benchmarking and training pitch detection models.
  • It leverages a phoneme-level TTS model (LightSpeech) on Mandarin corpora, ensuring controlled diversity, infinite scalability, and accurate pitch synthesis.
  • It integrates a unified pitch evaluation metric, aggregating six complementary measures to balance precision in tonal and synthetic speech analysis.

SpeechSynth is a synthetic speech dataset introduced as part of SwiftF0, a pitch detection benchmark and model suite designed to address the long-standing problem of unreliable or inexact ground-truth pitch data in speech corpora (Nieradzik, 25 Aug 2025). SpeechSynth enables training and evaluation of pitch estimation algorithms with perfectly known, on-demand pitch, uniquely positioning it as a reference set for evaluating monophonic pitch tracking in speech. The dataset is tightly integrated into the SwiftF0 framework, which also introduces a unified pitch evaluation metric aggregating multiple complementary performance measures, establishing a well-defined basis for both development and benchmarking of state-of-the-art pitch detection models.

1. Motivation and Rationale

Speech pitch detection research is hindered by the lack of datasets with exact pitch annotations. Real-world speech datasets typically provide pitch estimates sourced from laryngograph signals or extracted by algorithmic estimators, both of which suffer accuracy and consistency limitations due to noise, speaker variation, and the inherent ambiguity of pitch in some speech segments. SpeechSynth directly overcomes this limitation by generating synthetic speech whose ground-truth pitch contours are explicitly specified by the text-to-speech (TTS) system, ensuring maximal precision. This capability allows rigorous algorithmic evaluation and controlled experimentation, particularly for models intended for real-time, noisy, or domain-transferable deployment.

2. Synthetic Dataset Generation

SpeechSynth is constructed by leveraging a state-of-the-art phoneme-level TTS model, LightSpeech, trained exclusively on Mandarin speech datasets. Unlike word-level text-to-speech systems, the use of a phoneme-conditioned TTS enables fine resolution over phonetic, tonal, and segmental properties critical for pitch analysis and synthesis.

  • Corpora Used: Training data comprises the AISHELL-3 corpus (85.62 hours) and the Biaobei corpus (11.86 hours), for a total of roughly 97.5 hours of fully transcribed Mandarin speech. The phoneme inventory includes 54 unique phones, covering the Mandarin segmental inventory and its tones.
  • Pitch Generation: During synthesis, pitch contours are assigned and rendered through the LightSpeech model, resulting in speech signals whose pitch targets are fully known across all frames without the uncertainties of physical laryngeal measurements or pitch extractors.
  • Flexibility: The TTS-driven approach enables scaling the dataset arbitrarily, creating new sentences, phoneme sequences, or tones on demand, and controlling speaker identity.

The resulting SpeechSynth dataset thus consists of speech segments with perfectly specified, time-aligned pitch (F0) values for every frame, eliminating ambiguity in voicing state or instantaneous frequency.
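
To make the frame-level labeling concrete, the following minimal sketch builds a frame-aligned ground-truth F0 contour from per-syllable tone and duration specifications. The 10 ms hop, the base frequency, and the stylized tone shapes are illustrative assumptions, not the paper's actual synthesis parameters; in the real pipeline these targets are rendered into speech by LightSpeech.

```python
import numpy as np

FRAME_HOP_S = 0.01  # 10 ms analysis frames (assumed hop size)

# Illustrative Mandarin tone templates as semitone offsets over a syllable;
# real tone contours are more nuanced than these stylized shapes.
TONE_SHAPES = {
    1: lambda n: np.zeros(n),                                 # high level
    2: lambda n: np.linspace(-4.0, 0.0, n),                   # rising
    3: lambda n: np.concatenate([np.linspace(-2.0, -6.0, n // 2),
                                 np.linspace(-6.0, -1.0, n - n // 2)]),  # dipping
    4: lambda n: np.linspace(2.0, -6.0, n),                   # falling
}

def target_f0(syllables, base_hz=200.0):
    """Build a frame-aligned ground-truth F0 contour (Hz) from a list of
    (tone, duration_seconds) syllable specs."""
    frames = []
    for tone, dur in syllables:
        n = int(round(dur / FRAME_HOP_S))
        frames.append(base_hz * 2.0 ** (TONE_SHAPES[tone](n) / 12.0))
    return np.concatenate(frames)

# Three syllables with tones 1, 3, 4 -> one exact F0 value per 10 ms frame.
contour = target_f0([(1, 0.20), (3, 0.25), (4, 0.18)])
print(contour.shape)  # (63,)
```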

3. Role in Training and Evaluation

SpeechSynth serves a dual function in SwiftF0: as a source of unlimited training data with exact labels, and as a testbed for rigorous model evaluation.

  • Training Utility: SwiftF0 exploits SpeechSynth during training to fit the true pitch without corrupted or noisy supervision. Because voice and phonetic content are under programmatic control, training can include rare, difficult, or edge-case phoneme/pitch configurations that are underrepresented in natural data.
  • Evaluation Benchmark: SpeechSynth is designated as a “held-out” evaluation set in the SwiftF0 benchmark, enabling fair, standardized assessment of models, including analysis of accuracy under tonal, non-tonal, or synthetic speech conditions—especially challenging for Mandarin or other tone languages.

Traditional pitch datasets are limited by inexact annotation or small size. SpeechSynth's ability to generate arbitrary volumes of precisely labeled data, as sketched below, provides an effective countermeasure to overfitting and annotation bias.
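
As a sketch of what unlimited, exactly labeled training data can look like in practice, the generator below streams endless (audio, F0) pairs while over-sampling extreme pitch ranges. The synthesize stand-in (a plain sinusoid) and the sampling choices are assumptions for illustration; the actual dataset renders speech through the TTS model.

```python
import numpy as np

SR = 16000
HOP = 160  # samples per 10 ms frame at 16 kHz (assumed)

def synthesize(contour_hz):
    """Stand-in renderer: a sinusoid following the target F0 so the sketch
    runs end to end; the real system renders actual speech via TTS."""
    f = np.repeat(contour_hz, HOP)               # frame rate -> sample rate
    phase = 2.0 * np.pi * np.cumsum(f) / SR
    return 0.5 * np.sin(phase)

def training_stream(seed=0):
    """Endless (audio, frame-aligned F0) pairs with exact labels; base
    pitches are sampled to over-represent edge cases (very low/high voices)."""
    rng = np.random.default_rng(seed)
    while True:
        base = rng.choice([60.0, 110.0, 220.0, 400.0])   # include extremes
        n = int(rng.integers(50, 200))                   # 0.5-2 s of frames
        semitone_walk = np.cumsum(rng.normal(0.0, 0.1, n))
        contour = base * 2.0 ** (semitone_walk / 12.0)
        yield synthesize(contour), contour

audio, f0 = next(training_stream())
print(audio.shape, f0.shape)  # e.g. (N*160,) and (N,)
```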

4. Advantages Over Real Speech Corpora

  • Ground Truth Assurance: Unlike datasets annotated by laryngograph or algorithmic means, SpeechSynth’s pitch annotations are by construction exact.
  • Infinite Scalability: The dataset can be expanded to any size needed for model pretraining, augmentation, or evaluation.
  • Controlled Diversity: Phonetic, prosodic, or speaker diversity can be systematically specified, aiding ablation analysis.
  • Tonality Coverage: The dataset’s Mandarin foundation and phoneme/tonal resolution support benchmarking on tone languages—a critical shortcoming in many Western-centric corpora.

A plausible implication is that, by augmenting limited human-labeled speech corpora with perfect-label synthetic data, pitch models like SwiftF0 can bridge the performance gap in data-scarce or complex settings.

5. Unified Pitch Detection Metric

The SwiftF0 framework also introduces a unified metric for evaluating pitch detection performance, aggregating six complementary metrics into a harmonic mean. This design specifically prevents over-optimization for a single error type (e.g., only voicing, or only octave errors).

The Six Component Metrics

| Metric | Definition | Role |
| --- | --- | --- |
| Raw Pitch Accuracy (RPA) | Fraction of voiced frames within 50 cents of ground truth. | Standard accuracy measure on voiced frames. |
| Cents Accuracy (CA) | Exponentially penalizes the average deviation. | Precision on fine pitch errors. |
| Voicing Precision (P) | Fraction of frames predicted voiced that are truly voiced. | Controls voicing false positives. |
| Voicing Recall (R) | Fraction of truly voiced frames identified as voiced. | Controls voicing false negatives. |
| Octave Accuracy (OA) | Penalizes octave confusions. | Measures octave jumps/miscategorizations. |
| Gross Error Accuracy (GEA) | Penalizes gross errors (>200 cents). | Stability under large deviations. |

Each metric is defined formally; for example, the per-frame pitch error in cents is

$$\Delta_t = 1200 \cdot \log_2\!\left(\frac{f_{\text{pred},t}}{f_{\text{true},t}}\right)$$

The harmonic mean over these six metrics yields an overall score:

$$\mathrm{HM} = \frac{6}{\sum_{i=1}^{6} \frac{1}{c_i}}$$

where $c_i \in \{\mathrm{RPA}, \mathrm{CA}, \mathrm{P}, \mathrm{R}, \mathrm{OA}, \mathrm{GEA}\}$.

The strictness of the harmonic mean enforces robust and balanced performance across all error types (Nieradzik, 25 Aug 2025).
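
A compact reference implementation of the aggregation is sketched below. The 50-cent RPA threshold, the 200-cent gross-error threshold, and the harmonic mean follow the definitions above; the exact functional forms used for CA and OA are assumptions, since only their roles are described here.

```python
import numpy as np

def unified_pitch_score(f_pred, f_true, v_pred, v_true):
    """Aggregate six pitch metrics into HM = 6 / sum_i(1/c_i).

    f_pred/f_true: per-frame F0 in Hz; v_pred/v_true: boolean voicing flags.
    CA's exponential scale and OA's octave-jump test are assumed forms.
    Assumes at least one frame is voiced in both prediction and truth.
    """
    both = v_pred & v_true                                    # voiced in both
    abs_d = np.abs(1200.0 * np.log2(f_pred[both] / f_true[both]))  # |Δ_t| in cents

    rpa = np.mean(abs_d <= 50.0)                   # Raw Pitch Accuracy
    ca = np.exp(-np.mean(abs_d) / 50.0)            # assumed exponential penalty
    p = both.sum() / max(v_pred.sum(), 1)          # Voicing Precision
    r = both.sum() / max(v_true.sum(), 1)          # Voicing Recall
    oa = np.mean(abs_d < 600.0)                    # assumed: no octave jump
    gea = np.mean(abs_d <= 200.0)                  # Gross Error Accuracy

    c = np.array([rpa, ca, p, r, oa, gea])
    return 6.0 / np.sum(1.0 / np.maximum(c, 1e-9))  # guard against zero scores

# Toy check: small Gaussian errors (σ = 20 cents) around a 220 Hz target.
rng = np.random.default_rng(1)
f_true = np.full(100, 220.0)
f_pred = f_true * 2.0 ** (rng.normal(0.0, 20.0, 100) / 1200.0)
v = np.ones(100, dtype=bool)
print(unified_pitch_score(f_pred, f_true, v, v))  # high score for small errors
```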

6. Impact on Pitch Detection Research

SpeechSynth’s integration into the SwiftF0 benchmark represents a methodological advance: it enables benchmarking and algorithmic development decoupled from label noise, which historically limited both cross-paper comparability and the upper bounds of algorithmic accuracy. As a testbed for pitch detection—including evaluation on tonal, non-tonal, and synthetic speech—SpeechSynth supports both absolute and differential performance analysis. The public release of dataset generation protocols, along with a live demo and open-source benchmark code, is enabling broader adoption and transparent progress measurement.

This approach is particularly valuable for new lightweight neural pitch models designed for resource-constrained or real-time environments, where subtle weaknesses in voicing or octave errors might otherwise be masked in standard test sets.

7. Limitations and Future Directions

While SpeechSynth enables exact pitch evaluation in synthetic speech, it is restricted to domains covered by the phoneme inventory and speaker diversity of the underlying TTS model (Mandarin, LightSpeech, and associated phonemes/tones). Extension to other languages or broader prosodic landscapes may further improve coverage and applicability. Additionally, future work could incorporate controlled style, emotion, or cross-lingual data, further extending the generalizability and diversity of synthetic pitch benchmarks.

Summary Table: SpeechSynth in the SwiftF0 Framework

| Attribute | Description |
| --- | --- |
| Generation | Phoneme-level TTS (LightSpeech trained on Mandarin) |
| Label accuracy | Perfect, by construction |
| Phonemes | 54, including all Mandarin tones |
| Training utility | Unlimited fluent speech, edge-case synthesis |
| Evaluation utility | Ground-truth pitch for strict benchmarking |
| Evaluation metric | Unified harmonic mean over six dimensions |
| Limitations | Mandarin focus; bounded by the TTS domain |

SpeechSynth thus establishes a new reference point for robust, balanced, and fine-grained pitch detection analysis, enabling fair comparison and rapid progress in fundamental pitch estimation research (Nieradzik, 25 Aug 2025).

References

  1. Nieradzik (25 Aug 2025). SwiftF0.