SwiftF0: Neural Pitch Detection Model
- SwiftF0 is a neural pitch detection model that uses a compact convolutional architecture to achieve rapid and accurate monophonic F0 estimation.
- It leverages extensive synthetic training data and robust data augmentation to maintain high accuracy, reaching 91.80% on the unified harmonic-mean (HM) metric at 10 dB SNR.
- Its efficient design, with only 95,842 parameters and 42× faster CPU inference than CREPE, enables real-time deployment on resource-constrained devices.
SwiftF0 is a neural pitch detection model designed for fast, accurate monophonic fundamental frequency (F0) estimation in audio streams, particularly under noisy conditions and on resource-constrained devices. It combines a compact convolutional architecture with robust data augmentation and novel synthetic training resources to achieve state-of-the-art performance and efficiency. SwiftF0 defines a unified metric for pitch detection accuracy and is accompanied by an open-source benchmarking suite, facilitating reproducible and comprehensive evaluation.
1. Architecture and Signal Processing
SwiftF0 operates on short segments of audio by computing the short-time Fourier transform (STFT) with a Hann window, using a window length and hop size that give 15.625 Hz frequency resolution and 16 ms time resolution. The resulting spectrogram is compressed logarithmically. Only 132 STFT frequency bins (covering 46.875 Hz to 2093.75 Hz) are retained, discarding roughly 74% of bins to focus computation on plausible monophonic pitch regions and improve noise robustness.
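As a concrete illustration, this front end can be sketched with NumPy/SciPy as below. The 16 kHz sample rate, 1024-sample Hann window, and 256-sample hop are assumptions chosen only to match the stated 15.625 Hz and 16 ms resolutions; they are not confirmed implementation details of SwiftF0.

```python
import numpy as np
from scipy.signal import stft

# Assumed parameters (consistent with 15.625 Hz frequency resolution and a
# 16 ms hop, but not taken from the SwiftF0 implementation).
SR, N_FFT, HOP = 16000, 1024, 256
F_MIN, F_MAX = 46.875, 2093.75  # retained frequency range from the text

def log_spectrogram(audio: np.ndarray) -> np.ndarray:
    """Log-compressed magnitude spectrogram, cropped to the pitch-relevant bins."""
    freqs, _, spec = stft(audio, fs=SR, window="hann",
                          nperseg=N_FFT, noverlap=N_FFT - HOP)
    keep = (freqs >= F_MIN) & (freqs <= F_MAX)   # keeps 132 of 513 bins
    return np.log1p(np.abs(spec[keep]))          # logarithmic compression

features = log_spectrogram(np.random.randn(SR))  # 1 s of placeholder audio
print(features.shape)                            # (132, n_frames)
```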
The network consists of five consecutive 2D convolutional layers, each with stride 1 and “same” padding, so the time-frequency resolution of the feature maps is preserved throughout the stack. The stack confers a receptive field spanning roughly 328 Hz in frequency (about 21 STFT bins), capturing the local harmonics and fundamental structure necessary for robust F0 inference.
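One way to express such a stack in PyTorch is sketched below. The 5×5 kernel size and the channel widths are illustrative assumptions (chosen to be compatible with a roughly 21-bin receptive field and a small parameter budget), not the published SwiftF0 configuration.

```python
import torch
import torch.nn as nn

# Hypothetical channel widths and kernel size for a five-layer, stride-1,
# "same"-padded 2D convolutional stack over (frequency, time) features.
CHANNELS = [1, 16, 32, 32, 16, 8]   # assumption, not the published values
KERNEL = 5                          # five 5x5 layers -> 21-bin receptive field

layers = []
for c_in, c_out in zip(CHANNELS[:-1], CHANNELS[1:]):
    layers += [nn.Conv2d(c_in, c_out, KERNEL, stride=1, padding="same"),
               nn.ReLU()]
conv_stack = nn.Sequential(*layers)

x = torch.randn(1, 1, 132, 63)      # (batch, channel, freq_bins, frames)
print(conv_stack(x).shape)          # torch.Size([1, 8, 132, 63])
```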
The network’s final output is mapped through a 1D convolutional projection from 132 linearly spaced frequency bins to 200 logarithmically spaced pitch bins. Bin centers are computed as $f_k = f_{\min}\,(f_{\max}/f_{\min})^{k/199}$ for $k = 0, \dots, 199$, where $f_{\min} = 46.875$ Hz and $f_{\max} = 2093.75$ Hz; this yields a bin resolution of approximately 33.1 cents.
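The bin-center grid follows directly from the stated range and bin count; the short check below reproduces it and confirms the roughly 33.1-cent spacing (the only assumption is that both endpoints are included in the grid).

```python
import numpy as np

F_MIN, F_MAX, N_BINS = 46.875, 2093.75, 200

# 200 logarithmically spaced pitch-bin centers between F_MIN and F_MAX.
centers = F_MIN * (F_MAX / F_MIN) ** (np.arange(N_BINS) / (N_BINS - 1))

# Spacing between adjacent bins, expressed in cents.
spacing_cents = 1200 * np.log2(centers[1] / centers[0])
print(round(float(spacing_cents), 1))   # -> 33.1
```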
The total parameter count is 95,842—a reduction by two orders of magnitude versus previous neural pitch estimators such as CREPE (∼22M parameters).
2. Training Procedure and Data Augmentation
SwiftF0 is trained on a diverse mix of speech, music, and synthetic datasets. Training leverages a composite objective that combines categorical cross-entropy over pitch bins with an L1 regression loss on log-frequency in cents:
- $\mathcal{L} = \mathcal{L}_{\mathrm{CE}} + \lambda\,\mathcal{L}_{\mathrm{cents}}$ (with weighting coefficient $\lambda$)
- $\hat{c}_t = \sum_k p_t(k)\,c_k$ gives the expected pitch (log-frequency in cents) for each frame $t$, where $p_t(k)$ is the predicted probability of pitch bin $k$ and $c_k$ is that bin’s center; $\mathcal{L}_{\mathrm{cents}}$ is the mean absolute error between $\hat{c}_t$ and the ground-truth pitch in cents.
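A minimal sketch of such a composite loss in PyTorch is shown below; the weighting value `lam` and the probability-weighted decoding of expected pitch are assumptions, since the section does not spell them out.

```python
import torch
import torch.nn.functional as F

def composite_loss(logits, target_bin, target_cents, bin_centers_cents, lam=1.0):
    """Cross-entropy over pitch bins plus L1 regression on expected pitch in cents.

    logits:            (T, 200) per-frame network outputs
    target_bin:        (T,) ground-truth pitch-bin indices
    target_cents:      (T,) ground-truth pitch in cents
    bin_centers_cents: (200,) pitch-bin centers in cents
    lam:               regression weight (assumed value, not from the paper)
    """
    ce = F.cross_entropy(logits, target_bin)
    probs = logits.softmax(dim=-1)
    expected_cents = probs @ bin_centers_cents        # probability-weighted pitch
    l1 = (expected_cents - target_cents).abs().mean()
    return ce + lam * l1
```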
Robustness is further enhanced through extensive on-the-fly augmentation:
- Random amplitude scaling, with the gain sampled in dB
- Background noise injection from CHiME-Home real recordings and synthetic Gaussian noise, mixed according to a sampled mixing coefficient
- Random SNR selection (in dB), with the noise scaled to achieve the target SNR
- Frame selection centered on voiced segments for training stability
This protocol yields high generalization to unseen acoustic domains and noise conditions.
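The following sketch illustrates this style of on-the-fly augmentation; the gain and SNR ranges are placeholders rather than the values SwiftF0 actually samples from.

```python
import numpy as np

rng = np.random.default_rng()

def augment(clean: np.ndarray, noise: np.ndarray,
            gain_db=(-6.0, 6.0), snr_db=(0.0, 30.0)) -> np.ndarray:
    """Random gain, then additive noise scaled to a randomly chosen target SNR."""
    # Random amplitude scaling (gain in dB).
    signal = clean * 10 ** (rng.uniform(*gain_db) / 20)

    # Scale the noise so the mixture reaches the sampled target SNR.
    target_snr = rng.uniform(*snr_db)
    sig_power = np.mean(signal ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(sig_power / (noise_power * 10 ** (target_snr / 10)))
    return signal + scale * noise[: len(signal)]
```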
3. Evaluation Metrics and Unified Harmonic Mean
SwiftF0 introduces a unified evaluation framework based on the harmonic mean (HM) of six complementary pitch detection metrics:
- Raw Pitch Accuracy (RPA): fraction of voiced frames whose pitch estimate lies within 50 cents of the ground truth,
- Cents Accuracy (CA): an accuracy score derived from the mean absolute error in cents,
- Voicing Precision (P),
- Voicing Recall (R),
- Octave Accuracy (OA),
- Gross Error Accuracy (GEA).
The HM aggregates these as:

$$\mathrm{HM} = \frac{6}{\sum_{i=1}^{6} \frac{1}{s_i}}$$

where $s_i$ is the score of the $i$-th metric. This unified measure prevents a method from scoring well by excelling in one aspect while failing in others.
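As a small illustration, the snippet below computes the 50-cent RPA criterion and the harmonic-mean aggregation; the six score values at the end are placeholders, not benchmark results.

```python
import numpy as np

def raw_pitch_accuracy(est_cents: np.ndarray, ref_cents: np.ndarray) -> float:
    """Fraction of voiced frames whose estimate is within 50 cents of ground truth."""
    return float(np.mean(np.abs(est_cents - ref_cents) < 50))

def harmonic_mean(scores) -> float:
    """Harmonic mean of the metric scores; stays low if any single score is low."""
    return len(scores) / sum(1.0 / s for s in scores)

# Placeholder scores for RPA, CA, P, R, OA, GEA (illustrative only).
print(harmonic_mean([0.95, 0.93, 0.96, 0.94, 0.99, 0.97]))
```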
In benchmark evaluations at 10 dB SNR, SwiftF0 reaches 91.80% HM, exceeding CREPE by over 12 percentage points. The performance decline from clean audio is only 2.3 points, suggesting strong resilience to noise.
4. Computational Efficiency and Deployment
The minimal architecture, together with efficient preprocessing, results in significant computational speedup and reduced resource usage:
- On CPU, SwiftF0 is approximately 42× faster than CREPE (e.g., 5 s clip: 132.6 ms vs. 5508.3 ms).
- The small parameter footprint (95,842) enables real-time inference on embedded devices and mobile platforms.
A plausible implication is that SwiftF0’s design enables deployment in latency-sensitive or battery-constrained domains (e.g., real-time speech analysis, mobile music education, live synthesis).
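For readers who want to reproduce this kind of timing comparison, a minimal harness could look like the following; `estimate_pitch` stands in for whichever model is being measured and is purely hypothetical, and the 16 kHz sample rate is an assumption.

```python
import time
import numpy as np

def benchmark(estimate_pitch, audio: np.ndarray, runs: int = 10) -> float:
    """Average wall-clock time (ms) of `estimate_pitch` over `runs` calls."""
    estimate_pitch(audio)                      # warm-up call
    start = time.perf_counter()
    for _ in range(runs):
        estimate_pitch(audio)
    return (time.perf_counter() - start) / runs * 1000

clip = np.random.randn(5 * 16000).astype(np.float32)   # a 5 s clip at 16 kHz
# print(f"{benchmark(my_estimator, clip):.1f} ms")      # my_estimator is hypothetical
```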
5. Ground Truth Generation via SpeechSynth
SwiftF0 addresses the lack of exact ground-truth pitch in speech corpora by introducing SpeechSynth, a synthetic speech dataset with perfect annotation:
- Synthesized via a LightSpeech TTS model (trained on AISHELL-3 and Biaobei Mandarin datasets)
- Phoneme-level control (54 phone classes) yields precise, on-demand pitch trajectories
- Enables robust model training and unambiguous evaluation scenarios, contrasting with typical algorithmic or laryngograph-based annotations
This methodology breaks new ground in neural F0 research, supporting fairer benchmarking and improved training-data quality.
6. Benchmarking Suite and Reproducibility
To facilitate rigorous evaluation, SwiftF0 is deployed alongside an open-source pitch benchmark suite:
- Includes standardized datasets (Vocadito, Bach10-mf0-synth, SpeechSynth test set)
- Provides scripts for unified metric evaluation and routine assessment across methods and tasks
- Source code for SwiftF0 and the benchmark suite is publicly available (https://github.com/lars76/swift-f0 and https://github.com/lars76/pitch-benchmark)
This infrastructure supports reproducible research and comparative analysis in the pitch estimation domain.
7. Practical Significance and Applications
SwiftF0’s operational characteristics align with requirements in speech analysis, music information retrieval, audio synthesis, and similar fields. The fast inference, reliability under acoustic degradation, and unified metric design facilitate real-time deployment and fair cross-task comparison. A live interactive demo is available (https://swift-f0.github.io/), enabling practitioners to inspect behavior under real-world conditions.
Its introduction marks a substantive advance in neural monophonic pitch detection, demonstrating that compact architectures, advanced synthetic training data, and unified evaluation strategies can jointly yield efficiency and robustness previously reserved for large-scale models, as observed in comparative results against contemporaries such as CREPE.
A plausible implication is that future research may further exploit synthetic datasets and lightweight architectural strategies to displace conventional large neural models in signal processing tasks requiring both real-time and precision constraints.