SwiftF0: Efficient Monophonic Pitch Detection
- SwiftF0 is a lightweight neural framework for monophonic pitch detection that integrates a compact CNN, novel synthetic training data, and a unified pitch evaluation metric.
- The framework achieves state-of-the-art performance with a harmonic mean accuracy of 91.80% and a roughly 42× runtime speedup over baselines such as CREPE.
- It reduces STFT input size by 74% via aggressive frequency band selection, enabling real-time processing on resource-constrained systems.
SwiftF0 is a lightweight neural framework for monophonic pitch detection that achieves state-of-the-art accuracy and computational efficiency, with explicit emphasis on real-time processing in noisy environments and resource-constrained systems. The framework integrates a compact convolutional network design, a unified pitch evaluation metric, and novel synthetic training data. Notable features include a parameter count of 95,842, a runtime approximately 42× faster than leading baselines such as CREPE, and robust generalization enabled by diverse and precisely labeled datasets. Benchmarking at 10 dB SNR demonstrates a harmonic mean (HM) accuracy of 91.80%, outperforming other neural and classical signal-based approaches. SwiftF0 is released open source, with code, demo, and evaluation suite available online.
1. Model Architecture and Signal Processing Pipeline
SwiftF0 employs a compact convolutional neural network that processes STFT-based magnitude spectrograms of the audio input. The preprocessing stage uses a Hann window (length 1024, hop size 256) and aggressively restricts the frequency range to 46.875–2093.75 Hz by discarding all but bins 3 to 134 (a 74% reduction in STFT dimension). The remaining time–frequency matrix passes through five 2D convolutional layers with channel progression (8, 16, 32, 64, 1), 5×5 kernels, stride 1, and same padding, yielding a 21×21-bin receptive field. Batch normalization stabilizes the features and ReLU activations introduce nonlinearity.
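To make the input pipeline concrete, the following is a minimal sketch of this preprocessing stage using librosa. The 16 kHz sample rate is implied by the bin-to-frequency mapping (bin width 15.625 Hz); the function name and example are illustrative and not taken from the SwiftF0 codebase.

```python
import numpy as np
import librosa

def swiftf0_input(audio: np.ndarray, sr: int = 16000) -> np.ndarray:
    """Magnitude spectrogram restricted to the SwiftF0 analysis band (sketch).

    With a 1024-point FFT at 16 kHz, the bin width is 15.625 Hz, so bins 3..134
    cover 46.875-2093.75 Hz: 132 of 513 one-sided bins, a ~74% reduction.
    """
    spec = librosa.stft(audio, n_fft=1024, hop_length=256, window="hann")
    mag = np.abs(spec)        # shape (513, n_frames)
    return mag[3:135, :]      # keep bins 3..134 inclusive -> (132, n_frames)

# Example: a 440 Hz test tone at 16 kHz.
features = swiftf0_input(librosa.tone(440.0, sr=16000, duration=1.0), sr=16000)
print(features.shape)         # (132, 63)
```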
A final 1D convolutional projection compresses the frequency dimension to 200 logarithmically spaced pitch bins. The bin centers are defined as

$$f_k = f_{\min}\left(\frac{f_{\max}}{f_{\min}}\right)^{k/(K-1)}, \qquad k = 0, \dots, K-1,$$

where $K = 200$ is the total number of bins, $f_{\min}$ and $f_{\max}$ correspond to 46.875 Hz and 2093.75 Hz respectively, and the bin spacing corresponds to a constant pitch resolution in cents (33.1 cents per bin). The architecture supports both discrete pitch estimation (via softmax over bins) and continuous pitch regression (via expected value).
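A short numerical check of this bin grid follows. Only the formula above is assumed; the log-domain expectation in `expected_pitch` is an assumption about how the continuous readout is formed, not a detail confirmed by the paper.

```python
import numpy as np

F_MIN, F_MAX, K = 46.875, 2093.75, 200

# Logarithmically spaced bin centers between F_MIN and F_MAX (formula above).
k = np.arange(K)
bin_centers_hz = F_MIN * (F_MAX / F_MIN) ** (k / (K - 1))

# Constant spacing in cents between adjacent bins (~33.1 cents).
cents_per_bin = 1200.0 * np.log2(F_MAX / F_MIN) / (K - 1)

# Continuous pitch readout: expected value of the softmax distribution p,
# taken in log-frequency so it respects the geometric bin spacing (assumed).
def expected_pitch(p: np.ndarray) -> float:
    return float(2.0 ** np.dot(p, np.log2(bin_centers_hz)))
```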
2. Training Methodology and Data Augmentation
The training regime targets robust pitch detection under noisy conditions and cross-domain variability. SwiftF0 is trained on a diverse collection of sources:
- NSynth (musical instrument notes)
- PTDB-TUG (speech)
- MIR-1k (singing)
- MDB-STEM-Synth (synthetic music)
- SpeechSynth (synthetic, phoneme-level speech samples with exact pitch curves)
The objective comprises two distinct terms:
- Categorical cross-entropy for pitch classification:
  $$\mathcal{L}_{\mathrm{CE}} = -\sum_{k=1}^{K} y_k \log \hat{y}_k,$$
  where $y_k$ is the one-hot bin label and $\hat{y}_k$ the softmax prediction.
- L1 loss in log-frequency space (regression):
  $$\mathcal{L}_{\mathrm{L1}} = \left|\log_2 f - \log_2 \hat{f}\right|,$$
  where $f$ is the ground-truth pitch and $\hat{f}$ the predicted pitch.
The combined objective is $\mathcal{L} = \mathcal{L}_{\mathrm{CE}} + \lambda\,\mathcal{L}_{\mathrm{L1}}$, with $\lambda$ set to 1.
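A minimal PyTorch sketch of this joint objective is given below, assuming per-frame logits over the 200 bins plus a continuous pitch head; the mean reduction, the base-2 logarithm, and the function name are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def swiftf0_loss(logits: torch.Tensor, target_bins: torch.Tensor,
                 pred_hz: torch.Tensor, target_hz: torch.Tensor,
                 lam: float = 1.0) -> torch.Tensor:
    """Joint classification/regression objective (sketch).

    logits:      (batch, 200) raw scores over the pitch bins
    target_bins: (batch,)     ground-truth bin indices
    pred_hz:     (batch,)     continuous pitch from the regression head
    target_hz:   (batch,)     ground-truth pitch in Hz
    """
    ce = F.cross_entropy(logits, target_bins)                         # categorical cross-entropy
    l1 = (torch.log2(pred_hz) - torch.log2(target_hz)).abs().mean()   # L1 in log-frequency
    return ce + lam * l1
```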
Data augmentation introduces gain variation (–6 to +6 dB), and mixes random proportions of environmental (CHiME-Home) and synthetic Gaussian noise. Target SNR is uniformly sampled between 10 and 30 dB for each training audio segment. The signal is clamped to [–1, +1] post-augmentation. Crucially, augmentation is applied only to voiced segments, protecting pitch information and improving robustness.
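The augmentation recipe can be sketched as follows. The SNR-scaling arithmetic is standard; details such as how environmental (CHiME-Home) and Gaussian noise are blended, and the restriction to voiced segments, are omitted here, and the helper name is illustrative.

```python
import numpy as np

def augment(signal: np.ndarray, noise: np.ndarray,
            rng: np.random.Generator) -> np.ndarray:
    """Gain and additive-noise augmentation at a random target SNR (sketch)."""
    # Random gain in [-6, +6] dB.
    gain_db = rng.uniform(-6.0, 6.0)
    sig = signal * 10.0 ** (gain_db / 20.0)

    # Scale the noise so the mixture hits an SNR drawn uniformly from [10, 30] dB.
    noise = noise[: len(sig)]
    snr_db = rng.uniform(10.0, 30.0)
    sig_power = np.mean(sig ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(sig_power / (noise_power * 10.0 ** (snr_db / 10.0)))

    # Clamp the mixture to [-1, +1] after augmentation.
    return np.clip(sig + scale * noise, -1.0, 1.0)
```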
3. Performance Evaluation and Unified Metric
SwiftF0 introduces a unified evaluation metric—a harmonic mean (HM) of six measures—to comprehensively assess pitch detection:
- Raw Pitch Accuracy (RPA): fraction of voiced frames with error < 50 cents.
- Cents Accuracy (CA): accuracy of the pitch estimate expressed on the cent scale, a continuous counterpart to RPA.
- Voicing Precision (P): correct voiced predictions over all voiced predictions.
- Voicing Recall (R): detected voiced frames over all true voiced frames.
- Octave Accuracy (OA): fraction of voiced frames whose estimated pitch lies in the correct octave.
- Gross Error Accuracy (GEA): fraction of frames whose estimate is free of gross pitch errors (large deviations such as octave jumps).
The HM is computed by
$$\mathrm{HM} = \frac{6}{\frac{1}{\mathrm{RPA}} + \frac{1}{\mathrm{CA}} + \frac{1}{\mathrm{P}} + \frac{1}{\mathrm{R}} + \frac{1}{\mathrm{OA}} + \frac{1}{\mathrm{GEA}}},$$
requiring balanced performance across all aspects. At 10 dB SNR, SwiftF0 achieves HM = 91.80%, outperforming CREPE by more than 12 percentage points. The degradation from clean conditions is only 2.3 points, demonstrating noise resilience.
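The tiny sketch below computes this harmonic mean and illustrates why a single weak component pulls the aggregate down; the scores are made up for illustration.

```python
def harmonic_mean(scores: dict) -> float:
    """Harmonic mean of the six metric components (each assumed in (0, 1])."""
    vals = [scores[k] for k in ("RPA", "CA", "P", "R", "OA", "GEA")]
    return len(vals) / sum(1.0 / v for v in vals)

# Illustrative (made-up) scores: one weak component drags the HM well below
# the arithmetic mean of ~0.89.
print(harmonic_mean({"RPA": 0.95, "CA": 0.93, "P": 0.96,
                     "R": 0.94, "OA": 0.97, "GEA": 0.60}))   # ~0.87
```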
4. Computational Efficiency and Real-Time Deployment
Engineered for deployment on resource-limited hardware, SwiftF0 comprises only 95,842 parameters—vastly fewer than CREPE (≈ 22M parameters). The aggressive frequency band selection and input reduction decrease the STFT size by 74%, minimizing both memory footprint and compute time. Benchmarking shows that SwiftF0 processes 5 seconds of audio in approximately 132.6 ms on a CPU, corresponding to a 42× speedup over CREPE. This suggests suitability for live applications such as mobile audio processing and embedded systems.
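A small worked calculation from the reported figures puts these numbers in perspective; the implied CREPE latency simply assumes the 42× speedup refers to the same 5-second clip on the same CPU.

```python
audio_seconds = 5.0
swiftf0_cpu_seconds = 0.1326                 # reported: ~132.6 ms for 5 s of audio

real_time_factor = audio_seconds / swiftf0_cpu_seconds
print(f"SwiftF0 real-time factor: {real_time_factor:.1f}x")   # ~37.7x faster than real time

crepe_cpu_seconds = swiftf0_cpu_seconds * 42                   # implied by the 42x speedup
print(f"Implied CREPE latency: {crepe_cpu_seconds:.2f} s")     # ~5.57 s, slower than real time
```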
5. SpeechSynth Synthetic Dataset and Ground Truth
SpeechSynth provides perfectly accurate ground-truth pitch, resolving a key limitation in speech corpora where available ground truth is typically produced by indirect algorithmic estimators or sparse laryngograph measurements. It is built using LightSpeech, a phoneme-level TTS system trained on the AISHELL-3 and Biaobei datasets (97.5 h Mandarin data), yielding fine-grained control over pitch contours at the phoneme level (54 phones, multiple tones).
This synthetic generation capability ensures consistent and reliable pitch annotation for both model training and evaluation—particularly important in benchmarking and quantifying pitch estimation performance.
6. Open Source Suite and Benchmarking Infrastructure
All principal elements of the SwiftF0 framework are released as open source. Resources include:
- The SwiftF0 codebase: https://github.com/lars76/swift-f0
- The live demo: https://swift-f0.github.io/
- Benchmarking suite: https://github.com/lars76/pitch-benchmark
These enable reproducibility, comparative evaluation, and extension of methods for academic and professional research in pitch detection.
7. Methodological Innovations and Practical Significance
The combination of spectral input reduction, a joint classification/regression objective, and a harmonic-mean-based unified metric yields a framework that sets new practical and computational baselines for monophonic pitch detection. Empirical results demonstrate both superior accuracy and runtime efficiency in realistic acoustic scenarios, notably at low SNR and across broad acoustic domains.
A plausible implication is that future research will be encouraged to adopt synthetic data for exact ground-truth labeling and to evaluate pitch estimation methods under the unified metric paradigm, avoiding “metric cherry-picking.” This architecture, dataset, and benchmarking methodology collectively advance the state of monophonic pitch estimation.