Cross-domain Neural Pitch and Periodicity Estimation (2301.12258v3)

Published 28 Jan 2023 in eess.AS and cs.SD

Abstract: Pitch is a foundational aspect of our perception of audio signals. Pitch contours are commonly used to analyze speech and music signals and as input features for many audio tasks, including music transcription, singing voice synthesis, and prosody editing. In this paper, we describe a set of techniques for improving the accuracy of widely-used neural pitch and periodicity estimators to achieve state-of-the-art performance on both speech and music. We also introduce a novel entropy-based method for extracting periodicity and per-frame voiced-unvoiced classifications from statistical inference-based pitch estimators (e.g., neural networks), and show how to train a neural pitch estimator to simultaneously handle both speech and music data (i.e., cross-domain estimation) without performance degradation. Our estimator implementations run 11.2x faster than real-time on an Intel i9-9820X 10-core 3.30 GHz CPU (approaching the speed of state-of-the-art DSP-based pitch estimators) or 408x faster than real-time on an NVIDIA GeForce RTX 3090 GPU. We release all of our code and models as Pitch-Estimating Neural Networks (penn), an open-source, pip-installable Python module for training, evaluating, and performing inference with pitch- and periodicity-estimating neural networks. The code for penn is available at https://github.com/interactiveaudiolab/penn.

Summary

The paper introduces refined models (CREPE++, DeepF0++, and FCNF0++) that use finer pitch quantization to reduce quantization error and train on unvoiced frames to improve periodicity estimation.
It uses entropy-based periodicity estimation and layer normalization, and reports better accuracy than earlier neural and DSP-based methods at competitive inference speed.
Models trained on both speech and music generalize across the two domains, which supports real-time audio processing applications.

Cross-Domain Neural Pitch and Periodicity Estimation

This paper presents a comprehensive study of neural network-based pitch and periodicity estimators and improves their performance on both speech and music. The authors propose several methodological improvements that raise accuracy while keeping inference computationally efficient.

Methodological Innovations

The authors build upon three established neural pitch trackers: CREPE, DeepF0, and FCNF0. They apply a series of refinements and call the resulting models CREPE++, DeepF0++, and FCNF0++. These refinements include finer pitch quantization, training on unvoiced audio frames, removing early stopping, and replacing binary cross-entropy loss with categorical cross-entropy loss. Additionally, the authors propose an entropy-based method for periodicity estimation, which derives periodicity from the entropy of the predicted pitch distribution rather than from its maximum value.
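As an illustration of entropy-based periodicity, the sketch below maps a per-frame categorical distribution over pitch bins to a periodicity value as one minus the entropy normalized by its maximum. This normalization, the function names, and the 0.5 voicing threshold are assumptions made for illustration and are not taken from the penn source code.

```python
import math


def softmax(logits):
    """Turn raw network outputs into a categorical distribution."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy_periodicity(probs):
    """Assumed form: 1 - H(p) / log(N), so uniform -> 0 and one-hot -> 1."""
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return 1.0 - entropy / math.log(len(probs))


def max_periodicity(probs):
    """Max-based decoding, shown for comparison: the largest bin probability."""
    return max(probs)


def is_voiced(probs, threshold=0.5):
    """Hypothetical voicing rule; in practice the threshold would be tuned."""
    return entropy_periodicity(probs) >= threshold


# A flat distribution (unvoiced-like) versus a sharply peaked one (voiced-like)
flat = [1.0 / 8] * 8
peaked = softmax([0.0, 0.0, 0.0, 10.0, 0.0, 0.0, 0.0, 0.0])
print(entropy_periodicity(flat), entropy_periodicity(peaked))
```

Unlike the max-based value, the entropy-based value depends on the whole distribution, so a frame whose probability mass is spread over many bins scores low even if one bin is slightly larger than the rest.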

The paper emphasizes high-resolution frequency quantization: the pitch bin width is reduced from the 20-25 cents used in prior work to five cents, which lowers quantization error. In addition, unvoiced audio frames are included during training with their ground-truth pitch bin set to a random bin. This pushes the model toward a high-entropy, near-uniform output distribution in unvoiced regions, which improves periodicity accuracy.
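The effect of bin width on quantization error, and the random-bin target for unvoiced frames, can be sketched as follows. The reference frequency `FMIN_HZ`, the bin count `NUM_BINS`, floor-based binning, and decoding to the bin center are all assumptions chosen for illustration; only the five-cent and 20-cent widths come from the text above.

```python
import math
import random

FMIN_HZ = 31.0       # assumed reference frequency for bin 0
CENTS_PER_BIN = 5.0  # bin width described in the summary
NUM_BINS = 1440      # assumed number of pitch bins (six octaves at 5 cents)


def hz_to_cents(freq_hz):
    """Distance above FMIN_HZ in cents (1200 cents per octave)."""
    return 1200.0 * math.log2(freq_hz / FMIN_HZ)


def hz_to_bin(freq_hz):
    """Assumed floor-based binning, clipped to the valid bin range."""
    index = int(hz_to_cents(freq_hz) // CENTS_PER_BIN)
    return min(max(index, 0), NUM_BINS - 1)


def quantization_error_cents(freq_hz, bin_width):
    """Absolute error when a pitch is decoded as the center of its bin."""
    cents = hz_to_cents(freq_hz)
    center = (math.floor(cents / bin_width) + 0.5) * bin_width
    return abs(center - cents)


def training_target(freq_hz, voiced, rng):
    """Voiced frames use the true bin; unvoiced frames get a random bin."""
    return hz_to_bin(freq_hz) if voiced else rng.randrange(NUM_BINS)


rng = random.Random(0)
for width in (20.0, 5.0):
    worst = max(quantization_error_cents(f, width) for f in range(50, 1000))
    print(f"bin width {width:g} cents: worst observed error {worst:.2f} cents")
print(training_target(440.0, True, rng), training_target(440.0, False, rng))
```

Under these assumptions the worst-case error is half the bin width, so moving from 20-cent to five-cent bins lowers the bound from 10 cents to 2.5 cents.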

Architecturally, replacing batch normalization with layer normalization improves performance. Together, these changes target efficient and accurate pitch and periodicity estimation.
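For readers unfamiliar with the distinction, the following simplified sketch contrasts the two normalizations on a small batch of feature vectors. It omits the learned scale and shift parameters and the running statistics that real implementations use, and it treats each example as a flat vector rather than the convolutional feature maps in the actual models, so it shows only which axis the statistics are computed over.

```python
import math


def _normalize(values, eps=1e-5):
    """Shift to zero mean and scale to (approximately) unit variance."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]


def layer_norm(batch):
    """Statistics per example, computed across that example's features."""
    return [_normalize(row) for row in batch]


def batch_norm(batch):
    """Statistics per feature, computed across the examples in the batch."""
    columns = [_normalize(list(column)) for column in zip(*batch)]
    return [list(row) for row in zip(*columns)]


example = [1.0, 2.0, 4.0]
batch_a = [example, [0.0, 0.0, 1.0]]
batch_b = [example, [9.0, -3.0, 5.0]]

# Layer norm output for `example` is the same in both batches
print(layer_norm(batch_a)[0] == layer_norm(batch_b)[0])
# Batch norm output for `example` changes with the other batch members
print(batch_norm(batch_a)[0] == batch_norm(batch_b)[0])
```

The practical consequence shown here is that layer normalization makes each example's output independent of what else is in the batch.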

Results and Performance

The proposed models are rigorously evaluated against their predecessors and other existing methods such as PYIN and DIO, showcasing significant improvements in both pitch and periodicity estimation across speech and music domains. Notably, FCNF0++ emerges as a model offering a compelling balance of performance and speed, with CPU inference speeds closely rivaling those of state-of-the-art DSP-based methods.

FCNF0++ outperforms CREPE, DeepF0, and FCNF0 in typical metrics such as average error in cents, raw pitch accuracy, and raw chroma accuracy. Importantly, the entropy-based periodicity decoding introduced by the authors demonstrates superior performance over max-based decoding methods, as reflected in F1 scores for binary voicing classification.
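The metrics named above can be sketched as follows. The 50-cent tolerance for raw pitch accuracy and raw chroma accuracy is the conventional choice and is assumed here; the function names are illustrative, and the paper's exact evaluation protocol (for example, which frames are scored) may differ.

```python
import math


def cents_diff(pred_hz, true_hz):
    """Signed pitch difference in cents."""
    return 1200.0 * math.log2(pred_hz / true_hz)


def average_error_cents(pred, true):
    """Mean absolute pitch error in cents over the given frames."""
    return sum(abs(cents_diff(p, t)) for p, t in zip(pred, true)) / len(true)


def raw_pitch_accuracy(pred, true, threshold=50.0):
    """Fraction of frames within `threshold` cents of the true pitch."""
    hits = sum(abs(cents_diff(p, t)) <= threshold for p, t in zip(pred, true))
    return hits / len(true)


def raw_chroma_accuracy(pred, true, threshold=50.0):
    """Like raw pitch accuracy, but errors of whole octaves are forgiven."""
    hits = 0
    for p, t in zip(pred, true):
        diff = cents_diff(p, t)
        folded = diff - 1200.0 * round(diff / 1200.0)  # remove octave offset
        hits += abs(folded) <= threshold
    return hits / len(true)


def voicing_f1(pred_voiced, true_voiced):
    """F1 score for binary voiced/unvoiced decisions."""
    pairs = list(zip(pred_voiced, true_voiced))
    tp = sum(p and t for p, t in pairs)
    fp = sum(p and not t for p, t in pairs)
    fn = sum(t and not p for p, t in pairs)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Second frame is an octave error: it fails pitch accuracy but passes chroma
pred = [440.0, 880.0, 220.0]
true = [440.0, 440.0, 225.0]
print(raw_pitch_accuracy(pred, true), raw_chroma_accuracy(pred, true))
print(voicing_f1([True, True, False, False], [True, False, True, False]))
```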

The paper also highlights the challenges posed by dataset-specific characteristics, which affect cross-domain performance. By training models on both music and speech datasets, the authors achieve notable cross-domain generalization, a vital trait for real-world applicability of such estimators.

Implications and Future Directions

The implications of this research are significant for both practical applications and theoretical advancements in AI, specifically in audio signal processing. The improvements in pitch estimation precision and computational efficiency cater to the growing demands of real-time audio analysis and synthesis applications, from music information retrieval to advanced speech synthesis and transformation tasks.

Looking forward, future work could investigate universal neural pitch estimation, ensure that models generalize across unseen datasets, and extend results to polyphonic pitch estimation. Additionally, examining the relationship between periodicity and signal energy might provide insights for applications where these properties need to be independently controlled.

Overall, the authors highlight a plausible path toward achieving robust, fast, and accurate neural pitch estimation, offering valuable tools and methodologies for both academic research and practical deployment in audio processing technologies.
