- The paper presents a novel deep convolutional neural network that estimates monophonic pitch directly from raw audio waveforms.
- It achieves over 90% raw pitch accuracy even at a strict 10-cent threshold and outperforms DSP-based trackers such as pYIN and SWIPE under most noisy conditions.
- The method uses a six-layer CNN whose 360 sigmoid-activated outputs, one per 20-cent pitch bin, are trained to match a Gaussian centered on the true pitch.
Overview of CREPE: A Convolutional Representation for Pitch Estimation
The paper "Crepe: A Convolutional Representation for Pitch Estimation" by Kim et al. introduces CREPE, a novel pitch tracking algorithm that leverages deep convolutional neural networks (CNNs) for the estimation of fundamental frequencies (f0) in monophonic audio signals. This approach diverges from traditional methods, like pYIN, which rely heavily on digital signal processing (DSP) techniques and heuristics. CREPE is distinguished by its ability to operate directly on the time-domain audio waveform, providing state-of-the-art performance in various benchmarks.
Methodology and Architecture
CREPE's core architecture is a deep CNN applied directly to 1024-sample frames of audio sampled at 16 kHz. Six convolutional layers culminate in a 2048-dimensional latent representation, followed by a dense output layer with sigmoid activations across 360 pitch bins, each covering a 20-cent interval over six octaves from C1 to B7. The model is trained so that the bin activations approximate a Gaussian centered on the true pitch, and the final estimate is decoded as a weighted average of the activations around the peak, giving robust, sub-bin pitch resolution even in the presence of noise. A sketch of this layout follows.
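As a concrete illustration, the Keras sketch below mirrors the shape of the network described above: six 1-D convolutions over a 1024-sample frame, flattened into a 2048-dimensional vector and projected to 360 sigmoid outputs. The per-layer filter counts, kernel widths, and pooling are one reading of the paper's full-capacity model and should be treated as illustrative rather than an exact reproduction; consult the paper or the official repository for the published values.

```python
# Sketch of a CREPE-style network in Keras. Layer capacities and kernel
# widths are illustrative; see the paper/official repo for exact values.
from tensorflow.keras import layers, models

def build_crepe_like(frame_len=1024, n_bins=360):
    x_in = layers.Input(shape=(frame_len, 1))          # 64 ms at 16 kHz
    x = x_in
    # (filters, kernel width, stride) for each of the six conv layers
    for filters, width, stride in [(1024, 512, 4), (128, 64, 1), (128, 64, 1),
                                   (128, 64, 1), (256, 64, 1), (512, 64, 1)]:
        x = layers.Conv1D(filters, width, strides=stride,
                          padding="same", activation="relu")(x)
        x = layers.BatchNormalization()(x)
        x = layers.MaxPool1D(2)(x)                     # halve the time axis
    x = layers.Flatten()(x)                            # 4 frames x 512 = 2048
    y = layers.Dense(n_bins, activation="sigmoid")(x)  # one value per 20-cent bin
    return models.Model(x_in, y)

model = build_crepe_like()
model.summary()  # the flattened latent vector is 2048-dimensional
```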
Training minimizes the binary cross-entropy between the predicted bin activations and Gaussian-blurred ground-truth vectors, optimized with Adam. The key advantage of this data-driven approach is that it generalizes better and degrades more gracefully than heuristic-based methods, particularly in noisy environments.
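To make the loss concrete, here is a minimal numpy sketch of one training target and its binary cross-entropy against a prediction. The 20-cent bin grid starting at C1 and the 25-cent standard deviation of the Gaussian blur follow my reading of the paper, as does the 10 Hz reference used to convert frequency to cents; treat the exact constants as assumptions.

```python
import numpy as np

# 360 pitch bins, 20 cents apart, starting at C1 (~32.70 Hz).
# Cents are measured relative to a 10 Hz reference, as in the paper.
F_REF = 10.0
C1_HZ = 32.70
bin_cents = 1200 * np.log2(C1_HZ / F_REF) + 20 * np.arange(360)

def gaussian_target(f0_hz, std_cents=25.0):
    """Gaussian-blurred ground-truth vector centered on the true pitch."""
    true_cents = 1200 * np.log2(f0_hz / F_REF)
    return np.exp(-(bin_cents - true_cents) ** 2 / (2 * std_cents ** 2))

def bce(y_true, y_pred, eps=1e-7):
    """Binary cross-entropy, summed over the 360 bins."""
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.sum(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

target = gaussian_target(440.0)            # A4
loss = bce(target, np.full(360, 0.5))      # loss for an uninformative prediction
```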
Experimental Validation
CREPE's performance was evaluated on synthesized and re-synthesized datasets, which provide sample-accurate ground-truth annotations. It achieved near-perfect raw pitch accuracy on the RWC-synth dataset and maintained an edge over pYIN and SWIPE on the timbrally more diverse MDB-stem-synth dataset. Notably, CREPE exceeded 90% raw pitch accuracy even under a stringent 10-cent threshold, a level of precision relevant to applications that require exact pitch matching.
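For reference, raw pitch accuracy is simply the fraction of voiced frames whose estimate falls within a given threshold of the ground truth, measured in cents; the standard threshold is 50 cents, with 10 cents as the stricter variant reported above. A minimal sketch (evaluation suites such as mir_eval handle voicing more carefully):

```python
import numpy as np

def raw_pitch_accuracy(ref_hz, est_hz, threshold_cents=50.0):
    """Fraction of voiced frames where the estimate is within
    `threshold_cents` of the reference pitch."""
    voiced = ref_hz > 0                       # frames with a ground-truth pitch
    diff_cents = 1200 * np.abs(np.log2(est_hz[voiced] / ref_hz[voiced]))
    return np.mean(diff_cents <= threshold_cents)

ref = np.array([440.0, 440.0, 0.0, 220.0])
est = np.array([441.0, 452.0, 100.0, 220.5])
print(raw_pitch_accuracy(ref, est, 10.0))   # 2 of 3 voiced frames within 10 cents
```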
The noise-robustness evaluation degraded the audio with several noise types across a range of signal-to-noise ratios (SNRs). CREPE outperformed pYIN and SWIPE under nearly all conditions; the main exception was noise concentrated at low frequencies, such as brown noise, where pYIN remained competitive. These results underscore CREPE's applicability in realistic audio environments.
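This degradation setup can be reproduced in a few lines: scale the noise so the mixture hits a target SNR, then add it to the clean signal. This is a generic sketch of the standard procedure, not the paper's exact pipeline, which pairs specific noise types (e.g., pub and brown noise) with a range of SNRs.

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Add `noise` to `clean` so the mixture has the requested SNR."""
    noise = noise[:len(clean)]
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    # Gain that makes p_clean / (gain^2 * p_noise) equal the target ratio.
    gain = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return clean + gain * noise

rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)        # 1 s of A4 at 16 kHz
noisy = mix_at_snr(clean, rng.standard_normal(16000), snr_db=10)  # white noise, 10 dB
```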
Implications and Future Directions
CREPE marks a significant step forward in pitch estimation by bringing deep learning to a task long dominated by DSP heuristics. Its data-driven design and robustness to diverse timbres and noise conditions make it attractive for applications ranging from live performance analysis to speech processing on consumer devices.
Future research could further improve robustness through techniques such as data augmentation and architectural additions like recurrent connections for temporal smoothness. Given its results, CREPE could well become the benchmark for monophonic pitch estimation, advancing both the theory and the practice of audio signal processing. Its release as an open-source Python module should also encourage widespread adoption and further innovation.
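For instance, assuming the pip-installable `crepe` package exposes the `predict` interface documented in its repository, tracking the pitch of an audio file takes only a few lines:

```python
# Assumes `pip install crepe` and its documented predict() interface.
from scipy.io import wavfile
import crepe

sr, audio = wavfile.read("example.wav")   # a mono WAV file
time, frequency, confidence, activation = crepe.predict(audio, sr, viterbi=True)
# `frequency` holds the f0 track in Hz; `confidence` can gate unvoiced frames.
```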