- The paper introduces SincNet, a CNN whose first convolutional layer is built from parameterized sinc functions, sharply reducing the number of learned parameters while constraining each filter to an interpretable band-pass shape.
- Experiments on TIMIT and Librispeech show classification error rates of 0.85% and 0.96%, respectively, in speaker identification, outperforming standard CNN and MFCC-based baselines.
- The efficient design and interpretability of SincNet suggest its applicability to other audio tasks, including emotion recognition and music analysis.
An Analysis of SincNet for Speaker Recognition
Deep learning methodologies have steadily advanced the field of speaker recognition, increasingly favoring approaches that operate directly on raw audio. In this context, the paper "Speaker Recognition from Raw Waveform with SincNet" presents a convolutional neural network (CNN) architecture, SincNet, which uses parameterized sinc functions to derive a more precise representation of a speaker's voice characteristics from raw waveforms. The architecture stands out for its compactness and for its efficiency in learning filter banks tailored to speaker recognition, offering an alternative to hand-engineered features such as MFCCs and FBANKs.
Key Contributions and Methodology
The main contribution of this work is the introduction of SincNet, which replaces the first convolutional layer of a standard CNN with sinc-based convolutions. This substitution constrains the network's filter shapes so that only the low and high cutoff frequencies of band-pass filters are learned. The design yields a large parameter reduction: with 80 filters each of length 251, SincNet learns only 160 parameters (two cutoffs per filter), whereas a standard convolutional layer of the same size would require 80 × 251 = 20,080 weights.
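To make the parameterization concrete, the sketch below is a minimal PyTorch rendering of a sinc-based first layer, following the construction the paper describes: each band-pass filter is the difference of two windowed sinc low-pass filters, and only the cutoff parameters receive gradients. This is an illustrative reconstruction, not the authors' reference code; the class name and initialization values are assumptions.

```python
import torch
import torch.nn as nn

class SincConv1d(nn.Module):
    """Sinc-based convolution: each filter is a band-pass parameterized
    only by a learnable low cutoff and a learnable bandwidth (in Hz)."""

    def __init__(self, n_filters=80, kernel_size=251, sample_rate=16000):
        super().__init__()
        self.sample_rate = sample_rate
        # Two learnable parameters per filter: low cutoff and bandwidth.
        self.low_hz = nn.Parameter(torch.linspace(30, sample_rate / 2 - 300, n_filters))
        self.band_hz = nn.Parameter(torch.full((n_filters,), 100.0))
        # Fixed (non-learned) pieces: a symmetric time axis in seconds
        # and a Hamming window to smooth the truncated sinc.
        n = torch.arange(-(kernel_size // 2), kernel_size // 2 + 1)
        self.register_buffer("t", n.float() / sample_rate)
        self.register_buffer("window", torch.hamming_window(kernel_size))

    def forward(self, x):  # x: (batch, 1, time)
        # Keep cutoffs positive and below the Nyquist frequency.
        f1 = torch.abs(self.low_hz)
        f2 = torch.clamp(f1 + torch.abs(self.band_hz), max=self.sample_rate / 2)

        def lowpass(f):
            # Ideal low-pass impulse response: 2f * sinc(2ft), using the
            # normalized sinc (torch.sinc(x) = sin(pi*x)/(pi*x)).
            return 2 * f.unsqueeze(1) * torch.sinc(2 * f.unsqueeze(1) * self.t)

        # Band-pass = difference of two low-pass filters, then windowed.
        filters = (lowpass(f2) - lowpass(f1)) * self.window
        return nn.functional.conv1d(x, filters.unsqueeze(1))
```

In a full network, a layer like this would simply take the place of the first standard convolution; the pooling, normalization, and subsequent ordinary convolutional layers of the rest of the model remain unchanged.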
The paper highlights SincNet's performance in speaker identification and verification tasks, particularly under conditions with minimal training data and short test utterances. With training material limited to 12-15 seconds per speaker and test sentences lasting 2-6 seconds, SincNet converged faster and performed better than conventional CNNs and i-vector based systems.
Experimental Evaluation
Empirical evaluations were conducted on the TIMIT and Librispeech datasets. In the speaker identification task, SincNet achieved classification error rates (CER) of 0.85% on TIMIT and 0.96% on Librispeech, outperforming MFCC, FBANK, and standard CNN architectures. The speaker verification experiments reinforced these findings: SincNet achieved an Equal Error Rate (EER) of 0.32%, a notable improvement over the other deep neural network systems evaluated.
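For readers less familiar with the verification metric, the snippet below illustrates how an Equal Error Rate is computed from trial scores: it is the operating point where the false-acceptance rate equals the false-rejection rate. This is a generic illustration on synthetic scores, not the paper's evaluation pipeline; the function name and score distributions are invented for the example.

```python
import numpy as np

def equal_error_rate(genuine_scores, impostor_scores):
    """EER: the threshold at which the false-acceptance rate (impostors
    accepted) equals the false-rejection rate (genuine trials rejected)."""
    thresholds = np.sort(np.concatenate([genuine_scores, impostor_scores]))
    far = np.array([(impostor_scores >= t).mean() for t in thresholds])
    frr = np.array([(genuine_scores < t).mean() for t in thresholds])
    idx = np.argmin(np.abs(far - frr))  # point where the two rates cross
    return (far[idx] + frr[idx]) / 2.0

# Synthetic demo: lower EER means better verification performance.
rng = np.random.default_rng(0)
genuine = rng.normal(1.0, 0.5, 1000)   # same-speaker trial scores
impostor = rng.normal(0.0, 0.5, 1000)  # different-speaker trial scores
print(f"EER: {equal_error_rate(genuine, impostor):.2%}")
```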
Implications and Future Directions
SincNet's ability to learn interpretable, perceptually meaningful filters suggests potential applicability beyond speaker recognition to other domains requiring robust time-series processing, such as emotion recognition and music analysis. Its architectural efficiency, especially the reduced parameter footprint, also suits environments with constrained computational resources or limited training data.
Theoretically, the integration of domain knowledge in the form of sinc functions indicates a promising direction in neural network design — merging established signal processing techniques with data-driven learning. Future research should explore the adaptability of SincNet to diverse audio analytics tasks and examine its performance across more extensive and heterogeneous datasets like VoxCeleb.
In conclusion, this paper's findings advance the discourse on efficient and effective network architectures for audio-based task learning, advocating for continued exploration of structured convolutional layers that incorporate domain knowledge into learning processes. This work takes a significant step in redefining how machine learning models engage with raw audio data, setting a precedent for future exploration in the field.