Glottal Closure and Opening Instant Detection from Speech Signals (2001.00841v1)

Published 28 Dec 2019 in cs.SD, cs.CL, and eess.AS

Abstract: This paper proposes a new procedure to detect Glottal Closure and Opening Instants (GCIs and GOIs) directly from speech waveforms. The procedure is divided into two successive steps. First a mean-based signal is computed, and intervals where speech events are expected to occur are extracted from it. Secondly, at each interval a precise position of the speech event is assigned by locating a discontinuity in the Linear Prediction residual. The proposed method is compared to the DYPSA algorithm on the CMU ARCTIC database. A significant improvement as well as a better noise robustness are reported. Besides, results of GOI identification accuracy are promising for the glottal source characterization.

Summary

The paper presents a novel two-step method for detecting Glottal Closure Instants and Glottal Opening Instants directly from speech waveforms using mean-based signals and LP residual analysis.
Evaluation on the CMU ARCTIC database showed higher GCI detection accuracy than the DYPSA algorithm, better noise robustness down to 0 dB SNR, and promising GOI identification.
These improvements have significant implications for enhancing speech synthesis quality, speaker identification reliability, and other applications dependent on precise glottal event timing.

Detecting Glottal Events in Speech Signals: A Method for GCI and GOI Identification

The paper "Glottal Closure and Opening Instant Detection from Speech Signals" by Thomas Drugman and Thierry Dutoit presents a novel approach to detecting Glottal Closure Instants (GCIs) and Glottal Opening Instants (GOIs) directly from speech waveforms. This research is pertinent to numerous applications across speech processing, such as speech synthesis, voice transformation, and speaker identification, where accurate identification of these glottal events is crucial.

Methodological Overview

The authors introduce a two-step procedure and report improvements over the established DYPSA algorithm. First, a mean-based signal is computed by averaging the speech over a sliding window, and the intervals in which glottal events are expected to occur are extracted from it. Second, a precise position is assigned within each interval by locating the largest discontinuity in the Linear Prediction (LP) residual, which indicates a significant excitation in the glottal signal.
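The two steps can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: the Blackman window shape, the `half_width` parameter, the autocorrelation-based LP analysis, and the use of the largest-magnitude residual sample as a proxy for the "largest discontinuity" are all assumptions made for illustration, and the interval bounds are taken as given rather than derived from the mean-based signal.

```python
import math

def blackman(length):
    """Blackman window (assumed window shape; the summary does not name one)."""
    if length == 1:
        return [1.0]
    return [0.42 - 0.5 * math.cos(2 * math.pi * n / (length - 1))
            + 0.08 * math.cos(4 * math.pi * n / (length - 1))
            for n in range(length)]

def mean_based_signal(speech, half_width):
    """Step 1: windowed mean of the samples around each position.

    Samples outside the signal are treated as zero.
    """
    w = blackman(2 * half_width + 1)
    out = []
    for n in range(len(speech)):
        acc = 0.0
        for m in range(-half_width, half_width + 1):
            if 0 <= n + m < len(speech):
                acc += w[m + half_width] * speech[n + m]
        out.append(acc / (2 * half_width + 1))
    return out

def lp_residual(speech, order):
    """LP residual via the autocorrelation method and Levinson-Durbin."""
    n = len(speech)
    r = [sum(speech[i] * speech[i + k] for i in range(n - k))
         for k in range(order + 1)]
    a = [1.0] + [0.0] * order          # prediction-error filter, a[0] = 1
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0.0:                 # silent or degenerate input
            break
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                 # reflection coefficient
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= 1.0 - k * k
    # Inverse filtering: e[t] = sum_j a[j] * s[t - j]
    return [sum(a[j] * speech[t - j] for j in range(order + 1) if t - j >= 0)
            for t in range(n)]

def refine_in_interval(residual, start, stop):
    """Step 2: index of the largest-magnitude residual sample in [start, stop)."""
    return max(range(start, stop), key=lambda i: abs(residual[i]))

# Toy signal: an impulse every 50 samples through a one-pole filter.
toy = []
previous = 0.0
for n in range(200):
    previous = 0.9 * previous + (1.0 if n % 50 == 0 else 0.0)
    toy.append(previous)

smoothed = mean_based_signal(toy, half_width=40)
residual = lp_residual(toy, order=1)
print(refine_in_interval(residual, 40, 60))
```

On real speech the LP order and window length would have to be chosen for the sampling rate and the speaker's pitch; the values above are only for the toy signal.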

Numerical Results and Comparative Analysis

The effectiveness of the proposed method was evaluated on the CMU ARCTIC database, containing recordings from three speakers with different accents and genders. Detection performance was compared against the DYPSA algorithm using several metrics, including Identification Rate (IDR), Miss Rate (MR), and False Alarm Rate (FAR). The proposed method achieved a higher IDR and a lower MR than DYPSA for all speakers. GOI identification was also promising, although slightly less accurate than GCI identification.
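The three rates are commonly defined per larynx cycle: a cycle counts as identified when exactly one instant is detected in it, as a miss when none is detected, and as a false alarm when more than one is detected. The sketch below follows that convention; the function name `gci_rates` and the choice of midpoints between consecutive reference instants as cycle boundaries are assumptions for illustration, not details taken from the paper.

```python
def gci_rates(reference, detected):
    """Return (IDR, MR, FAR) as fractions of the reference larynx cycles.

    `reference` and `detected` are sample indices; `reference` must be
    non-empty. Cycle boundaries are assumed to lie halfway between
    consecutive reference instants.
    """
    reference = sorted(reference)
    identified = missed = false_alarms = 0
    for i, ref in enumerate(reference):
        lower = (reference[i - 1] + ref) / 2 if i > 0 else float("-inf")
        upper = ((ref + reference[i + 1]) / 2
                 if i + 1 < len(reference) else float("inf"))
        count = sum(1 for d in detected if lower <= d < upper)
        if count == 1:
            identified += 1
        elif count == 0:
            missed += 1
        else:
            false_alarms += 1
    total = len(reference)
    return identified / total, missed / total, false_alarms / total

# Four reference cycles: one hit, one double detection, one miss, one hit.
idr, mr, far = gci_rates([100, 200, 300, 400], [102, 198, 205, 401])
print(idr, mr, far)  # 0.5 0.25 0.25
```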

Accuracy was further quantified in terms of timing error distributions. For GCIs, the method attained higher identification accuracy and a greater proportion of correct detections within a narrow timing error range compared to DYPSA. However, GOIs exhibited larger timing errors, reflecting the inherent complexity in detecting these instants, given their weaker and more dispersed excitation characteristics.
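A timing error distribution of this kind can be computed as in the following sketch. It assumes that each reference instant has already been matched to exactly one detection, pairs each reference with its nearest detection, and reports the share of errors inside a tolerance. The 16 kHz sampling rate and the 0.25 ms tolerance in the usage lines are illustrative values, not figures quoted from the summary.

```python
def timing_errors_ms(reference, detected, sample_rate):
    """Signed error in milliseconds from each reference instant to its
    nearest detected instant (both given as sample indices)."""
    errors = []
    for ref in reference:
        nearest = min(detected, key=lambda d: abs(d - ref))
        errors.append(1000.0 * (nearest - ref) / sample_rate)
    return errors

def fraction_within(errors_ms, tolerance_ms):
    """Share of timing errors whose magnitude is at most the tolerance."""
    return sum(1 for e in errors_ms if abs(e) <= tolerance_ms) / len(errors_ms)

errors = timing_errors_ms([1600, 3200], [1602, 3210], sample_rate=16000)
print(errors)                         # [0.125, 0.625]
print(fraction_within(errors, 0.25))  # 0.5
```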

Additionally, the robustness of the proposed method was assessed under varying levels of additive Gaussian and babble noise. The method continued to identify GCIs and GOIs effectively down to 0 dB SNR, whereas DYPSA's performance degraded significantly in noisier conditions.
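A noise-robustness test of this kind requires mixing noise into clean speech at a chosen SNR. The following is a minimal sketch of that mixing step, assuming the SNR is defined as the ratio of average signal power to average noise power over the whole utterance; the helper names are hypothetical, and a sine wave stands in for speech.

```python
import math
import random

def mean_power(samples):
    """Average power of a sequence of samples."""
    return sum(v * v for v in samples) / len(samples)

def add_noise_at_snr(speech, noise, snr_db):
    """Scale `noise` so the mixture has the requested SNR in dB, then add it.

    `noise` may be Gaussian, babble, or any other recording that is at
    least as long as `speech` and not all zeros.
    """
    target_noise_power = mean_power(speech) / (10.0 ** (snr_db / 10.0))
    gain = math.sqrt(target_noise_power / mean_power(noise))
    return [s + gain * n for s, n in zip(speech, noise)]

rng = random.Random(0)
clean = [math.sin(2 * math.pi * 100 * t / 16000) for t in range(1600)]
gaussian = [rng.gauss(0.0, 1.0) for _ in clean]
noisy = add_noise_at_snr(clean, gaussian, snr_db=0.0)
```

At 0 dB the scaled noise has the same average power as the speech, which is the harshest condition mentioned above.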

Implications and Future Directions

The methodological advances presented in this paper have significant implications for speech processing technologies. The improved accuracy and noise robustness contribute to enhanced performance in applications that rely on precise glottal event detection, leading to better quality synthesized speech and more reliable speaker identification systems.

Going forward, the authors acknowledge room for refining GOI detection, suggesting potential future work in examining open quotient trajectories for better characterization of the glottal source. Furthermore, this method could be integrated with other source-filter deconvolution approaches to improve the analysis of the speech production mechanism.

Overall, the paper contributes a GCI and GOI detection method whose reported accuracy and noise robustness are relevant to speech synthesis, coding, and transformation technologies.