Voice Recognition Algorithms using Mel Frequency Cepstral Coefficient (MFCC) and Dynamic Time Warping (DTW) Techniques (1003.4083v1)

Published 22 Mar 2010 in cs.MM

Abstract: Digital processing of the speech signal and the voice recognition algorithm are very important for fast and accurate automatic voice recognition technology. The voice is a signal of infinite information. Direct analysis and synthesis of the complex voice signal are difficult due to the large amount of information contained in the signal. Therefore digital signal processes such as Feature Extraction and Feature Matching are introduced to represent the voice signal. Several methods such as Linear Predictive Coding (LPC), Hidden Markov Models (HMM) and Artificial Neural Networks (ANN) are evaluated with a view to identifying a straightforward and effective method for voice signals. The extraction and matching process is implemented right after the pre-processing or filtering of the signal is performed. Mel Frequency Cepstral Coefficients (MFCCs), a non-parametric method for modelling the human auditory perception system, are utilized as the extraction technique. The non-linear sequence alignment known as Dynamic Time Warping (DTW), introduced by Sakoe and Chiba, is used as the feature matching technique. Since voice signals tend to have different temporal rates, the alignment is important to produce better performance. This paper presents the viability of MFCC to extract features and DTW to compare the test patterns.

Citations (1,061)

Summary

  • The paper demonstrates that combining MFCC feature extraction with DTW matching effectively recognizes voice commands across varied speech rates.
  • It details a systematic methodology using MATLAB, incorporating pre-emphasis, framing, FFT, and DCT to enhance feature clarity.
  • Experiments show that the MFCC-DTW approach minimizes recognition errors amid speaker variability, highlighting its practical applications in voice-activated systems.

Voice Recognition Algorithms using Mel Frequency Cepstral Coefficient (MFCC) and Dynamic Time Warping (DTW) Techniques

The paper under review presents an evaluation of voice recognition algorithms that leverage Mel Frequency Cepstral Coefficients (MFCCs) for feature extraction and Dynamic Time Warping (DTW) for feature matching. This combination aims to offer a straightforward yet effective approach to voice signal processing, circumventing the complexity inherent in direct analysis of voice signals due to their rich information content.

Key Techniques

Feature Extraction with MFCC

The role of MFCCs in voice recognition is fundamental. MFCCs exploit the characteristics of the human auditory system, whose frequency resolution is approximately linear below 1 kHz and degrades logarithmically above it. Accordingly, the method divides the frequency spectrum into linearly spaced filters below 1 kHz and logarithmically spaced filters above it. This arrangement follows the Mel frequency scale, which effectively captures phonetic characteristics.
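As a concrete illustration, the Hz-to-Mel mapping commonly used for this filter spacing can be sketched as follows (this uses O'Shaughnessy's formula, a standard choice; the paper does not state which variant it adopts):

```python
import math

def hz_to_mel(f_hz):
    # O'Shaughnessy's formula: approximately linear below ~1 kHz,
    # logarithmically compressive above it
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    # Inverse mapping, used when placing filter edges back on the Hz axis
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)
```

Under this formula, 1000 Hz maps to roughly 1000 Mel, while 8000 Hz maps to only about 2840 Mel, reflecting the compressed resolution at high frequencies.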

The MFCC algorithm performs several steps:

  1. Pre-emphasis: Boosts the energy of higher frequencies by passing the signal through a first-order high-pass filter.
  2. Framing: Segments the speech signal into short frames, typically 20 to 40 ms long.
  3. Windowing: Applies a Hamming window to each frame to minimize signal discontinuities at the frame boundaries.
  4. Fast Fourier Transform (FFT): Converts each time-domain frame to the frequency domain.
  5. Mel Filter Bank Processing: Applies triangular filters spaced according to the Mel scale to derive the Mel spectrum.
  6. Discrete Cosine Transform (DCT): Converts the log Mel spectrum into the cepstral domain, yielding the MFCC features.
  7. Delta and Delta-Delta Features: Appends the first- and second-order changes in the cepstral features over time to improve robustness against variations.
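Steps 1 through 6 can be sketched in NumPy as follows. This is a minimal illustration, not the paper's MATLAB implementation: the 0.97 pre-emphasis coefficient, frame sizes, FFT length, and filter counts are common defaults assumed here, and the delta features of step 7 are omitted for brevity.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    # Triangular filters with centers evenly spaced on the Mel scale
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):           # rising edge
            fb[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):          # falling edge
            fb[i - 1, k] = (right - k) / max(right - center, 1)
    return fb

def mfcc(signal, sr=16000, frame_len=400, hop=160, n_filters=26, n_ceps=13):
    # 1. Pre-emphasis: first-order high-pass filter boosts high frequencies
    sig = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    # 2-3. Framing and Hamming windowing
    n_frames = 1 + (len(sig) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = sig[idx] * np.hamming(frame_len)
    # 4. FFT power spectrum of each frame
    n_fft = 512
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # 5. Mel filter bank, then log energies
    fb = mel_filterbank(n_filters, n_fft, sr)
    log_mel = np.log(power @ fb.T + 1e-10)
    # 6. DCT-II of the log Mel energies yields the cepstral coefficients
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), (2 * n + 1) / (2.0 * n_filters)))
    return log_mel @ dct.T
```

For a one-second signal at 16 kHz with these defaults, the result is a matrix of 98 frames by 13 coefficients, one cepstral vector per 10 ms hop.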

Feature Matching with DTW

DTW is employed to measure similarities between two temporal sequences which may vary in speed. This technique finds the optimal alignment by non-linearly warping the time axis of the sequences. The process includes constructing a matrix to compute distances between sequence elements and then determining the minimum warping path to align them effectively.

The DTW algorithm is advantageous for handling variations in speaking rates and other minor discrepancies between test and reference signals. It ensures that even if two time series are not aligned temporally, their inherent similarity can still be measured accurately.
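The alignment procedure can be sketched as the following dynamic-programming recurrence with the basic symmetric step pattern. Note that the Sakoe and Chiba formulation the paper follows also includes slope constraints and a warping window, which are not reproduced in this minimal sketch.

```python
import numpy as np

def dtw_distance(a, b):
    """DTW distance between two sequences of feature vectors (one row per frame)."""
    n, m = len(a), len(b)
    # Local cost matrix: Euclidean distance between every pair of frames
    cost = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    # Accumulated cost with the basic step pattern (match, insertion, deletion)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i, j] = cost[i - 1, j - 1] + min(
                D[i - 1, j - 1],  # match: advance both sequences
                D[i - 1, j],      # advance the first sequence only
                D[i, j - 1],      # advance the second sequence only
            )
    return D[n, m]
```

Because the warping path may repeat indices, a sequence and a time-stretched copy of it (e.g. with some frames duplicated) have a DTW distance of zero even though their Euclidean frame-by-frame distance would not be defined at all.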

Methodology

The paper's methodology comprised a structured approach, leveraging MATLAB to implement the MFCC and DTW algorithms. The dataset included voice recordings from both male and female speakers uttering phrases related to television control commands. The recognition experiments involved two phases: training, in which voice features were captured and stored as reference templates, and testing, in which new input was matched against the stored features using DTW.
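The testing phase described above amounts to nearest-template classification under DTW distance. A hypothetical sketch follows; the labels, feature arrays, and function names are illustrative and not taken from the paper, and the internal DTW is deliberately minimal.

```python
import numpy as np

def dtw(a, b):
    # Minimal DTW on sequences of frame vectors (one row per frame)
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = c + min(D[i - 1, j - 1], D[i - 1, j], D[i, j - 1])
    return D[n, m]

def recognize(test_feats, templates):
    """Return the stored command label whose feature template is closest
    to the test utterance under DTW distance."""
    return min(templates, key=lambda label: dtw(test_feats, templates[label]))
```

In this scheme each training utterance contributes one template, and recognition cost grows linearly with the number of stored commands, which is one reason template matching suits small command vocabularies like TV control phrases.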

Results

The experiments demonstrated that the MFCC-DTW combination could effectively recognize voice commands despite variations in utterance speeds and minor differences in the phonetic components of words. The DTW algorithm successfully aligned the temporal sequences of test inputs to reference templates, minimizing recognition errors and confirming the technique's appropriateness for this application.

Implications and Future Work

The findings suggest the MFCC-DTW approach is robust in typical voice recognition scenarios, particularly where variability in speech patterns exists. The practical utility of these methods is notable in environments requiring quick and accurate voice command recognition, such as voice-activated control systems.

Future research directions could explore integrating other models such as Linear Predictive Coding (LPC), Hidden Markov Models (HMM), and Artificial Neural Networks (ANN) to enhance recognition rates or improve computational efficiency. Investigating adaptations of these techniques in noisy environments or across various languages could also yield significant insights.

Conclusion

This paper demonstrates the viability of using MFCC for feature extraction and DTW for feature matching in voice recognition systems. The proposed methods exhibit strong performance in recognizing voice commands accurately and efficiently, making notable contributions to the field of digital speech processing and recognition.