Voice Signal Processing for Machine Learning. The Case of Speaker Isolation

Published 29 Mar 2024 in cs.SD, cs.LG, and eess.AS | (2403.20202v1)

Abstract: The widespread use of automated voice assistants along with other recent technological developments have increased the demand for applications that process audio signals and human voice in particular. Voice recognition tasks are typically performed using artificial intelligence and machine learning models. Even though end-to-end models exist, properly pre-processing the signal can greatly reduce the complexity of the task and allow it to be solved with a simpler ML model and fewer computational resources. However, ML engineers who work on such tasks might not have a background in signal processing which is an entirely different area of expertise. The objective of this work is to provide a concise comparative analysis of Fourier and Wavelet transforms that are most commonly used as signal decomposition methods for audio processing tasks. Metrics for evaluating speech intelligibility are also discussed, namely Scale-Invariant Signal-to-Distortion Ratio (SI-SDR), Perceptual Evaluation of Speech Quality (PESQ), and Short-Time Objective Intelligibility (STOI). The level of detail in the exposition is meant to be sufficient for an ML engineer to make informed decisions when choosing, fine-tuning, and evaluating a decomposition method for a specific ML model. The exposition contains mathematical definitions of the relevant concepts accompanied with intuitive non-mathematical explanations in order to make the text more accessible to engineers without deep expertise in signal processing. Formal mathematical definitions and proofs of theorems are intentionally omitted in order to keep the text concise.