Overview of "Deep Learning for Audio Signal Processing"
The paper "Deep Learning for Audio Signal Processing" by Hendrik Purwins, Bo Li, Tuomas Virtanen, Jan Schlüter, Shuo-yiin Chang, and Tara Sainath is a comprehensive survey of deep learning methods for audio. It examines state-of-the-art techniques across the major audio domains, including speech, music, and environmental sounds, and covers key methodologies, applications, and open research directions.
Key Themes and Methodologies
The paper emphasizes the diversity of tasks within audio signal processing, categorizing them by the type of target to be predicted: sequence classification, sequence labeling, and sequence transduction. The authors discuss common audio representations, such as log-mel spectrograms and raw waveforms, and outline the prevalent deep learning models, including CNNs, RNNs, and GANs.
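To make the dominant input representation concrete, the following is a minimal numpy-only sketch of computing a log-mel spectrogram from a raw waveform (windowed framing, power spectrum, triangular mel filterbank, log compression). The specific parameter values (16 kHz sample rate, 512-point FFT, 40 mel bands) are illustrative defaults, not values prescribed by the paper; production code would typically use a library such as librosa or torchaudio.

```python
import numpy as np

def hz_to_mel(f):
    """Convert frequency in Hz to the mel scale (HTK formula)."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def log_mel_spectrogram(signal, sr=16000, n_fft=512, hop=256, n_mels=40):
    """Compute a log-mel spectrogram from a mono waveform."""
    # Frame the signal and apply a Hann window.
    n_frames = 1 + (len(signal) - n_fft) // hop
    window = np.hanning(n_fft)
    frames = np.stack([signal[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    # Power spectrum of each frame: shape (n_frames, n_fft // 2 + 1).
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    # Triangular mel filterbank: n_mels filters spaced evenly on the mel scale.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            fbank[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fbank[m - 1, k] = (right - k) / max(right - center, 1)
    # Apply the filterbank and compress the dynamic range with a log.
    return np.log(power @ fbank.T + 1e-10)
```

The log compression mirrors the roughly logarithmic loudness perception of the human ear, which is one reason the paper cites for the representation's continued popularity.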
Model Discussion
- Convolutional Neural Networks (CNNs): Especially effective for modeling local time-frequency patterns in spectrogram-like inputs.
- Recurrent Neural Networks (RNNs): Well-suited for tasks requiring context beyond fixed windows, with Long Short-Term Memory (LSTM) networks being particularly effective.
- Generative Adversarial Networks (GANs): Explored for tasks such as source separation and enhancement, albeit less frequently than CNNs and RNNs.
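The intuition behind the first bullet, that CNNs model local time-frequency patterns, can be sketched with a naive 2D convolution. The toy spectrogram and the hand-crafted "onset" kernel below are purely illustrative (a learned CNN would discover such kernels from data); the kernel responds to an energy increase along the time axis, a pattern relevant to onset detection.

```python
import numpy as np

def conv2d_valid(spec, kernel):
    """Naive 'valid'-mode 2D convolution (cross-correlation) of a
    spectrogram patch with a small kernel."""
    H, W = spec.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(spec[i:i + kh, j:j + kw] * kernel)
    return out

# Toy spectrogram: 5 frequency bins x 10 time frames, silent until frame 6.
spec = np.zeros((5, 10))
spec[:, 6:] = 1.0

# A [-1, 1] kernel along time fires where energy suddenly increases.
onset_kernel = np.array([[-1.0, 1.0]])
response = conv2d_valid(spec, onset_kernel)
```

In each frequency bin, the response peaks exactly at the frame where energy first appears, which is the kind of localized pattern a CNN filter learns.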
Numerical Results and Applications
The paper illustrates several applications of deep learning in audio:
- Automatic Speech Recognition (ASR): Deep neural networks have driven significant reductions in word error rate over earlier approaches.
- Music Information Retrieval (MIR): Successful application across tasks such as chord recognition and onset detection.
- Environmental Sound Analysis: Deep learning has improved performance, although the scarcity of large labeled datasets remains a challenge.
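The word error rate (WER) mentioned for ASR is a standard metric: the word-level Levenshtein edit distance between a reference transcript and the system's hypothesis, normalized by the reference length. A minimal sketch (the function name and dynamic-programming layout are this summary's own, not taken from the paper):

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference length,
    computed as a word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j]: edit distance between the first i reference words
    # and the first j hypothesis words.
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution or match
    return dp[-1][-1] / len(ref)
```

For example, one substituted word in a three-word reference gives a WER of 1/3.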
Theoretical and Practical Implications
The authors identify several practical implications of the current research:
- Feature Representation: Although raw waveforms in principle preserve all of the signal's information, log-mel spectrograms remain the dominant input due to their compactness and ease of use.
- Data Requirements: Deep models require significant amounts of labeled data, presenting a challenge across all audio domains.
- Model Complexity: Real-time applications necessitate computationally efficient models, especially for devices like mobile phones.
Future Directions
The paper concludes with a discussion on challenges that remain in the use of deep learning for audio signal processing:
- Cross-Domain Transfer Learning: The potential of models trained on large datasets as a basis for other audio tasks remains largely untapped.
- Interpretability and Explainability: Methods for understanding and explaining model decisions are needed, both to guide architecture design and to support accountability.
- Scalability and Efficiency: Developing architectures that can perform efficiently under resource constraints is critical for broader applicability.
This paper is a valuable resource for researchers and practitioners in audio signal processing, offering insights into the current state and potential future pathways for the integration of deep learning in this dynamic field.