
Phase-aware Speech Enhancement with Deep Complex U-Net (1903.03107v2)

Published 7 Mar 2019 in cs.SD, cs.LG, eess.AS, and stat.ML

Abstract: Most deep learning-based models for speech enhancement have mainly focused on estimating the magnitude of spectrogram while reusing the phase from noisy speech for reconstruction. This is due to the difficulty of estimating the phase of clean speech. To improve speech enhancement performance, we tackle the phase estimation problem in three ways. First, we propose Deep Complex U-Net, an advanced U-Net structured model incorporating well-defined complex-valued building blocks to deal with complex-valued spectrograms. Second, we propose a polar coordinate-wise complex-valued masking method to reflect the distribution of complex ideal ratio masks. Third, we define a novel loss function, weighted source-to-distortion ratio (wSDR) loss, which is designed to directly correlate with a quantitative evaluation measure. Our model was evaluated on a mixture of the Voice Bank corpus and DEMAND database, which has been widely used by many deep learning models for speech enhancement. Ablation experiments were conducted on the mixed dataset showing that all three proposed approaches are empirically valid. Experimental results show that the proposed method achieves state-of-the-art performance in all metrics, outperforming previous approaches by a large margin.
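The abstract's third contribution, the weighted source-to-distortion ratio (wSDR) loss, combines an SDR-style term for the speech estimate with the same term applied to the implied noise estimate, weighted by their relative energies. A minimal NumPy sketch of this idea, assuming time-domain signals and a negative-cosine-similarity form of the SDR term (function name and array shapes are illustrative, not taken from the authors' code):

```python
import numpy as np

def wsdr_loss(x, y, y_hat, eps=1e-8):
    """Sketch of a weighted SDR loss.

    x: noisy mixture, y: clean target, y_hat: enhanced estimate
    (all 1-D time-domain signals of equal length).
    """
    z = x - y          # true noise component
    z_hat = x - y_hat  # noise implied by the estimate

    def neg_sdr(a, b):
        # negative cosine similarity between two waveforms;
        # equals -1 when b is a positive scaling of a
        return -np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps)

    # energy-based weight between the speech and noise terms
    alpha = np.sum(y**2) / (np.sum(y**2) + np.sum(z**2) + eps)
    return alpha * neg_sdr(y, y_hat) + (1 - alpha) * neg_sdr(z, z_hat)
```

A perfect estimate (`y_hat == y`) drives both terms to -1, so the loss is bounded below by -1 and decreases as the estimate improves, which is what lets it correlate directly with a quantitative evaluation measure.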

Analysis of Non-Accessible arXiv Submission: arXiv:1903.03107v2 (Choi et al., 2019)

This analysis addresses the limitations in reviewing the paper with arXiv identifier 1903.03107v2 (Choi et al., 2019). As of the time of this assessment, the PDF and source material for this submission were inaccessible, preventing a detailed evaluation of its claims, methodologies, and findings.

Observations

The arXiv entry provides minimal metadata. Its primary classification, cs.SD, places the submission in computer science with a specialization in sound or audio processing, while the cross-listings (cs.LG, eess.AS, stat.ML) indicate connections to machine learning and audio/speech signal processing. However, without access to the textual or graphical content of the paper, any analysis remains largely speculative.

Implications and Speculative Considerations

Given the absence of content, one can only speculate on the typical research directions within the cs.SD domain that this paper might address, such as sound recognition, audio signal processing, or machine learning applied in auditory contexts. A paper in this area might propose novel algorithms or enhancements that improve the computational efficiency or accuracy of sound processing systems.

Without specific insights into the paper's findings, it is challenging to postulate the direct implications or future developments in AI that the submission might have suggested. Generally, advancements in sound processing can have significant applications across diverse fields, including speech recognition, multimedia systems, and AI-driven accessibility tools.

Concluding Remarks

The inability to access the content of submission arXiv:1903.03107v2 underscores the importance of open access and comprehensive archiving in the research community. This analysis serves as a placeholder for documentation purposes and as a reminder that infrastructural improvements on platforms such as arXiv are needed. Future inquiries and peer engagement will benefit from rectifying such access issues, fostering more inclusive and collaborative academic discourse.

Authors (6)
  1. Hyeong-Seok Choi (16 papers)
  2. Jang-Hyun Kim (11 papers)
  3. Jaesung Huh (24 papers)
  4. Adrian Kim (5 papers)
  5. Jung-Woo Ha (67 papers)
  6. Kyogu Lee (75 papers)
Citations (309)