DDD: A Perceptually Superior Low-Response-Time DNN-based Declipper (2401.03650v1)
Abstract: Clipping is a common nonlinear distortion that occurs whenever the input or output of an audio system exceeds its supported range. It degrades not only the perceived quality of speech but also downstream processes that consume the distorted signal. A robust speech-declipping (SD) method that is real-time capable and has a low response time is therefore desirable. In this work, we introduce DDD (Demucs-Discriminator-Declipper), a real-time-capable speech-declipping deep neural network (DNN) that reduces response time by design. We first observe that Demucs, a real-time-capable DNN model previously untested on declipping, achieves reasonable declipping performance. We then apply adversarial learning objectives to raise the perceptual quality of the output speech without adding inference overhead. Subjective evaluations on harshly clipped speech show that DDD outperforms the baselines by a wide margin in terms of speech quality. We perform detailed waveform and spectral analyses to gain insight into the output behavior of DDD relative to the baselines. Finally, our streaming simulations show that DDD achieves sub-decisecond mean response times, outperforming the state-of-the-art DNN approach by a factor of six.
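To make the setting concrete, below is a minimal sketch of the hard-clipping distortion model the abstract describes (saturation wherever the waveform exceeds the supported range), an input-SDR measure often used to quantify clipping severity, and an LS-GAN-style adversarial objective. The abstract says only "adversarial learning objectives", so the LS-GAN form, the function names, and the threshold parameterization here are illustrative assumptions, not the authors' implementation.

```python
import math
import torch

# --- Distortion model: the waveform saturates wherever it exceeds ---
# --- the supported range [-threshold, +threshold].                ---
def hard_clip(x: torch.Tensor, threshold: float) -> torch.Tensor:
    return x.clamp(-threshold, threshold)

def sdr_db(reference: torch.Tensor, estimate: torch.Tensor) -> float:
    """Signal-to-distortion ratio in dB; lower values mean harsher clipping."""
    noise = reference - estimate
    return 10.0 * math.log10(
        reference.pow(2).sum().item() / noise.pow(2).sum().item()
    )

# Least-squares GAN losses (an assumed choice of adversarial objective).
# The discriminator pushes clean speech toward 1 and declipped outputs
# toward 0; the generator (the declipper) pushes its outputs toward 1.
def lsgan_losses(disc_real: torch.Tensor, disc_fake: torch.Tensor):
    d_loss = ((disc_real - 1) ** 2).mean() + (disc_fake ** 2).mean()
    g_loss = ((disc_fake - 1) ** 2).mean()
    return d_loss, g_loss

# Example: a 1 kHz tone clipped at a quarter of its peak amplitude.
sr = 16_000
t = torch.arange(sr) / sr
clean = torch.sin(2 * torch.pi * 1000 * t)
clipped = hard_clip(clean, threshold=0.25)
print(f"input SDR of the clipped tone: {sdr_db(clean, clipped):.1f} dB")
```

Note that the discriminator is used only during training and discarded at inference, which is why an adversarial objective can improve perceptual quality without adding inference overhead or response time.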