- The paper presents a novel reverberation module for source-filter neural vocoders that integrates global time-invariant (GTI) and utterance-level time-variant (UTV) room impulse response (RIR) estimation methods.
- It applies the module to HiNet’s phase spectrum predictor, achieving improved speech synthesis quality in diverse acoustic environments.
- Experimental results show that the UTV approach outperforms the GTI method in adapting to unseen reverberation conditions, enhancing audio realism.
Reverberation Modeling in Source-Filter-Based Neural Vocoders
The paper advances source-filter-based neural vocoding by addressing the challenge of modeling room reverberation. Its main contribution is a reverberation module that can be attached to neural vocoders such as HiNet, giving them the ability to simulate room reverberation. The work is motivated by the limitations of classical signal-processing vocoders in generating natural-sounding speech, combined with the challenges reverberation poses in realistic audio environments.
The proposed reverberation module produces reverberant output by convolving a room impulse response (RIR) with the waveform generated by a neural vocoder. Two methods for parameterizing and estimating the RIR are implemented: the global time-invariant (GTI) RIR and the utterance-level time-variant (UTV) RIR. The GTI approach assumes a single static RIR across the entire dataset, while the UTV approach uses a neural network to predict an RIR specific to each utterance, acknowledging intra-dataset variability.
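The core operation described above is a linear convolution of a dry (anechoic) waveform with an RIR. The following is a minimal NumPy sketch of that step, not the paper's implementation: the toy exponential-decay RIR, the sampling rate, and the sine-wave stand-in for the vocoder output are all illustrative assumptions. In the GTI setting the same `rir` would be shared across the dataset; in the UTV setting it would instead be predicted per utterance by a network.

```python
import numpy as np

def apply_reverb(dry_wave: np.ndarray, rir: np.ndarray) -> np.ndarray:
    """Convolve a dry waveform with a room impulse response (RIR)
    to obtain the reverberant (wet) waveform."""
    # FFT-based linear convolution; pad to the full output length
    # so no circular wrap-around occurs.
    n = len(dry_wave) + len(rir) - 1
    n_fft = 1 << (n - 1).bit_length()  # next power of two >= n
    wet = np.fft.irfft(
        np.fft.rfft(dry_wave, n_fft) * np.fft.rfft(rir, n_fft), n_fft
    )[:n]
    return wet

# Toy GTI-style RIR: a direct path followed by exponentially decaying
# noise (a common rough model of a room response; illustrative only).
sr = 16000
t = np.arange(sr // 2) / sr
rir = 0.1 * np.exp(-6.0 * t) * np.random.default_rng(0).standard_normal(len(t))
rir[0] = 1.0  # direct path

# Stand-in for a vocoder-generated dry waveform.
dry = np.sin(2 * np.pi * 220 * np.arange(sr) / sr)
wet = apply_reverb(dry, rir)
```

The wet signal has length `len(dry) + len(rir) - 1`; in practice it would be truncated or cross-faded back to the original utterance length before evaluation.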
The module is integrated experimentally into HiNet's phase spectrum predictor (PSP) to improve synthesis quality under reverberant conditions. The most notable finding is that UTV outperforms GTI under unseen reverberation scenarios, reflecting UTV's adaptability to varying acoustic conditions. Numerical results indicate a perceptual improvement in the quality of reverberant speech synthesized with the proposed module.
Practically, this research may lead to more realistic and higher-quality audio synthesis in environments with complex acoustic properties, crucial for telecommunication systems, virtual reality, and gaming industries. Theoretically, it opens pathways for further exploration into dynamic reverberation modeling, possibly incorporating real-time adaptive systems. Moreover, it could inspire new models that employ environment-aware feedback loops enhancing both RIR estimation precision and synthesis quality.
Future work may focus on refining the network architectures for RIR prediction, reducing computational complexity, and enhancing training strategies to generalize better across diverse acoustic conditions. The work also underscores the potential of multi-task learning, as evidenced by the exploratory adoption of a dry-waveform training task for refining RIR estimates, though with marginal impact on perceptual quality in its current form.
In summary, the paper effectively addresses a nuanced element of neural vocoding by incorporating an adaptive reverberation simulation mechanism, augmenting the realism and fidelity of synthesized speech in varied acoustic environments. This development resonates with the current trajectory towards increasing the sophistication and application range of neural network-based audio processing systems.