ReverbMiipher: Generative Speech Restoration meets Reverberation Characteristics Controllability (2505.05077v1)

Published 8 May 2025 in cs.SD and eess.AS

Abstract: Reverberation encodes spatial information regarding the acoustic source environment, yet traditional Speech Restoration (SR) usually completely removes reverberation. We propose ReverbMiipher, an SR model extending parametric resynthesis framework, designed to denoise speech while preserving and enabling control over reverberation. ReverbMiipher incorporates a dedicated ReverbEncoder to extract a reverb feature vector from noisy input. This feature conditions a vocoder to reconstruct the speech signal, removing noise while retaining the original reverberation characteristics. A stochastic zero-vector replacement strategy during training ensures the feature specifically encodes reverberation, disentangling it from other speech attributes. This learned representation facilitates reverberation control via techniques such as interpolation between features, replacement with features from other utterances, or sampling from a latent space. Objective and subjective evaluations confirm ReverbMiipher effectively preserves reverberation, removes other artifacts, and outperforms the conventional two-stage SR and convolving simulated room impulse response approach. We further demonstrate its ability to generate novel reverberation effects through feature manipulation.

Summary

The paper presents ReverbMiipher, a Speech Restoration (SR) model that removes noise while preserving the reverberation of the input and allowing it to be controlled. Reverberation carries spatial information about the environment in which the speech was recorded, so conventional SR methods that strip it out entirely discard that information. ReverbMiipher addresses this limitation by building on a parametric resynthesis framework.

Overview of the ReverbMiipher Model

ReverbMiipher extends Miipher-2, a parametric resynthesis model, with a ReverbEncoder network that extracts a dedicated reverb feature vector from the noisy input. This vector conditions the vocoder, which reconstructs speech that is free of noise but retains the original reverberation. During training, the reverb feature is stochastically replaced with a zero vector. According to the paper, this ensures that the feature encodes reverberation specifically, disentangling it from other speech attributes.
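The zero-vector replacement step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the probability `p_zero`, and the rule that a zeroed feature is paired with a dry (non-reverberant) target are all assumptions made for the example.

```python
import numpy as np

def replace_with_zero(reverb_feature, p_zero, rng):
    """With probability p_zero, swap the reverb feature for a zero vector.

    Returns the (possibly zeroed) feature and a flag saying whether it was zeroed.
    p_zero is a hypothetical hyperparameter, not a value taken from the paper.
    """
    if rng.random() < p_zero:
        return np.zeros_like(reverb_feature), True
    return reverb_feature, False

def pick_target(reverberant_target, dry_target, was_zeroed):
    # Assumption: a zeroed feature is paired with the dry target, so the
    # vocoder can only obtain reverberation cues from the feature itself.
    return dry_target if was_zeroed else reverberant_target

rng = np.random.default_rng(0)
feature = rng.normal(size=8)  # stand-in for a ReverbEncoder output
conditioned, was_zeroed = replace_with_zero(feature, p_zero=0.5, rng=rng)
```

Under these assumptions, the model sees both conditions during training, and the only signal that distinguishes a reverberant target from a dry one is the content of the feature vector.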

Because reverberation is captured in a single feature vector, it can be controlled by interpolating between features, substituting the feature from another utterance, or sampling from a latent space. The same operations can be used to generate novel reverberation effects. The paper reports that ReverbMiipher outperforms the conventional two-stage approach of applying SR and then convolving the result with a simulated room impulse response (RIR).
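As a rough sketch of feature interpolation, two reverb feature vectors can be blended with a mixing weight. The linear blend and the name `interpolate_features` are assumptions for illustration; the summary does not specify which interpolation scheme the paper uses.

```python
import numpy as np

def interpolate_features(feat_a, feat_b, alpha):
    """Linear blend: alpha=0 returns feat_a, alpha=1 returns feat_b."""
    return (1.0 - alpha) * feat_a + alpha * feat_b

# Toy vectors standing in for reverb features from two different utterances.
feat_a = np.array([0.0, 2.0, -1.0])
feat_b = np.array([4.0, 0.0, 1.0])
halfway = interpolate_features(feat_a, feat_b, 0.5)  # array([2., 1., 0.])
```

Replacement with a feature from another utterance corresponds to the special case `alpha = 1`, where the vocoder is conditioned entirely on the second utterance's feature.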

Evaluation Results

In objective and subjective evaluations, the paper reports that ReverbMiipher preserves reverberation characteristics better than the conventional approach while still removing other artifacts. In the subjective tests, its reverberation was judged the most similar to the ground-truth samples. The objective results point the same way: the model achieves low Mel-Cepstral Distortion (MCD) and preserves speaker similarity while keeping Gross Pitch Error (GPE) under control.
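For reference, MCD is commonly computed from mel-cepstral coefficients as shown below. This is the widely used textbook definition, offered as a sketch; the paper's exact configuration (cepstral order, frame alignment, whether the energy coefficient is excluded) is not stated in this summary, so those choices are assumptions.

```python
import numpy as np

def mel_cepstral_distortion(ref_mcep, est_mcep):
    """Mean MCD in dB for two arrays of shape (frames, dims).

    Assumes the frames are already time-aligned and that the 0th (energy)
    coefficient has been dropped before calling this function.
    """
    diff = ref_mcep - est_mcep
    per_frame = np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return float((10.0 / np.log(10.0)) * np.mean(per_frame))
```

Lower values indicate that the restored speech is spectrally closer to the reference.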

Additionally, a Principal Component Analysis (PCA) visualization of the reverb feature vectors shows that they vary continuously with standard acoustic measures such as reverberation time (RT60) and Direct-to-Reverberant Ratio (DRR). This continuity supports the claim that the feature allows fine-grained control over reverberation.
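A PCA projection of this kind can be sketched with NumPy alone. The random matrix below is a placeholder for real reverb features, and the feature dimension of 16 is arbitrary; neither comes from the paper.

```python
import numpy as np

def pca_project(features, n_components=2):
    """Project (samples, dims) features onto their top principal components."""
    centered = features - features.mean(axis=0, keepdims=True)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

rng = np.random.default_rng(0)
fake_features = rng.normal(size=(100, 16))  # placeholder for reverb features
points_2d = pca_project(fake_features)
```

In an analysis like the one described, each projected point would then be colored by the RT60 or DRR measured for its utterance to check whether those quantities change smoothly across the plot.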

Implications and Future Developments

Practically, the framework is relevant to audio editing, immersive media applications, and the creation of realistic datasets for acoustic modeling. Conceptually, it challenges the assumption that SR should always remove reverberation, and instead treats reverberation as part of the spatial audio representation that can be preserved and manipulated.

Future work could extend the same design to control other environmental acoustic attributes, such as background noise or directional sound characteristics. As generative models are applied more widely to media creation and spatial acoustics, the approach could also inform audio synthesis for AR/VR and dynamic soundscapes in interactive systems.

Overall, ReverbMiipher shifts the objective of SR from removing all acoustic context to restoring speech while keeping reverberation available as a controllable attribute.