Filtered Noise Shaping for Time Domain Room Impulse Response Estimation From Reverberant Speech

Published 15 Jul 2021 in eess.AS and cs.SD | (2107.07503v1)

Abstract: Deep learning approaches have emerged that aim to transform an audio signal so that it sounds as if it was recorded in the same room as a reference recording, with applications both in audio post-production and augmented reality. In this work, we propose FiNS, a Filtered Noise Shaping network that directly estimates the time domain room impulse response (RIR) from reverberant speech. Our domain-inspired architecture features a time domain encoder and a filtered noise shaping decoder that models the RIR as a summation of decaying filtered noise signals, along with direct sound and early reflection components. Previous methods for acoustic matching utilize either large models to transform audio to match the target room or predict parameters for algorithmic reverberators. Instead, blind estimation of the RIR enables efficient and realistic transformation with a single convolution. An evaluation demonstrates our model not only synthesizes RIRs that match parameters of the target room, such as the $T_{60}$ and DRR, but also more accurately reproduces perceptual characteristics of the target room, as shown in a listening test when compared to deep learning baselines.

Citations (35)

Summary

The paper introduces FiNS, a novel deep learning framework that estimates RIRs directly from reverberant speech using filtered noise shaping.
It employs a time domain encoder-decoder architecture with multiresolution STFT loss to capture both impulsive and noise-like acoustic components.
FiNS outperforms existing models in replicating key acoustic parameters such as $T_{60}$ and DRR, validated by objective metrics and listening tests.

Introduction

The paper "Filtered Noise Shaping for Time Domain Room Impulse Response Estimation From Reverberant Speech" (2107.07503) introduces FiNS, a deep learning framework that estimates room impulse responses (RIRs) directly from reverberant speech. The work targets audio post-production and augmented reality, where a synthesized RIR lets new audio sound as if it was recorded in the same room as a reference recording. Blind RIR estimation is also relevant to dereverberation, speech recognition, and virtual sound generation, where measuring the RIR directly is often impractical.

The core contribution is the FiNS architecture, which pairs a time domain encoder with a filtered noise shaping decoder. The decoder models the RIR as a summation of decaying filtered noise signals, along with direct sound and early reflection components. Because the output is an RIR, the acoustic transformation can be applied efficiently and realistically with a single convolution.
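The signal model behind the decoder can be illustrated with a small NumPy sketch. This is not the FiNS network: the band edges, per-band decay times (`t60s`), early reflection positions and gains (`early`), and the function name `synth_rir` are all hypothetical values chosen for illustration, whereas in the paper the corresponding quantities are produced by the trained model.

```python
import numpy as np

def synth_rir(sr=16000, dur=1.0, t60s=(0.9, 0.7, 0.5, 0.3),
              early=((0, 1.0), (37, 0.5), (91, 0.3)), seed=0):
    """Toy RIR: direct/early impulses plus a sum of decaying band-limited noise.

    All parameter values are illustrative assumptions, not values from the paper.
    """
    rng = np.random.default_rng(seed)
    n = int(sr * dur)
    t = np.arange(n) / sr
    n_bands = len(t60s)
    # Split the rFFT bins into equal-width bands (an assumed, simplistic filterbank).
    edges = np.linspace(0, n // 2 + 1, n_bands + 1).astype(int)
    late = np.zeros(n)
    for b in range(n_bands):
        spec = np.fft.rfft(rng.standard_normal(n))
        mask = np.zeros_like(spec)
        mask[edges[b]:edges[b + 1]] = 1.0
        band_noise = np.fft.irfft(spec * mask, n=n)   # band-limited noise
        envelope = 10.0 ** (-3.0 * t / t60s[b])       # -60 dB after t60s[b] seconds
        late += band_noise * envelope
    late *= 0.1 / np.max(np.abs(late))                # keep the tail below the direct sound
    rir = late.copy()
    for index, gain in early:                         # direct sound and early reflections
        rir[index] += gain
    return rir

rir = synth_rir()
```

Each band gets its own decay envelope, so the sketch can express frequency-dependent reverberation time, which is the main reason to shape several filtered noise signals rather than a single one.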

Figure 1: FiNS: Filtered noise shaping RIR synthesis network.

Methodology

Previous acoustic matching methods either use large models to transform the audio directly or predict parameters for algorithmic reverberators. FiNS instead estimates the RIR blindly, from reverberant speech alone, with no separate acoustic measurement of the room. The room is treated as a linear time-invariant system, so the reverberant speech is the convolution of anechoic speech with the RIR. Once the RIR is estimated, new audio can be transformed with a single convolution, without the computational overhead of a large model.
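The linear time-invariant assumption can be made concrete in a few lines. The sketch below uses NumPy's `np.convolve` on toy arrays; the helper name `apply_rir` and the sample values are illustrative assumptions, not part of the paper.

```python
import numpy as np

def apply_rir(dry, rir):
    """Reverberant signal as the full linear convolution of a dry signal with an RIR."""
    return np.convolve(dry, rir, mode="full")

# Toy example: a three-sample "dry" signal and an RIR with one delayed reflection.
dry = np.array([1.0, 0.5, -0.25])
rir = np.array([1.0, 0.0, 0.5])
wet = apply_rir(dry, rir)   # length is len(dry) + len(rir) - 1
```

For realistic signal lengths an FFT-based convolution is usually preferable for speed, but the result is the same operation.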

The encoder in FiNS employs strided 1-D convolutions to downsample the input signal, capturing both small and large time scales of the RIR. The decoder uses a noise shaping strategy, leveraging the physical properties of room acoustics where late reverberation is modeled as decaying noise. The model is trained using a multiresolution STFT loss to accurately capture both the impulsive and noise-like components of real-world RIRs.
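A multiresolution STFT loss compares magnitude spectrograms at several window sizes, so that short windows resolve impulsive events and long windows resolve the noise-like tail. The sketch below shows one common formulation (spectral convergence plus log-magnitude distance, averaged over resolutions) in plain NumPy. The FFT sizes, hop lengths, and the exact combination of terms are assumptions here; the summary does not state the configuration used in the paper.

```python
import numpy as np

def stft_mag(x, n_fft, hop):
    """Magnitude STFT with a Hann window; a trailing partial frame is dropped."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * window for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))

def mr_stft_loss(pred, target, fft_sizes=(64, 256, 1024), eps=1e-7):
    """Average of spectral convergence + log-magnitude L1 over several resolutions."""
    total = 0.0
    for n_fft in fft_sizes:
        hop = n_fft // 4                      # assumed 75% overlap
        p = stft_mag(pred, n_fft, hop)
        t = stft_mag(target, n_fft, hop)
        spectral_convergence = np.linalg.norm(t - p) / (np.linalg.norm(t) + eps)
        log_magnitude = np.mean(np.abs(np.log(t + eps) - np.log(p + eps)))
        total += spectral_convergence + log_magnitude
    return total / len(fft_sizes)

rng = np.random.default_rng(0)
target = rng.standard_normal(4096)
loss_same = mr_stft_loss(target, target)          # identical signals
loss_scaled = mr_stft_loss(0.5 * target, target)  # same shape, wrong level
```

Both signals must be at least as long as the largest FFT size. In training, the same computation would be written in a differentiable framework; NumPy is used here only to show the arithmetic.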

Figure 2: Encoder and decoder block structures.

Results

The evaluation of FiNS demonstrates its effectiveness in generating high-fidelity RIRs that closely match the acoustic characteristics of target environments. Objective metrics indicate the model's proficiency in replicating parameters such as $T_{60}$ and Direct-to-Reverberant Ratio (DRR) with high accuracy. FiNS outperforms existing deep learning baselines, particularly in generating perceptually realistic RIRs without the ringing artifacts observed in simpler models like Wave-U-Net.
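For readers unfamiliar with the two parameters, the sketch below shows one conventional way to compute them from an RIR: $T_{60}$ by fitting a line to the Schroeder backward-integrated energy decay curve and extrapolating to 60 dB, and DRR as the energy ratio between a short window around the direct sound and everything after it. The fit range (-5 to -25 dB) and the 2.5 ms direct-sound window are common conventions assumed here, not the evaluation procedure stated in the paper.

```python
import numpy as np

def estimate_t60(rir, sr, lo_db=-5.0, hi_db=-25.0):
    """T60 extrapolated from a line fit to the Schroeder energy decay curve."""
    edc = np.cumsum(rir[::-1] ** 2)[::-1]            # backward-integrated energy
    edc_db = 10.0 * np.log10(edc / edc[0] + 1e-12)
    idx = np.where((edc_db <= lo_db) & (edc_db >= hi_db))[0]
    slope, _ = np.polyfit(idx / sr, edc_db[idx], 1)  # dB per second (negative)
    return -60.0 / slope

def estimate_drr(rir, sr, window_ms=2.5):
    """DRR in dB: energy around the strongest peak vs. energy after that window."""
    peak = int(np.argmax(np.abs(rir)))
    half = int(sr * window_ms / 1000)
    start, end = max(0, peak - half), peak + half + 1
    direct = np.sum(rir[start:end] ** 2)
    reverb = np.sum(rir[end:] ** 2)
    return 10.0 * np.log10(direct / (reverb + 1e-12))

# Toy RIR: a pure exponential decay with a nominal T60 of 0.5 s.
sr = 16000
t = np.arange(sr) / sr
toy_rir = 10.0 ** (-3.0 * t / 0.5)
t60 = estimate_t60(toy_rir, sr)
drr = estimate_drr(toy_rir, sr)
```

Comparing these values between an estimated RIR and the ground-truth RIR gives the kind of objective error the results refer to.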

Figure 3: 2-D projections of embeddings from the encoder for unseen examples, colored by (a) ground-truth DRR and (b) room ID.

A subjective listening test further verifies the superiority of FiNS, where listeners rated the FiNS-generated RIRs as more acoustically faithful to the reference recordings than those produced by other models.

Figure 4: Listening test results.

Discussion and Future Work

The implications of this research are significant for advancements in virtual and augmented reality, particularly in enhancing the realism of audio experiences. FiNS provides a lightweight, efficient means of RIR reconstruction that is scalable and adaptable to various acoustic environments.

Future research directions may include exploring data augmentation techniques to cover a broader array of acoustic settings and employing adversarial training strategies to further refine synthesis quality. Integration with multimedia systems, leveraging audio-visual cues for enhanced environmental modeling, represents another promising avenue for development.

Conclusion

FiNS sets a new benchmark for time domain RIR estimation from reverberant speech, offering a comprehensive solution that combines computational efficiency with perceptual accuracy. This work not only advances current understanding of acoustic modeling but also opens up new possibilities for applications requiring realistic soundfield reproductions.

Authors (3)
