
Self-Supervised Visual Acoustic Matching (2307.15064v2)

Published 27 Jul 2023 in cs.MM, cs.CV, cs.SD, and eess.AS

Abstract: Acoustic matching aims to re-synthesize an audio clip to sound as if it were recorded in a target acoustic environment. Existing methods assume access to paired training data, where the audio is observed in both source and target environments, but this limits the diversity of training data or requires the use of simulated data or heuristics to create paired samples. We propose a self-supervised approach to visual acoustic matching where training samples include only the target scene image and audio -- without acoustically mismatched source audio for reference. Our approach jointly learns to disentangle room acoustics and re-synthesize audio into the target environment, via a conditional GAN framework and a novel metric that quantifies the level of residual acoustic information in the de-biased audio. Training with either in-the-wild web data or simulated data, we demonstrate it outperforms the state-of-the-art on multiple challenging datasets and a wide variety of real-world audio and environments.
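
The abstract describes the method only at a high level. As a purely illustrative aid, the following is a minimal, hypothetical PyTorch-style sketch of what one self-supervised training step could look like under that description: de-bias the observed target audio, re-synthesize it into the pictured environment via a conditional GAN, and penalize residual acoustic information left in the de-biased audio. All module names, architectures, and loss weights below are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch only: a minimal training step consistent with the abstract's
# description (de-bias target audio, re-synthesize into the pictured room via a
# conditional GAN, penalize residual acoustics in the de-biased audio). Every module
# name, architecture, and weight here is an illustrative assumption, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpecNet(nn.Module):
    """Tiny stand-in for a spectrogram-to-spectrogram network, optionally image-conditioned."""
    def __init__(self, out_ch=1, cond_dim=0):
        super().__init__()
        in_ch = 1 + (1 if cond_dim else 0)
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, out_ch, 3, padding=1),
        )
        self.proj = nn.Linear(cond_dim, 1) if cond_dim else None

    def forward(self, spec, cond=None):
        if self.proj is not None:
            # Broadcast a 1-channel conditioning map from the scene-image embedding.
            b, _, f, t = spec.shape
            c = self.proj(cond).view(b, 1, 1, 1).expand(b, 1, f, t)
            spec = torch.cat([spec, c], dim=1)
        return self.net(spec)

image_enc = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 128))  # scene image -> embedding
debias    = SpecNet()              # strips room acoustics from the observed audio
resynth   = SpecNet(cond_dim=128)  # re-applies acoustics of the pictured environment
disc      = SpecNet(cond_dim=128)  # conditional discriminator (observed vs. re-synthesized)
probe     = SpecNet()              # residual-acoustics estimator (assumed pretrained and frozen)

opt_g = torch.optim.Adam(
    list(debias.parameters()) + list(resynth.parameters()) + list(image_enc.parameters()),
    lr=1e-4)
opt_d = torch.optim.Adam(disc.parameters(), lr=1e-4)

def training_step(target_spec, scene_image):
    """target_spec: (B,1,F,T) log-spectrogram of in-the-wild audio; scene_image: (B,3,64,64)."""
    img_emb = image_enc(scene_image)

    # Generator: de-bias the audio, then re-synthesize it into the pictured environment.
    dry = debias(target_spec)
    rec = resynth(dry, img_emb)
    recon    = F.l1_loss(rec, target_spec)          # cycle back to the observed (reverberant) audio
    residual = probe(dry).pow(2).mean()             # penalize acoustic cues left in the de-biased audio
    adv_g    = -disc(rec, img_emb).mean()           # fool the conditional discriminator
    loss_g = recon + 0.1 * residual + 0.01 * adv_g  # weights are arbitrary placeholders
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()

    # Discriminator: observed audio vs. re-synthesized audio, both conditioned on the image.
    emb = img_emb.detach()
    loss_d = (F.relu(1 - disc(target_spec, emb)).mean() +
              F.relu(1 + disc(rec.detach(), emb)).mean())
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()
    return loss_g.item(), loss_d.item()
```

In the paper's framing, the residual-acoustics term is a learned metric quantifying acoustic information remaining in the de-biased audio, and the GAN is conditioned on visual features of the target scene; this sketch only indicates where such terms would enter a training loop.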
