Real Acoustic Fields: An Audio-Visual Room Acoustics Dataset and Benchmark

Published 27 Mar 2024 in cs.SD, cs.CV, cs.MM, and eess.AS (arXiv:2403.18821v1)

Abstract: We present Real Acoustic Fields (RAF), a new dataset that captures real acoustic room data across multiple modalities. The dataset includes high-quality, densely captured room impulse response data paired with multi-view images and precise 6DoF pose tracking for sound emitters and listeners in each room. We use this dataset to evaluate existing methods for novel-view acoustic synthesis and impulse response generation, which previously relied on synthetic data. In our evaluation, we thoroughly assess existing audio and audio-visual models against multiple criteria and propose settings that improve their performance on real-world data. We also conduct experiments investigating the impact of incorporating visual data (i.e., images and depth) into neural acoustic field models. Additionally, we demonstrate the effectiveness of a simple sim2real approach, in which a model is pre-trained on simulated data and fine-tuned on sparse real-world data, yielding significant improvements in the few-shot setting. RAF is the first dataset to provide densely captured room acoustic data, making it an ideal resource for researchers working on audio and audio-visual neural acoustic field modeling. Demos and datasets are available on our project page: https://facebookresearch.github.io/real-acoustic-fields/
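The room impulse responses (RIRs) at the heart of the dataset characterize how a room transforms sound between an emitter and a listener: the reverberant signal at the listener is the dry source signal convolved with the RIR measured between the two poses. A minimal NumPy sketch of this operation (the function name and toy signals are illustrative, not from the paper):

```python
import numpy as np

def apply_rir(dry_signal: np.ndarray, rir: np.ndarray) -> np.ndarray:
    """Convolve a dry (anechoic) signal with a room impulse response.

    The result models what a listener hears at the pose where the RIR
    was measured: the direct path plus the room's reflections.
    """
    return np.convolve(dry_signal, rir)

# Toy example: a unit impulse as the "dry" signal, and a two-tap RIR
# with a direct path at t=0 and one echo at t=2 at half amplitude.
dry = np.array([1.0, 0.0, 0.0])
rir = np.array([1.0, 0.0, 0.5])
wet = apply_rir(dry, rir)  # the RIR itself, zero-padded to length 5
```

Evaluating generated RIRs against densely measured ground truth, as RAF enables, amounts to comparing the `rir` used in such a convolution against the response actually recorded at that emitter/listener pose.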
