
3D Audio-Visual Segmentation (2411.02236v1)

Published 4 Nov 2024 in cs.CV, cs.MM, cs.SD, and eess.AS

Abstract: Recognizing the sounding objects in scenes is a longstanding objective in embodied AI, with diverse applications in robotics and AR/VR/MR. To that end, Audio-Visual Segmentation (AVS), taking as condition an audio signal to identify the masks of the target sounding objects in an input image with synchronous camera and microphone sensors, has been recently advanced. However, this paradigm is still insufficient for real-world operation, as the mapping from 2D images to 3D scenes is missing. To address this fundamental limitation, we introduce a novel research problem, 3D Audio-Visual Segmentation, extending the existing AVS to the 3D output space. This problem poses more challenges due to variations in camera extrinsics, audio scattering, occlusions, and diverse acoustics across sounding object categories. To facilitate this research, we create the very first simulation based benchmark, 3DAVS-S34-O7, providing photorealistic 3D scene environments with grounded spatial audio under single-instance and multi-instance settings, across 34 scenes and 7 object categories. This is made possible by re-purposing the Habitat simulator to generate comprehensive annotations of sounding object locations and corresponding 3D masks. Subsequently, we propose a new approach, EchoSegnet, characterized by integrating the ready-to-use knowledge from pretrained 2D audio-visual foundation models synergistically with 3D visual scene representation through spatial audio-aware mask alignment and refinement. Extensive experiments demonstrate that EchoSegnet can effectively segment sounding objects in 3D space on our new benchmark, representing a significant advancement in the field of embodied AI. Project page: https://surrey-uplab.github.io/research/3d-audio-visual-segmentation/

Summary

  • The paper presents a novel 3D audio-visual segmentation framework that integrates 2D foundation models with 3D spatial mapping.
  • It employs Gaussian Splatting and an Audio-Informed Spatial Refinement Module to enhance segmentation accuracy in complex scenarios.
  • Experimental evaluations on the 3DAVS-S34-O7 benchmark show significant improvements in both single-instance and multi-instance settings.

Overview of "3D Audio-Visual Segmentation"

In the paper "3D Audio-Visual Segmentation," the authors investigate the longstanding challenge of recognizing and segmenting sounding objects within 3D environments using synchronized audio-visual inputs. This work addresses a key limitation of traditional 2D audio-visual segmentation (AVS) models, which lack the mapping from 2D images to 3D scenes needed for real-world operation. By extending AVS to three dimensions, the paper lays the foundation for advancements in areas such as robotics and AR/VR/MR.

Novelty and Methodology

The core innovation of this work is the introduction of 3D Audio-Visual Segmentation (3D AVS), which integrates spatial audio cues with 3D visual contexts to better localize and segment sounding objects. To support their research, the authors develop the 3DAVS-S34-O7 benchmark, utilizing the Habitat simulator to create photorealistic 3D environments featuring diverse acoustic scenarios.
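
To make the benchmark's contents concrete, the following is a minimal sketch of what a single 3DAVS-S34-O7 sample could contain, inferred from the paper's description of synchronized camera and microphone sensors, grounded spatial audio, and 3D mask annotations. The field names, shapes, and mask representation are assumptions for illustration, not the released data layout.

```python
from dataclasses import dataclass
import numpy as np


@dataclass
class AVSample:
    """Hypothetical layout of one benchmark sample (field names are assumptions)."""
    rgb: np.ndarray                 # (H, W, 3) photorealistic rendering of the scene
    depth: np.ndarray               # (H, W) depth map aligned with the RGB frame
    camera_extrinsics: np.ndarray   # (4, 4) world-to-camera transform for this view
    audio: np.ndarray               # (2, T) binaural waveform grounded at the source location
    source_positions: np.ndarray    # (K, 3) ground-truth 3D positions of the K sounding objects
    mask_3d: np.ndarray             # per-point / per-Gaussian instance labels for the 3D masks
    category: str                   # one of the 7 sounding-object categories
    scene_id: str                   # one of the 34 scenes
```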

At the heart of their methodology is EchoSegnet, a training-free 3D segmentation approach that capitalizes on existing pretrained 2D audio-visual foundation models. This method involves several key steps:

  • It first applies pretrained 2D foundation models to identify sounding-object masks in the input RGB frames, conditioned on the audio signal.
  • These 2D masks are then lifted into a 3D representation via 3D Gaussian Splatting (3D-GS), yielding a multi-view-consistent 3D segmentation.
  • To refine this initial 3D segmentation, an Audio-Informed Spatial Refinement Module (AISRM) leverages spatial audio to improve accuracy, particularly in multi-instance scenarios where noise and occlusion are prevalent.

The AISRM corrects ambiguous segmentations by utilizing spatial relationships and audio intensity maps, ensuring only the sound-emitting instances are accurately segmented.
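
A minimal, training-free sketch of this pipeline is shown below. The helpers `predict_2d_sounding_mask` (standing in for the pretrained 2D audio-visual foundation model), `project` (projection of 3D Gaussian centers into a camera view), and `audio_intensity_at` (a spatial-audio intensity estimate at a 3D location) are hypothetical, and the DBSCAN clustering is an assumed way to group candidate Gaussians into instances; the paper's actual AISRM procedure may differ.

```python
import numpy as np
from sklearn.cluster import DBSCAN


def lift_masks_to_3d(frames, centers, predict_2d_sounding_mask, project):
    """Aggregate 2D sounding-object masks into a soft per-Gaussian 3D mask.

    For each view, project the Gaussian centers into the image and count how
    often each center lands inside the predicted 2D mask.
    """
    votes = np.zeros(len(centers), dtype=float)
    for frame in frames:
        mask = predict_2d_sounding_mask(frame.rgb, frame.audio)   # (H, W) bool
        uv, visible = project(centers, frame.camera_extrinsics)   # int pixel coords, visibility
        rows = np.clip(uv[:, 1], 0, mask.shape[0] - 1)
        cols = np.clip(uv[:, 0], 0, mask.shape[1] - 1)
        votes += visible & mask[rows, cols]
    return votes / len(frames)                                    # soft 3D mask in [0, 1]


def audio_informed_refinement(centers, soft_mask, audio_intensity_at,
                              vote_thresh=0.5, intensity_thresh=0.5, eps=0.3):
    """Keep only clusters of labelled Gaussians that coincide with strong
    spatial-audio intensity (an approximation of the AISRM idea)."""
    selected = soft_mask > vote_thresh
    idx = np.flatnonzero(selected)
    if idx.size == 0:
        return selected
    labels = DBSCAN(eps=eps, min_samples=10).fit_predict(centers[idx])
    keep = np.zeros(idx.size, dtype=bool)
    for c in set(labels) - {-1}:                                  # -1 marks DBSCAN noise points
        member = labels == c
        centroid = centers[idx][member].mean(axis=0)
        if audio_intensity_at(centroid) > intensity_thresh:       # thresholds are assumptions
            keep |= member
    refined = np.zeros_like(selected)
    refined[idx[keep]] = True
    return refined
```

The idea mirrored here is that multi-view 2D evidence produces a soft 3D mask, and spatial audio then disambiguates which candidate instances are actually emitting sound.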

Experimental Evaluation

The authors conduct comprehensive experiments on their new benchmark, where EchoSegnet outperforms existing 2D AVS and SSL models in both single-instance and multi-instance settings. Ablation comparisons further show significant improvements in mIoU and F-Score when the AISRM is integrated into EchoSegnet.
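
For reference, both reported metrics can be computed per sample over binary 3D masks as in the sketch below; the definitions are the standard ones, while how the paper aggregates scores across scenes and instances is not specified here.

```python
import numpy as np


def iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """Intersection-over-union of two binary masks."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union else 1.0


def f_score(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-8) -> float:
    """Harmonic mean of precision and recall over binary masks."""
    tp = np.logical_and(pred, gt).sum()
    precision = tp / (pred.sum() + eps)
    recall = tp / (gt.sum() + eps)
    return 2 * precision * recall / (precision + recall + eps)


# mIoU is then the mean IoU over all evaluated samples:
# miou = np.mean([iou(p, g) for p, g in zip(preds, gts)])
```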

Implications and Future Directions

The implications of this research are twofold: it contributes a robust 3D audio-visual segmentation framework that exploits spatial audio for more accurate 3D representations, and it provides a benchmark dataset that enables further research into embodied AI applications. Practically, this work suggests improvements for AI systems in navigation and interaction within 3D environments. Theoretically, it builds a bridge between 2D and 3D audio-visual understanding, expanding the scope of multimodal AI.

Future explorations might involve extending 3D AVS to incorporate more complex and dynamic acoustic scenarios, potentially involving moving sources and changing environments. Additionally, integrating EchoSegnet with other modalities or AI systems could enhance its applicability in domains beyond robotics and AR/VR.

In conclusion, "3D Audio-Visual Segmentation" presents a significant advancement in embodied AI, contributing the task formulation, benchmark, and methodology needed to bridge the gap between 2D AVS and spatially aware 3D scene understanding.