The Audio-Visual BatVision Dataset for Research on Sight and Sound (2303.07257v3)
Abstract: Vision research has achieved remarkable success in understanding our world, propelled by datasets of images and videos. Sensor data from radar, LiDAR and cameras has supported research in robotics and autonomous driving for at least a decade. However, visual sensors may fail in some conditions, and sound has recently shown potential to complement them. Simulated room impulse responses (RIRs) in 3D apartment models have become a benchmark dataset for the community, fostering a range of audio-visual research. In simulation, depth can be predicted from sound by training a neural network to learn bat-like perception. Concurrently, the same has been achieved on real data using RGB-D images and echoes of chirping sounds. Biomimicking bat perception is an exciting new direction but requires dedicated datasets to explore its potential. We therefore collected the BatVision dataset to provide large-scale echoes in complex real-world scenes to the community. We equipped a robot with a speaker to emit chirps and a binaural microphone to record their echoes. Synchronized RGB-D images from the same perspective provide visual labels of the traversed spaces. We sampled scenes ranging from modern US office spaces to historic French university grounds, indoors and outdoors, with large architectural variety. This dataset will enable research on robot echolocation, general audio-visual tasks, and sound phenomena unavailable in simulated data. We show promising results for audio-only depth prediction and demonstrate that state-of-the-art methods developed for simulated data can also succeed on our dataset. Project page: https://amandinebtto.github.io/Batvision-Dataset/
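The pipeline the abstract describes (emit a chirp, record its binaural echo, and feed a time-frequency representation to a depth-prediction network) can be sketched as follows. This is a minimal illustrative example, not the paper's actual implementation: the sweep parameters, echo model, and spectrogram settings here are assumptions chosen for demonstration.

```python
import numpy as np
from scipy.signal import chirp, spectrogram

# Hypothetical chirp design; the actual sweep used for BatVision may differ.
fs = 44100          # sample rate (Hz)
duration = 0.003    # 3 ms sweep
t = np.linspace(0, duration, int(fs * duration), endpoint=False)

# Linear frequency sweep ("chirp") emitted by the robot's speaker.
emitted = chirp(t, f0=20000, f1=100, t1=duration, method="linear")

# Toy stand-in for a real microphone capture: the direct chirp plus one
# delayed, attenuated echo from a surface ~0.85 m away.
delay = int(0.005 * fs)                      # ~5 ms round-trip delay
recording = np.zeros(int(0.05 * fs))
recording[:emitted.size] += emitted
recording[delay:delay + emitted.size] += 0.3 * emitted

# Spectrogram of the recording: a common input representation for
# audio-to-depth networks (one per ear channel in the binaural case).
freqs, times, S = spectrogram(recording, fs=fs, nperseg=256, noverlap=128)
print(S.shape)  # (frequency bins, time frames)
```

In practice the left and right ear channels are processed separately, so interaural time and level differences between the two spectrograms carry the spatial cues the network learns from.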