- The paper presents BatVision, a system that uses binaural audio and an encoder-decoder neural network with adversarial training to predict depth maps and grayscale images of spatial layouts.
- Results demonstrate feasible reconstruction of indoor structures from sound alone, with adversarial training improving accuracy and fidelity over baseline methods.
- This bat-inspired approach offers potential for robotic navigation in low-light environments and embedded systems, despite limitations caused by complex acoustics and material properties.
BatVision: Learning to See 3D Spatial Layout with Two Ears
The paper "BatVision: Learning to See 3D Spatial Layout with Two Ears" presents an approach to machine perception, inspired by bats' echolocation abilities, that relies on low-cost consumer-grade hardware. The primary objective is to reconstruct visual scenes from binaural audio signals, offering potential applications in environments with challenging lighting conditions.
Methodology
The BatVision system uses two microphones set into artificial human ears to receive echoes of chirps emitted by a speaker. During training, a stereo camera captures ground-truth images, from which the model learns to predict depth maps and grayscale images from audio data alone. The core of the system is an encoder-decoder network trained alongside an adversarial discriminator that sharpens the predicted visual output. Multiple audio input encodings, such as raw waveforms and amplitude spectrograms, are evaluated with both UNet-style and direct upsampling generators.
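As a rough illustration of the spectrogram-based input pipeline, the sketch below converts a synthetic two-channel recording into a pair of amplitude spectrograms stacked as input channels. The chirp band, sample rate, and STFT parameters are illustrative assumptions, not values from the paper:

```python
import numpy as np

def amplitude_spectrogram(signal, n_fft=256, hop=128):
    """Magnitude of a short-time Fourier transform for one audio channel."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1)).T  # (freq_bins, n_frames)

# Synthetic binaural recording: an emitted chirp on the left channel and a
# delayed, attenuated copy on the right (all parameters are invented).
fs = 44_100
t = np.arange(0, 0.05, 1 / fs)
chirp = np.sin(2 * np.pi * (1_000 + 20_000 * t) * t)
left = np.concatenate([chirp, np.zeros(2 * len(chirp))])
right = np.roll(left, 60) * 0.6  # interaural delay plus attenuation

# Stack both channels as a 2-channel "image" an encoder could consume.
spec = np.stack([amplitude_spectrogram(left),
                 amplitude_spectrogram(right)])
print(spec.shape)  # (channels, freq_bins, n_frames)
```

In a setup like this, the interaural delay and level difference between the two spectrogram channels carry the directional cues the network would need to infer spatial layout.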
Results
The research demonstrates that indoor spatial layouts can be reconstructed with substantial accuracy from sound alone. Predicted depth maps consistently identify structural elements such as walls, hallways, and major furnishings, although finer obstacle detail remains abstract. Quantitative measures indicate that the system performs well, with the various network configurations benchmarked against trivial baseline reconstructions. Adding an adversarial discriminator significantly refines output quality by enforcing realistic patch-level fidelity.
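The summary does not give the exact form of the adversarial loss, but patch-level discrimination generally means the discriminator scores a grid of receptive-field patches rather than the whole image. A minimal sketch of that idea, with the grid size and function names invented for illustration:

```python
import numpy as np

def patch_adversarial_loss(logits, real):
    """Binary cross-entropy averaged over a grid of per-patch logits.

    Each logit judges one local patch of the generated depth map as real
    or fake, so the loss rewards realism everywhere in the image rather
    than in a single global score."""
    target = np.ones_like(logits) if real else np.zeros_like(logits)
    probs = 1.0 / (1.0 + np.exp(-logits))
    eps = 1e-7
    bce = -(target * np.log(probs + eps)
            + (1 - target) * np.log(1 - probs + eps))
    return bce.mean()

# A hypothetical 8x8 grid of patch logits from a discriminator.
rng = np.random.default_rng(0)
logits = rng.normal(size=(8, 8))
gen_loss = patch_adversarial_loss(logits, real=True)    # generator wants "real"
disc_loss = patch_adversarial_loss(logits, real=False)  # discriminator flags fakes
```

Because every patch is penalized independently, a generator cannot hide blurry regions behind an otherwise plausible global structure, which matches the reported gain in output fidelity.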
Limitations
Several constraints inherent to sound propagation adversely affect reconstruction quality. Material damping, cluttered environments with dense obstacles, and corners that scatter sound present notable challenges. Short-range interactions often yield complicated echo profiles because multiple propagation paths converge, demanding further refinement in how such scenarios are handled algorithmically.
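The multi-path convergence problem can be shown with a toy simulation: when round-trip delays from nearby surfaces differ by less than the chirp's duration, the returning echoes overlap instead of arriving as separable pulses. The distances, gains, and chirp parameters below are invented for illustration:

```python
import numpy as np

def multipath_echo(chirp, paths, fs=44_100):
    """Superimpose delayed, attenuated copies of an emitted chirp.

    `paths` is a list of (distance_m, gain) pairs; each surface returns
    the chirp after a round-trip delay of 2 * distance / speed_of_sound."""
    c = 343.0  # speed of sound in air, m/s
    max_delay = int(fs * 2 * max(d for d, _ in paths) / c)
    out = np.zeros(len(chirp) + max_delay + 1)
    for dist, gain in paths:
        delay = int(fs * 2 * dist / c)
        out[delay:delay + len(chirp)] += gain * chirp
    return out

fs = 44_100
t = np.arange(0, 0.003, 1 / fs)          # 3 ms chirp
chirp = np.sin(2 * np.pi * (2_000 + 1e6 * t) * t)

# Three surfaces at short range: their round-trip delays differ by less
# than the chirp duration, so the echoes overlap in the recording.
echo = multipath_echo(chirp, [(0.3, 0.8), (0.45, 0.5), (0.5, 0.4)], fs)
```

Disentangling such superimposed returns is exactly the kind of ambiguity the network must learn to resolve, and it grows harder as ranges shrink and reflectors multiply.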
Implications and Future Work
The implications for robotics and machine vision are significant, particularly in navigation systems where BatVision might supplement vision sensors in low-light conditions, providing an alternative perceptual mechanism. The approach, characterized by its simplicity and deployability, suggests promising directions for embedded systems requiring real-time spatial awareness. Future developments could explore improved handling of complex acoustic environments, remove reliance on a synchronized stereo camera during training, and investigate expanded application scopes including autonomous vehicles and assistive technologies for the visually impaired.
BatVision underscores the potential in adapting echolocation principles to artificial systems, setting the stage for further exploration in sound-based spatial perception technologies. The availability of the authors' code further encourages community involvement in advancing this field. Such systems not only provide an intriguing cross-domain translation solution but also challenge conventional paradigms in understanding and mimicking biological sensory capabilities.