The Sound of Water: Inferring Physical Properties from Pouring Liquids (2411.11222v2)

Published 18 Nov 2024 in cs.CV, cs.MM, cs.SD, and eess.AS

Abstract: We study the connection between audio-visual observations and the underlying physics of a mundane yet intriguing everyday activity: pouring liquids. Given only the sound of liquid pouring into a container, our objective is to automatically infer physical properties such as the liquid level, the shape and size of the container, the pouring rate and the time to fill. To this end, we: (i) show in theory that these properties can be determined from the fundamental frequency (pitch); (ii) train a pitch detection model with supervision from simulated data and visual data with a physics-inspired objective; (iii) introduce a new large dataset of real pouring videos for a systematic study; (iv) show that the trained model can indeed infer these physical properties for real data; and finally, (v) we demonstrate strong generalization to various container shapes, other datasets, and in-the-wild YouTube videos. Our work presents a keen understanding of a narrow yet rich problem at the intersection of acoustics, physics, and learning. It opens up applications to enhance multisensory perception in robotic pouring.

Summary

The paper establishes a theoretical framework that correlates pouring sound pitch with the liquid-container system’s physical attributes.
It introduces a wav2vec2-based neural network model, enhanced by visual co-supervision from DINO-based video analysis, to detect and predict dynamic properties.
Experiments, including tests on the LiquidPouring-50 dataset, demonstrate accurate estimation of parameters like air column height and volume flow rate.

Inferring Physical Properties from the Acoustic Signatures of Liquid Pouring

The paper "The Sound of Water: Inferring Physical Properties from Pouring Liquids" investigates how much of the physical state of a liquid-container system can be inferred from the sound of pouring alone. The authors propose a framework that predicts both static properties (the shape and size of the container) and dynamic properties (the liquid level, the pouring rate, and the time to fill) by analyzing the fundamental frequency, or pitch, of the sound produced during pouring.

Key Contributions and Methodology

Theoretical Underpinning: The paper shows in theory that the fundamental frequency (pitch) of the pouring sound is determined by the physical properties of the liquid-container system. The pitch depends on static characteristics of the container, such as its shape and size, and the way it changes over time reveals dynamic quantities such as the flow rate of the liquid and the time to fill the container.
Pitch Detection Model: A neural network based primarily on the wav2vec2 architecture detects the pitch in audio recordings of liquid pouring. The model is pre-trained on synthetic data generated with a differentiable digital signal processing framework, which prepares it for pitch detection on real recordings.
Visual Co-Supervision: To bridge the simulation-to-reality gap, the paper adds a training phase with co-supervision from video data. A video network with a DINO-based architecture, pre-trained on visual data, supplies additional supervisory signals at pixel scale, and the audio model is then fine-tuned in a scale-aware manner.
Extensive Dataset Creation: A significant contribution is the creation and release of a new dataset, LiquidPouring-50, consisting of 805 audio-visual recordings across 50 distinct containers. This dataset supports a controlled, systematic study of the pitch of pouring sounds.
Validation Across Diverse Scenarios: The model generalizes to unseen container shapes and materials. Tests show that it can predict pitch and other physical properties beyond the experimental setting, including on other datasets, in-the-wild videos, and differing container geometries.
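The link between pitch and geometry in the first contribution can be illustrated with the textbook quarter-wavelength resonator: an air column that is closed at one end (the liquid surface) and open at the other. The sketch below is a minimal idealization for a cylindrical container, not the paper's implementation. The function names, the speed of sound (343 m/s), and the end-correction factor (0.62, close to the textbook value of about 0.6 times the radius) are assumptions chosen for illustration.

```python
# Idealized quarter-wavelength resonator for a cylindrical container.
# All constants and names here are illustrative assumptions, not values
# taken from the paper.

SPEED_OF_SOUND_M_S = 343.0  # assumed: air at roughly 20 degrees Celsius
END_CORRECTION = 0.62       # assumed: open-end correction as a fraction of the radius


def pitch_from_air_column(length_m, radius_m,
                          c=SPEED_OF_SOUND_M_S, beta=END_CORRECTION):
    """Fundamental frequency (Hz) of an air column closed at one end."""
    effective_length = length_m + beta * radius_m
    if effective_length <= 0:
        raise ValueError("effective air column length must be positive")
    return c / (4.0 * effective_length)


def air_column_from_pitch(f0_hz, radius_m,
                          c=SPEED_OF_SOUND_M_S, beta=END_CORRECTION):
    """Invert the relation: air column length (m) from a detected pitch (Hz)."""
    if f0_hz <= 0:
        raise ValueError("pitch must be positive")
    return c / (4.0 * f0_hz) - beta * radius_m


if __name__ == "__main__":
    radius_m = 0.03
    # As the container fills, the air column shortens and the pitch rises.
    for length_cm in (15, 10, 5, 1):
        f0 = pitch_from_air_column(length_cm / 100.0, radius_m)
        print(f"air column {length_cm:>2} cm -> pitch {f0:7.1f} Hz")
```

Under this idealization the pitch rises as the container fills, which matches the familiar rising tone of a glass being filled. Inverting the relation needs the radius as well as the pitch, so a single pitch value cannot recover metric scale without extra information about the container.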

Experimental Results and Implications

The model's efficacy is illustrated by its accuracy in predicting the air column height, with a mean absolute error of 0.60 cm in cylindrical containers. It is also robust in estimating dynamic properties such as volume flow rate and time to fill, including for containers outside its initial training configuration.
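Once the air column height is known over time, the dynamic properties follow from simple geometry for a cylinder. The sketch below assumes a constant pouring rate and a cylindrical container of known radius, fits a straight line to the air column height, and reads off the volume flow rate and the remaining time to fill. It is an illustration under those assumptions, not the paper's procedure, and the function names are hypothetical.

```python
import math


def fit_line(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def flow_rate_and_time_to_fill(times_s, air_column_m, radius_m):
    """Estimate volume flow rate (m^3/s) and remaining time to fill (s).

    Assumes a cylinder of the given radius and a constant pouring rate,
    so the air column height shrinks linearly with time.
    """
    slope, intercept = fit_line(times_s, air_column_m)
    if slope >= 0:
        raise ValueError("air column must shrink over time while pouring")
    # The liquid rises at -slope (m/s); multiply by the cross-section area.
    flow_rate = -slope * math.pi * radius_m ** 2
    # The fitted line reaches zero height at t = -intercept / slope.
    time_to_fill = -intercept / slope - times_s[-1]
    return flow_rate, time_to_fill


if __name__ == "__main__":
    times = [0.5 * i for i in range(11)]       # 0 s to 5 s
    heights = [0.20 - 0.02 * t for t in times]  # synthetic: 2 cm/s fill speed
    q, remaining = flow_rate_and_time_to_fill(times, heights, radius_m=0.03)
    print(f"flow rate: {q * 1e6:.1f} mL/s, time to fill: {remaining:.1f} s")
```

The line fit is a deliberate simplification: for non-cylindrical containers the cross-section changes with height, so the air column no longer shrinks linearly even at a constant pouring rate.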

The findings have implications for robotic perception, particularly in tasks requiring multisensory integration where visual cues may be occluded or unreliable. One example is autonomous pouring, where a robot must adapt in real time to a wide range of container shapes and materials.

Future Directions

The paper opens several avenues for further exploration, including extending the framework to more complex liquid dynamics and applying it to other everyday tasks with rich physical interactions. Integrating additional sensory modalities, such as haptics, is another appealing prospect. Modeling how audio signals evolve over time could also help autonomous systems infer a broader range of interactions with their environment.

In conclusion, this paper contributes to the intersection of acoustics, physics, and machine learning by demonstrating that the fundamental frequency of a pouring sound carries enough information to estimate physical properties. Its interdisciplinary approach and comprehensive experiments make a compelling case for sound as a reliable signal for physical property estimation, with relevance to robotics and multisensory learning.