Physics-Driven Diffusion Models for Impact Sound Synthesis from Videos (2303.16897v3)
Abstract: Modeling the sounds emitted by physical object interactions is critical for immersive perceptual experiences in real and virtual worlds. Traditional methods of impact sound synthesis use physics simulation to obtain a set of physics parameters that can represent and synthesize the sound. However, they require fine details of both the object geometries and the impact locations, which are rarely available in the real world, so these methods cannot be applied to synthesizing impact sounds from ordinary videos. Existing video-driven deep learning approaches, on the other hand, capture only a weak correspondence between visual content and impact sounds because they lack physics knowledge. In this work, we propose a physics-driven diffusion model that can synthesize high-fidelity impact sounds for a silent video clip. In addition to the video content, we use physics priors to guide the sound synthesis procedure. These priors comprise physics parameters estimated directly from noisy real-world impact sound examples, without a sophisticated setup, and learned residual parameters that model the sound environment via neural networks. We further design a diffusion model with dedicated training and inference strategies that combines the physics priors with visual information for impact sound synthesis. Experimental results show that our model outperforms several existing systems in generating realistic impact sounds. More importantly, the physics-based representations are fully interpretable and transparent, enabling flexible sound editing.
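As background for what the "physics parameters" of an impact sound are, below is a minimal NumPy sketch of linear modal synthesis, the classical physics model for impact sounds, in which an impact excites a set of vibration modes and each mode is a damped sinusoid characterized by a frequency, a damping coefficient, and a gain. The function name and all numeric values are illustrative assumptions, not parameters from the paper.

```python
import numpy as np

def modal_impact_sound(freqs, dampings, gains, sr=16000, dur=1.0):
    """Render sum_i a_i * exp(-d_i * t) * sin(2 * pi * f_i * t),
    the standard linear modal model of an impact sound."""
    t = np.arange(int(sr * dur)) / sr
    sound = np.zeros_like(t)
    for f, d, a in zip(freqs, dampings, gains):
        sound += a * np.exp(-d * t) * np.sin(2 * np.pi * f * t)
    return sound / (np.max(np.abs(sound)) + 1e-8)  # peak-normalize

# Toy parameters for a vaguely "metallic" tap (made up for illustration):
y = modal_impact_sound(freqs=[523.0, 1247.0, 2801.0],
                       dampings=[8.0, 15.0, 30.0],
                       gains=[1.0, 0.6, 0.3])
```

Estimating such (frequency, damping, gain) triples from noisy recordings is what yields the interpretable representation the abstract refers to; editing a damping or gain value directly edits the perceived material of the sound.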
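To make the conditioning idea concrete, here is a minimal PyTorch sketch of a DDPM-style training step (Ho et al., 2020) in which the noise predictor is conditioned on both a physics-prior embedding and a video embedding. The toy MLP denoiser, all tensor dimensions, and the conditioning scheme are illustrative assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class ConditionalDenoiser(nn.Module):
    """Predicts the noise added to spectrogram frames, conditioned on
    physics-prior and video embeddings (hypothetical shapes)."""
    def __init__(self, spec_dim=80, cond_dim=128, hidden=256):
        super().__init__()
        self.cond_proj = nn.Linear(2 * cond_dim, hidden)
        self.net = nn.Sequential(
            nn.Linear(spec_dim + hidden + 1, hidden),
            nn.SiLU(),
            nn.Linear(hidden, spec_dim),
        )

    def forward(self, x_t, t, physics_emb, video_emb):
        # Fuse the two conditioning signals, then feed them to the denoiser
        # together with the noisy input and a normalized timestep feature.
        cond = self.cond_proj(torch.cat([physics_emb, video_emb], dim=-1))
        t_feat = t.float().unsqueeze(-1) / 1000.0
        return self.net(torch.cat([x_t, cond, t_feat], dim=-1))

# Standard DDPM forward-noising schedule:
# x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alpha_bars = torch.cumprod(1.0 - betas, dim=0)

model = ConditionalDenoiser()
x0 = torch.randn(4, 80)        # toy "spectrogram frames"
physics = torch.randn(4, 128)  # physics-prior embedding (placeholder)
video = torch.randn(4, 128)    # video-content embedding (placeholder)
t = torch.randint(0, T, (4,))
eps = torch.randn_like(x0)
a = alpha_bars[t].unsqueeze(-1)
x_t = a.sqrt() * x0 + (1 - a).sqrt() * eps

# Simple epsilon-prediction loss on the conditioned denoiser.
loss = ((model(x_t, t, physics, video) - eps) ** 2).mean()
loss.backward()
```

At inference, the same conditioning embeddings would steer the reverse diffusion process, so the generated sound reflects both the estimated physics parameters and the visual content.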