DiffSal: Joint Audio and Video Learning for Diffusion Saliency Prediction (2403.01226v1)
Abstract: Audio-visual saliency prediction can draw support from complementary modalities, but further performance gains are still hindered by customized architectures and task-specific loss functions. In recent studies, denoising diffusion models have shown promise in unifying task frameworks owing to their inherent generalization ability. Following this motivation, a novel Diffusion architecture for generalized audio-visual Saliency prediction (DiffSal) is proposed in this work, which formulates the prediction problem as a conditional generative task over the saliency map, using the input audio and video as conditions. Based on the spatio-temporal audio-visual features, an additional network, Saliency-UNet, is designed to perform multi-modal attention modulation for progressive refinement of the ground-truth saliency map from a noisy map. Extensive experiments demonstrate that the proposed DiffSal achieves excellent performance across six challenging audio-visual benchmarks, with an average relative improvement of 6.3% over previous state-of-the-art results across six metrics.
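The conditional generative formulation the abstract describes follows the standard denoising-diffusion recipe: noise the ground-truth saliency map with the forward process, then train a network (here the paper's Saliency-UNet) to recover it, conditioned on audio and video features. The sketch below illustrates that training objective only; `saliency_unet_stub` and the feature shapes are hypothetical placeholders, not the paper's actual architecture, and the schedule follows the common DDPM linear-beta convention rather than anything DiffSal-specific.

```python
import numpy as np

def make_alpha_bar(T=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative noise schedule (standard DDPM linear betas, assumed here)."""
    betas = np.linspace(beta_start, beta_end, T)
    return np.cumprod(1.0 - betas)

def q_sample(x0, t, alpha_bar, rng):
    """Forward process: noise the clean saliency map x0 to timestep t."""
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return xt, eps

def saliency_unet_stub(xt, t, video_feat, audio_feat):
    """Hypothetical stand-in for Saliency-UNet: a real model would fuse the
    audio/video conditions via multi-modal attention; this stub only has the
    right interface and returns a zero noise estimate."""
    return np.zeros_like(xt)

rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar()

x0 = rng.random((1, 64, 64))                   # ground-truth saliency map
video_feat = rng.standard_normal((1, 8, 512))  # placeholder spatio-temporal video features
audio_feat = rng.standard_normal((1, 512))     # placeholder audio features

t = 500
xt, eps = q_sample(x0, t, alpha_bar, rng)
eps_hat = saliency_unet_stub(xt, t, video_feat, audio_feat)
loss = np.mean((eps - eps_hat) ** 2)           # epsilon-prediction training objective
```

At inference, the same conditioned network would be applied iteratively, starting from pure noise and progressively refining it into a saliency map, which is the "progressive refinement from a noisy map" the abstract refers to.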