DiffSal: Joint Audio and Video Learning for Diffusion Saliency Prediction (2403.01226v1)

Published 2 Mar 2024 in cs.CV

Abstract: Audio-visual saliency prediction can draw support from complementary modalities, but further performance gains are still hindered by customized architectures as well as task-specific loss functions. In recent studies, denoising diffusion models have shown greater promise in unifying task frameworks owing to their inherent generalization ability. Following this motivation, a novel Diffusion architecture for generalized audio-visual Saliency prediction (DiffSal) is proposed in this work, which formulates the prediction problem as a conditional generative task of the saliency map by utilizing input audio and video as the conditions. Based on the spatio-temporal audio-visual features, an extra network Saliency-UNet is designed to perform multi-modal attention modulation for progressive refinement of the ground-truth saliency map from the noisy map. Extensive experiments demonstrate that the proposed DiffSal achieves excellent performance across six challenging audio-visual benchmarks, with an average relative improvement of 6.3% over the previous state-of-the-art results on six metrics.
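The abstract frames saliency prediction as conditional denoising diffusion: starting from a noisy map, a network progressively refines it toward the saliency map, conditioned on audio-visual features. As a rough illustration only, the sketch below implements a standard DDPM-style reverse sampling loop in NumPy; the linear noise schedule, the step count `T`, and the `toy_denoiser` stand-in are assumptions for demonstration and are not the paper's Saliency-UNet or its multi-modal attention modulation.

```python
import numpy as np

T = 50                                   # number of diffusion steps (assumption)
betas = np.linspace(1e-4, 0.02, T)       # linear noise schedule (assumption)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def toy_denoiser(x_t, t, av_features):
    """Hypothetical stand-in for Saliency-UNet: predicts the noise component
    of x_t given the conditioning audio-visual feature map."""
    return x_t - av_features             # crude placeholder, not the real model

def ddpm_sample(av_features, shape, rng):
    """Reverse diffusion: start from pure noise and progressively denoise,
    conditioned on av_features, following the standard DDPM update."""
    x_t = rng.standard_normal(shape)
    for t in reversed(range(T)):
        eps_hat = toy_denoiser(x_t, t, av_features)
        # DDPM posterior mean: x_{t-1} = (x_t - beta_t/sqrt(1-abar_t) * eps) / sqrt(alpha_t)
        coef = betas[t] / np.sqrt(1.0 - alpha_bars[t])
        x_t = (x_t - coef * eps_hat) / np.sqrt(alphas[t])
        if t > 0:                        # no noise is added at the final step
            x_t = x_t + np.sqrt(betas[t]) * rng.standard_normal(shape)
    return x_t
```

In the actual method, the denoiser would be the Saliency-UNet operating on spatio-temporal video and audio features, and the sampled map would be the predicted saliency map; the loop structure above only conveys the generic conditional-diffusion formulation.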

