Multimodal Variational Auto-encoder based Audio-Visual Segmentation (2310.08303v1)

Published 12 Oct 2023 in cs.CV, cs.SD, and eess.AS

Abstract: We propose an Explicit Conditional Multimodal Variational Auto-Encoder (ECMVAE) for audio-visual segmentation (AVS), aiming to segment sound sources in the video sequence. Existing AVS methods focus on implicit feature fusion strategies, where models are trained to fit the discrete samples in the dataset. With a limited and less diverse dataset, the resulting performance is usually unsatisfactory. In contrast, we address this problem from an effective representation learning perspective, aiming to model the contribution of each modality explicitly. Specifically, we find that audio contains critical category information of the sound producers, and visual data provides candidate sound producer(s). Their shared information corresponds to the target sound producer(s) shown in the visual data. In this case, cross-modal shared representation learning is especially important for AVS. To achieve this, our ECMVAE factorizes the representations of each modality with a modality-shared representation and a modality-specific representation. An orthogonality constraint is applied between the shared and specific representations to maintain the exclusive attribute of the factorized latent code. Further, a mutual information maximization regularizer is introduced to achieve extensive exploration of each modality. Quantitative and qualitative evaluations on the AVSBench demonstrate the effectiveness of our approach, leading to a new state-of-the-art for AVS, with a 3.84 mIoU performance leap on the challenging MS3 subset for multiple sound source segmentation.
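The orthogonality constraint described in the abstract, which keeps the modality-shared and modality-specific latent codes from encoding overlapping information, can be illustrated with a minimal sketch. This is not the paper's implementation; it assumes hypothetical latent codes of shape `(batch, dim)` and penalizes their squared cosine similarity, which is zero exactly when the two codes are orthogonal:

```python
import numpy as np

def orthogonality_loss(shared, specific, eps=1e-8):
    """Penalize overlap between modality-shared and modality-specific codes.

    shared, specific: (batch, dim) latent codes. Returns the mean squared
    cosine similarity across the batch: 0 when the codes are orthogonal,
    1 when they are identical up to scale.
    """
    s = shared / (np.linalg.norm(shared, axis=1, keepdims=True) + eps)
    p = specific / (np.linalg.norm(specific, axis=1, keepdims=True) + eps)
    cos = np.sum(s * p, axis=1)  # per-sample cosine similarity
    return float(np.mean(cos ** 2))

# Orthogonal codes incur (near-)zero penalty; collinear codes are penalized.
a = np.array([[1.0, 0.0]])
b = np.array([[0.0, 1.0]])
print(orthogonality_loss(a, b))  # near 0
print(orthogonality_loss(a, a))  # near 1
```

In training, a term like this would be added to the VAE objective alongside the reconstruction loss, KL terms, and the mutual-information regularizer the abstract mentions.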
