On Uni-Modal Feature Learning in Supervised Multi-Modal Learning (2305.01233v3)

Published 2 May 2023 in cs.CV and cs.MM

Abstract: We abstract the features (i.e., learned representations) of multi-modal data into 1) uni-modal features, which can be learned from uni-modal training, and 2) paired features, which can only be learned from cross-modal interactions. Multi-modal models are expected to benefit from cross-modal interactions on the basis of ensuring uni-modal feature learning. However, recent supervised multi-modal late-fusion training approaches still suffer from insufficient learning of uni-modal features on each modality. We prove that this phenomenon hurts the model's generalization ability. To address this, we propose choosing a targeted late-fusion learning method for the given supervised multi-modal task, from Uni-Modal Ensemble (UME) and the proposed Uni-Modal Teacher (UMT), according to the distribution of uni-modal and paired features. We demonstrate that, under a simple guiding strategy, we can achieve results comparable to other complex late-fusion or intermediate-fusion methods on various multi-modal datasets, including VGG-Sound, Kinetics-400, UCF101, and ModelNet40.
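The two late-fusion strategies named in the abstract can be illustrated with a brief sketch. The following is a minimal, hypothetical Python/PyTorch example, not the authors' released code: it assumes a simple two-branch classifier, an additive late-fusion head, and an unweighted MSE feature-distillation term for UMT; the architectures, loss weights, and tensor shapes are placeholders rather than the paper's actual configuration.

```python
# Illustrative sketch (assumptions, not the paper's implementation):
# UME = train each uni-modal branch separately, then ensemble predictions;
# UMT = train a late-fusion model jointly while distilling features from
#       frozen uni-modal teachers into the corresponding student encoders.
import torch
import torch.nn as nn
import torch.nn.functional as F

class UniModalBranch(nn.Module):
    """One encoder plus classifier head for a single modality."""
    def __init__(self, in_dim, feat_dim, num_classes):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, feat_dim), nn.ReLU())
        self.head = nn.Linear(feat_dim, num_classes)

    def forward(self, x):
        feat = self.encoder(x)
        return feat, self.head(feat)

def ume_predict(branch_a, branch_b, x_a, x_b):
    """Uni-Modal Ensemble: average the logits of independently trained branches."""
    _, logits_a = branch_a(x_a)
    _, logits_b = branch_b(x_b)
    return (logits_a + logits_b) / 2

def umt_loss(student_a, student_b, teacher_a, teacher_b, x_a, x_b, y,
             distill_weight=1.0):
    """Uni-Modal Teacher: task loss on the fused prediction plus a
    feature-distillation loss from frozen uni-modal teachers (weight is a
    hypothetical hyperparameter)."""
    feat_a, logits_a = student_a(x_a)
    feat_b, logits_b = student_b(x_b)
    fused_logits = logits_a + logits_b            # additive late fusion
    task_loss = F.cross_entropy(fused_logits, y)
    with torch.no_grad():                         # teachers are frozen
        t_feat_a, _ = teacher_a(x_a)
        t_feat_b, _ = teacher_b(x_b)
    distill_loss = F.mse_loss(feat_a, t_feat_a) + F.mse_loss(feat_b, t_feat_b)
    return task_loss + distill_weight * distill_loss

if __name__ == "__main__":
    # Toy usage: 32- and 48-dim inputs, 64-dim features, 10 classes, batch of 4.
    x_a, x_b = torch.randn(4, 32), torch.randn(4, 48)
    y = torch.randint(0, 10, (4,))
    student_a, student_b = UniModalBranch(32, 64, 10), UniModalBranch(48, 64, 10)
    teacher_a, teacher_b = UniModalBranch(32, 64, 10), UniModalBranch(48, 64, 10)
    print(ume_predict(student_a, student_b, x_a, x_b).shape)  # torch.Size([4, 10])
    print(umt_loss(student_a, student_b, teacher_a, teacher_b, x_a, x_b, y))
```

The intent is only to show where the two strategies differ: UME never trains the branches jointly, while UMT keeps joint late-fusion training but anchors each branch's features to a uni-modal teacher.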

Authors (8)
  1. Chenzhuang Du (10 papers)
  2. Jiaye Teng (13 papers)
  3. Tingle Li (14 papers)
  4. Yichen Liu (54 papers)
  5. Tianyuan Yuan (7 papers)
  6. Yue Wang (675 papers)
  7. Yang Yuan (52 papers)
  8. Hang Zhao (156 papers)
Citations (28)