On Uni-Modal Feature Learning in Supervised Multi-Modal Learning (2305.01233v3)
Abstract: We abstract the features (i.e., learned representations) of multi-modal data into 1) uni-modal features, which can be learned from uni-modal training, and 2) paired features, which can only be learned from cross-modal interactions. Multi-modal models are expected to benefit from cross-modal interactions on top of sufficient uni-modal feature learning. However, recent supervised multi-modal late-fusion training approaches still suffer from insufficient learning of uni-modal features on each modality. We prove that this phenomenon indeed hurts the model's generalization ability. To address this, we propose to choose a targeted late-fusion learning method for a given supervised multi-modal task, selecting between Uni-Modal Ensemble (UME) and the proposed Uni-Modal Teacher (UMT) according to the distribution of uni-modal and paired features. We demonstrate that, under a simple guiding strategy, we can achieve results comparable to other complex late-fusion or intermediate-fusion methods on various multi-modal datasets, including VGG-Sound, Kinetics-400, UCF101, and ModelNet40.
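To make the two strategies named in the abstract concrete, here is a minimal PyTorch-style sketch of a two-modality late-fusion setup. The encoder modules, the summation-based fusion, the MSE feature-distillation term, and the weight `lam` are all assumptions for illustration, not the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LateFusionModel(nn.Module):
    """Two-modality late-fusion classifier: one encoder per modality,
    with logits fused by summation (one common late-fusion choice)."""
    def __init__(self, enc_a, enc_v, feat_dim, num_classes):
        super().__init__()
        self.enc_a, self.enc_v = enc_a, enc_v            # modality encoders
        self.head_a = nn.Linear(feat_dim, num_classes)
        self.head_v = nn.Linear(feat_dim, num_classes)

    def forward(self, x_a, x_v):
        f_a, f_v = self.enc_a(x_a), self.enc_v(x_v)      # uni-modal features
        return self.head_a(f_a) + self.head_v(f_v), (f_a, f_v)

def ume_logits(model_a, model_v, x_a, x_v):
    """Uni-Modal Ensemble (UME): each uni-modal model is trained
    separately; at inference their logits are simply averaged."""
    return (model_a(x_a) + model_v(x_v)) / 2

def umt_loss(model, teacher_a, teacher_v, x_a, x_v, y, lam=1.0):
    """Uni-Modal Teacher (UMT): train the fused model with the task loss
    plus a distillation term pulling each encoder's features toward those
    of a frozen, pre-trained uni-modal teacher. `lam` is a hypothetical
    weight, not a value from the paper."""
    logits, (f_a, f_v) = model(x_a, x_v)
    with torch.no_grad():                                # teachers stay frozen
        t_a, t_v = teacher_a(x_a), teacher_v(x_v)
    distill = F.mse_loss(f_a, t_a) + F.mse_loss(f_v, t_v)
    return F.cross_entropy(logits, y) + lam * distill
```

In this sketch, UME needs no joint training at all, while UMT keeps a single fused model but regularizes each encoder toward its uni-modal teacher, directly targeting the under-learned uni-modal features; the rule for choosing between the two follows the paper's guiding strategy based on the distribution of uni-modal versus paired features.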
Authors: Chenzhuang Du, Jiaye Teng, Tingle Li, Yichen Liu, Tianyuan Yuan, Yue Wang, Yang Yuan, Hang Zhao