What Makes for Robust Multi-Modal Models in the Face of Missing Modalities? (2310.06383v1)

Published 10 Oct 2023 in cs.AI

Abstract: With the growing success of multi-modal learning, research on the robustness of multi-modal models, especially when facing situations with missing modalities, is receiving increased attention. Nevertheless, previous studies in this domain exhibit certain limitations, as they often lack theoretical insights or their methodologies are tied to specific network architectures or modalities. We model the scenarios in which multi-modal models encounter missing modalities from an information-theoretic perspective and show that the performance ceiling in such scenarios can be approached by efficiently utilizing the information inherent in the non-missing modalities. In practice, there are two key aspects: (1) the encoder should be able to extract sufficiently good features from the non-missing modality; (2) the extracted features should be robust enough not to be influenced by noise during the fusion process across modalities. To this end, we introduce Uni-Modal Ensemble with Missing Modality Adaptation (UME-MMA). UME-MMA employs uni-modal pre-trained weights for the multi-modal model to enhance feature extraction and uses missing-modality data augmentation to better adapt to situations with missing modalities. In addition, UME-MMA is built on a late-fusion learning framework that allows plug-and-play use of various encoders, making it suitable for a wide range of modalities and enabling seamless integration of large-scale pre-trained encoders to further enhance performance. We demonstrate UME-MMA's effectiveness on audio-visual datasets (e.g., AV-MNIST, Kinetics-Sound, AVE) and vision-language datasets (e.g., MM-IMDB, UPMC Food101).
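
The "performance ceiling" the abstract alludes to can be made concrete with a standard information-theoretic bound. As one illustrative gloss (our framing, not necessarily the exact bound the paper derives), Fano's inequality lower-bounds the error of any predictor of label Y that observes only the non-missing modality X_obs (entropies in bits):

    P_e \;\ge\; \frac{H(Y \mid X_{\mathrm{obs}}) - 1}{\log_2 |\mathcal{Y}|}

So the best achievable accuracy under a missing modality is governed by how much label information the surviving modality carries, and a model approaches that ceiling only by extracting this information fully, which motivates the paper's two requirements on the encoder and the fusion step.

On the method side, the abstract names two ingredients: a late-fusion uni-modal ensemble (per-modality encoders and heads) and missing-modality data augmentation. Below is a minimal PyTorch sketch of those two ideas; the placeholder encoders, the averaged-logit fusion rule, the feat_dim/num_classes arguments, and the drop probability p_drop are our own illustrative assumptions, not the paper's reported configuration.

```python
import torch
import torch.nn as nn


class LateFusionUME(nn.Module):
    """Minimal late-fusion sketch in the spirit of UME-MMA.

    Each modality gets its own encoder and classifier head (a
    uni-modal ensemble) and the heads' logits are averaged. The
    paper plugs in uni-modal pre-trained encoders here; these
    arguments are placeholders.
    """

    def __init__(self, enc_a: nn.Module, enc_v: nn.Module,
                 feat_dim: int, num_classes: int):
        super().__init__()
        self.enc_a, self.enc_v = enc_a, enc_v
        self.head_a = nn.Linear(feat_dim, num_classes)
        self.head_v = nn.Linear(feat_dim, num_classes)

    def forward(self, x_a, x_v, mask_a=None, mask_v=None):
        # mask_* are (B,) float tensors: 1 = modality present, 0 = missing.
        z_a = self.enc_a(x_a)
        z_v = self.enc_v(x_v)
        if mask_a is not None:
            z_a = z_a * mask_a[:, None]  # zero out missing audio features
        if mask_v is not None:
            z_v = z_v * mask_v[:, None]  # zero out missing visual features
        return 0.5 * (self.head_a(z_a) + self.head_v(z_v))


def missing_modality_masks(batch_size: int, p_drop: float = 0.3):
    """Missing-modality augmentation (assumed form): independently drop
    each modality with probability p_drop, but never drop both at once."""
    m_a = (torch.rand(batch_size) > p_drop).float()
    m_v = (torch.rand(batch_size) > p_drop).float()
    both_missing = (m_a + m_v) == 0
    m_a[both_missing] = 1.0  # keep at least one modality per sample
    return m_a, m_v
```

Averaging per-modality logits means each head can also be read out on its own, so zeroing one modality at training time directly teaches the surviving branch to carry the prediction; initializing each encoder from uni-modal pre-trained weights, as the paper does, is elided by the placeholders above.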

Authors (5)
  1. Siting Li
  2. Chenzhuang Du
  3. Yue Zhao
  4. Yu Huang
  5. Hang Zhao