Robust Multimodal Learning with Missing Modalities via Parameter-Efficient Adaptation (2310.03986v6)

Published 6 Oct 2023 in cs.CV and cs.LG

Abstract: Multimodal learning seeks to utilize data from multiple sources to improve the overall performance of downstream tasks. Ideally, redundancies in the data should make multimodal systems robust to missing or corrupted observations in some of the correlated modalities. However, we observe that the performance of several existing multimodal networks deteriorates significantly if one or more modalities are absent at test time. To enable robustness to missing modalities, we propose a simple and parameter-efficient adaptation procedure for pretrained multimodal networks. In particular, we exploit modulation of intermediate features to compensate for the missing modalities. We demonstrate that such adaptation can partially bridge the performance drop due to missing modalities and, in some cases, outperform independent, dedicated networks trained for the available modality combinations. The proposed adaptation requires an extremely small number of parameters (e.g., fewer than 1% of the total) and is applicable to a wide range of modality combinations and tasks. We conduct a series of experiments to highlight the missing-modality robustness of our method on five multimodal tasks across seven datasets. Our method demonstrates versatility across tasks and datasets and outperforms existing methods for robust multimodal learning with missing modalities.
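
The core mechanism described in the abstract is modulating intermediate features of a frozen pretrained network with a small number of learnable parameters. Below is a minimal PyTorch sketch of one plausible instantiation: a per-channel scale-and-shift adapter (in the spirit of SSF/FiLM-style modulation) inserted into a frozen backbone. The module, layer choices, and shapes are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class FeatureModulation(nn.Module):
    """Learnable per-channel scale and shift for an intermediate
    feature map (a sketch of SSF/FiLM-style modulation)."""
    def __init__(self, num_channels: int):
        super().__init__()
        # Identity initialization: the adapted network starts out
        # behaving exactly like the pretrained one.
        self.scale = nn.Parameter(torch.ones(num_channels))
        self.shift = nn.Parameter(torch.zeros(num_channels))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, ...); broadcast over trailing dims.
        shape = (1, -1) + (1,) * (x.dim() - 2)
        return x * self.scale.view(shape) + self.shift.view(shape)

# Freeze a (toy, stand-in) pretrained backbone; only the modulation
# parameters are trained during adaptation.
backbone = nn.Sequential(
    nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 64, 3, padding=1),
)
for p in backbone.parameters():
    p.requires_grad = False

mod = FeatureModulation(64)                       # tiny parameter count
feats = backbone[:2](torch.randn(2, 3, 32, 32))   # intermediate features
out = backbone[2](mod(feats))                     # modulated features fed onward
```

In a setup like this, one set of modulation parameters could be learned per missing-modality combination while the backbone stays frozen, which is consistent with the abstract's claim of adapting fewer than 1% of the total parameters.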

Authors (3)
  1. Md Kaykobad Reza
  2. Ashley Prater-Bennette
  3. M. Salman Asif
Citations (3)