
Jointly Modeling Inter- & Intra-Modality Dependencies for Multi-modal Learning (2405.17613v2)

Published 27 May 2024 in cs.CV, cs.CL, and cs.LG

Abstract: Supervised multi-modal learning involves mapping multiple modalities to a target label. Previous studies in this field have concentrated on capturing in isolation either the inter-modality dependencies (the relationships between different modalities and the label) or the intra-modality dependencies (the relationships within a single modality and the label). We argue that these conventional approaches that rely solely on either inter- or intra-modality dependencies may not be optimal in general. We view the multi-modal learning problem through the lens of generative models, where we consider the target as a source of multiple modalities and the interaction between them. To that end, we propose the inter- & intra-modality modeling (I2M2) framework, which captures and integrates both the inter- and intra-modality dependencies, leading to more accurate predictions. We evaluate our approach using real-world healthcare and vision-and-language datasets with state-of-the-art models, demonstrating superior performance over traditional methods focusing only on one type of modality dependency.


Summary

  • The paper introduces the I2M2 framework using a probabilistic generative model to concurrently capture both inter- and intra-modality dependencies for multi-modal learning.
  • Empirical evaluation across healthcare, vision, and language tasks shows that I2M2 matches or outperforms methods that model only one type of dependency, with notable accuracy gains on several benchmarks.
  • The I2M2 framework requires no prior knowledge of which dependencies dominate a given dataset, yielding a flexible approach that enhances robustness and generalization in multi-modal tasks.

Overview of "Jointly Modeling Inter- & Intra-Modality Dependencies for Multi-modal Learning"

This paper presents the inter- & intra-modality modeling (I2M2) framework for addressing the complexities of multi-modal learning. Traditional approaches in this domain have focused primarily on either inter-modality dependencies, which capture relationships between different modalities and their combined influence on the target label, or intra-modality dependencies, which capture relationships between a single modality and the label. This work challenges the adequacy of such isolated approaches and posits that a framework modeling both types of dependencies concurrently is essential for improving predictive performance across diverse applications.
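
To make the distinction concrete, here is a hedged sketch, in notation of our own choosing rather than the paper's, of how the two dependency types can enter the predictive distribution for two modalities $x_1, x_2$ and label $y$:

$$p(y \mid x_1, x_2) \;\propto\; \underbrace{p(y \mid x_1)\, p(y \mid x_2)}_{\text{intra-modality terms}} \;\times\; \underbrace{r(y; x_1, x_2)}_{\text{inter-modality interaction}}$$

Purely intra-modality approaches (e.g., late fusion of unimodal classifiers) keep only the first two factors, while purely inter-modality approaches model the left-hand side directly with a single joint classifier; I2M2 retains both.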

Key Contributions

  1. Unified Modeling Framework: The authors frame multi-modal learning with a probabilistic generative model in which the label generates each modality (giving rise to intra-modality dependencies) and a selection variable induces dependencies between the modalities (the inter-modality dependencies). This formulation acknowledges that the influence of individual modalities and of their interactions can vary significantly across datasets and tasks.
  2. Novel Methodology - I2M2: The paper advances the I2M2 framework, which concurrently models inter- and intra-modality dependencies by training a classifier for each modality alongside an additional classifier dedicated to capturing interactions between modalities. This ensemble approach supports effective learning regardless of the relative strength of inter- and intra-modality dependencies in a given dataset; a minimal sketch follows this list.
  3. Categorization of Existing Approaches: The framework provides a principled basis for categorizing existing multi-modal learning methodologies. Methods focusing primarily on inter-modality interactions are often less effective when faced with sparse cross-modal information. Conversely, those emphasizing intra-modality dependencies may miss critical cross-modal interactions.
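
Below is a minimal PyTorch sketch of such an ensemble, assuming two pre-extracted feature vectors and additive combination of logits; the module names, hidden sizes, and fusion-by-concatenation are illustrative assumptions, not details taken from the paper's implementation.

```python
import torch
import torch.nn as nn

class I2M2Head(nn.Module):
    """I2M2-style predictor: one classifier per modality plus an
    inter-modality classifier, with logits combined additively."""

    def __init__(self, d1: int, d2: int, num_classes: int):
        super().__init__()
        self.intra1 = nn.Linear(d1, num_classes)  # intra-modality classifier, modality 1
        self.intra2 = nn.Linear(d2, num_classes)  # intra-modality classifier, modality 2
        self.inter = nn.Sequential(               # inter-modality classifier on fused features
            nn.Linear(d1 + d2, 256),
            nn.ReLU(),
            nn.Linear(256, num_classes),
        )

    def forward(self, z1: torch.Tensor, z2: torch.Tensor) -> torch.Tensor:
        # Summing logits multiplies the corresponding unnormalized
        # likelihoods, so the intra- and inter-modality paths all
        # contribute to the final prediction.
        fused = torch.cat([z1, z2], dim=-1)
        return self.intra1(z1) + self.intra2(z2) + self.inter(fused)

# Usage with dummy per-modality features (e.g., from two frozen encoders):
head = I2M2Head(d1=512, d2=768, num_classes=10)
logits = head(torch.randn(4, 512), torch.randn(4, 768))  # shape: (4, 10)
```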

Empirical Evaluation and Results

The experimental assessment applies I2M2 across multiple domains: an audio-visual benchmark (AV-MNIST), healthcare (knee MRI exams from fastMRI, and MIMIC-III for mortality and ICD-9 code prediction), and vision-and-language tasks (VQA and NLVR2). Across these, the framework matches or outperforms methods that focus on either inter- or intra-modality dependencies alone:

  • AV-MNIST: I2M2 improved classification accuracy by 1-2% compared to state-of-the-art multimodal fusion techniques.
  • fastMRI: Notably, I2M2 surpassed even the established root-sum-of-squares baseline (recalled after this list), underscoring its potential for tasks with low-SNR inputs.
  • MIMIC-III: Improved accuracy on mortality and ICD-9 code prediction indicates that capturing both types of dependency enhances robustness in clinical prediction scenarios.
  • Vision-and-Language Tasks: The approach maintained or improved performance on datasets like NLVR2 and achieved notable gains on VQA-VS across in-distribution and out-of-distribution settings.
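
For reference, the root-sum-of-squares baseline mentioned above is the standard coil-combination rule in multi-coil MRI rather than something introduced by this paper: given per-coil complex images $x_c$, $c = 1, \dots, C$, the combined magnitude image at pixel $p$ is

$$\hat{x}(p) = \sqrt{\sum_{c=1}^{C} \lvert x_c(p) \rvert^{2}}$$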

Implications and Future Directions

The research underscores the need for a balanced approach to modeling modality dependencies in multi-modal tasks. Because I2M2 requires no prior knowledge about the relative strengths of dependencies in a dataset, it offers a flexible learning paradigm adaptable to various application contexts. In practice, modeling both dependency types lets the predictor exploit redundancy across modalities, enhancing robustness, particularly under distribution shift.

Future work may focus on scaling the approach efficiently as the number of modalities grows. Reducing computational complexity and optimizing end-to-end training without compromising the integrative benefits of I2M2 remains a key area of interest. Moreover, exploring its application in real-time systems and resource-constrained environments could offer valuable insights into the framework's operational scalability and practical utility.

Overall, this paper makes a strong case for combining inter- and intra-modality dependency modeling rather than relying on either in isolation, paving the way for advances in multi-modal learning on increasingly complex tasks and datasets.