Jointly Modeling Inter- & Intra-Modality Dependencies for Multi-modal Learning (2405.17613v2)
Abstract: Supervised multi-modal learning involves mapping multiple modalities to a target label. Previous studies in this field have concentrated on capturing in isolation either the inter-modality dependencies (the relationships between different modalities and the label) or the intra-modality dependencies (the relationships within a single modality and the label). We argue that these conventional approaches, which rely solely on either inter- or intra-modality dependencies, may not be optimal in general. We view the multi-modal learning problem through the lens of generative models, where we consider the target as a source of multiple modalities and of the interaction between them. To that end, we propose the inter- & intra-modality modeling (I2M2) framework, which captures and integrates both the inter- and intra-modality dependencies, leading to more accurate predictions. We evaluate our approach on real-world healthcare and vision-and-language datasets with state-of-the-art models, demonstrating superior performance over traditional methods that focus on only one type of modality dependency.
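The abstract does not specify how the inter- and intra-modality dependencies are integrated. As a rough illustration only, the sketch below assumes a product-of-experts-style combination: each unimodal classifier and one joint multi-modal classifier emit logits, their log-probabilities are summed, and the result is renormalized. The function names, the combination rule, and the example logits are all hypothetical and are not taken from the text above.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over a flat sequence of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]


def combine_experts(expert_logits: Sequence[Sequence[float]]) -> List[float]:
    """Sum per-expert log-probabilities, then renormalize to probabilities.

    ASSUMPTION: a product-of-experts combination, used here purely as an
    illustrative way to merge unimodal and multi-modal predictions.
    """
    n_classes = len(expert_logits[0])
    summed = [0.0] * n_classes
    for logits in expert_logits:
        if len(logits) != n_classes:
            raise ValueError("all experts must score the same set of classes")
        for k, log_p in enumerate(log_softmax(logits)):
            summed[k] += log_p
    # Renormalize the summed log-probabilities into a distribution.
    return [math.exp(v) for v in log_softmax(summed)]


# Hypothetical logits for a 3-class problem with two modalities.
image_only = [2.0, 0.5, 0.1]   # intra-modality expert (modality 1)
text_only = [0.3, 1.5, 0.2]    # intra-modality expert (modality 2)
joint = [1.0, 1.2, 0.1]        # inter-modality expert (both modalities)

probs = combine_experts([image_only, text_only, joint])
print(probs)
```

Under this assumed rule, a modality that carries little label information contributes a near-uniform distribution and therefore barely shifts the combined prediction, which is one reason a multiplicative combination is a natural candidate for mixing both kinds of dependency.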