Sparsely Multimodal Data Fusion (2403.20280v2)
Abstract: Multimodal data fusion is essential for applications that must integrate diverse data sources, especially when some modalities are incomplete or only sparsely available. This paper presents a comparative study of three multimodal embedding techniques, Modal Channel Attention (MCA), Zorro, and Everything at Once (EAO), evaluating their performance on sparsely multimodal data. MCA introduces fusion embeddings for all combinations of input modalities and uses attention masking to create distinct attention channels, enabling flexible and efficient data fusion. Experiments on two datasets with four modalities each, CMU-MOSEI and TCGA, demonstrate that MCA outperforms Zorro on ranking, recall, regression, and classification tasks, and outperforms EAO on regression and classification tasks. MCA achieves this by maintaining robust uniformity across both unimodal and fusion embeddings. EAO performs best on ranking metrics because it forms fusion embeddings only after inference, but it underperforms on downstream tasks that require multimodal interactions. These results highlight the importance of contrasting all modality combinations when constructing embedding spaces and offer insights into the design of multimodal architectures for real-world applications with incomplete data.
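The abstract's description of MCA implies a concrete mechanism: one learned fusion embedding per modality combination, with an attention mask that routes each fusion embedding to exactly its combination's tokens. The sketch below is a minimal, hedged illustration of such a channelized mask in PyTorch; the function name `build_channel_mask`, the decision to restrict modality tokens to intra-modality attention, and the single-token-per-subset layout are illustrative assumptions, not the paper's verbatim implementation.

```python
import itertools
import torch

def build_channel_mask(modality_lengths, fusion_subsets):
    """Boolean attention mask (True = may attend). Each fusion token
    attends only to the modality tokens of its subset; modality tokens
    attend only within their own modality (an assumed routing)."""
    names = list(modality_lengths)
    offsets, pos = {}, 0
    for name in names:                      # start offset of each modality span
        offsets[name] = pos
        pos += modality_lengths[name]
    n_mod = pos
    n_total = n_mod + len(fusion_subsets)
    mask = torch.zeros(n_total, n_total, dtype=torch.bool)

    # Modality tokens: full attention inside their own modality only.
    for name in names:
        s, e = offsets[name], offsets[name] + modality_lengths[name]
        mask[s:e, s:e] = True

    # Fusion tokens: each attends to itself plus its subset's tokens,
    # giving every modality combination its own attention channel.
    for i, subset in enumerate(fusion_subsets):
        f = n_mod + i
        mask[f, f] = True
        for name in subset:
            s, e = offsets[name], offsets[name] + modality_lengths[name]
            mask[f, s:e] = True
    return mask

if __name__ == "__main__":
    lengths = {"text": 4, "audio": 4, "video": 4, "labels": 4}
    # One fusion embedding per non-empty modality combination.
    subsets = [c for r in range(1, len(lengths) + 1)
               for c in itertools.combinations(lengths, r)]
    mask = build_channel_mask(lengths, subsets)
    print(mask.shape)  # torch.Size([31, 31]): 16 modality + 15 fusion tokens
    # Usable as attn_mask in torch.nn.functional.scaled_dot_product_attention.
```

Under this masking, a missing modality can be handled by dropping its token span and every fusion token whose subset includes it, which is plausibly what makes the design natural for sparsely multimodal batches.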
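The abstract also attributes MCA's advantage to uniformity, the measure Wang and Isola (cited below) define as the log of the mean pairwise Gaussian potential between L2-normalized embeddings. Here is a short sketch of that metric, assuming the standard temperature t = 2; the helper name `uniformity` and the comparison suggested in the trailing comment are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def uniformity(x: torch.Tensor, t: float = 2.0) -> torch.Tensor:
    """Wang & Isola (2020) uniformity loss: log mean pairwise Gaussian
    potential. Lower values mean the embeddings are spread more evenly
    over the unit hypersphere."""
    x = F.normalize(x, dim=-1)              # project onto the unit sphere
    sq_dists = F.pdist(x, p=2).pow(2)       # all pairwise ||u - v||_2^2
    return sq_dists.mul(-t).exp().mean().log()

# E.g., comparing uniformity(unimodal_embeds) with uniformity(fusion_embeds)
# probes whether fusion channels stay as well spread as unimodal ones.
```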
- Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023.
- A survey of transformer-based multimodal pre-trained modals. Neurocomputing, 515:89–106, 2023.
- Foundations and recent trends in multimodal machine learning: Principles, challenges, and open questions. arXiv preprint arXiv:2209.03430, 2022.
- Mira: Joint regulatory modeling of multimodal expression and chromatin accessibility in single cells. Nature Methods, 19(9):1097–1108, 2022.
- Clara: Multilingual contrastive learning for audio representation acquisition. arXiv preprint arXiv:2310.11830, 2023.
- Vatt: Transformers for multimodal self-supervised learning from raw video, audio and text. Advances in Neural Information Processing Systems, 34:24206–24221, 2021.
- Flava: A foundational language and vision alignment model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15638–15650, 2022.
- Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
- Self-supervised multimodal versatile networks. Advances in Neural Information Processing Systems, 33:25–37, 2020.
- Multilingual multimodal pre-training for zero-shot cross-lingual transfer of vision-language models. arXiv preprint arXiv:2103.08849, 2021.
- Towards artificial general intelligence via a multimodal foundation model. Nature Communications, 13(1):3094, 2022.
- Mavil: Masked audio-video learners. Advances in Neural Information Processing Systems, 36, 2024.
- Best of both worlds: Multimodal contrastive learning with tabular and imaging data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 23924–23935, 2023.
- Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems, 35:23716–23736, 2022.
- Integrating multimodal information in large pretrained transformers. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 2359–2369, 2020.
- 4m: Massively multimodal masked modeling. Advances in Neural Information Processing Systems, 36, 2024.
- Everything at once – multi-modal fusion transformer for video retrieval. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20020–20029, 2022.
- Zorro: The masked multimodal transformer. arXiv preprint arXiv:2301.09595, 2023.
- Omnivec: Learning robust representations with cross modal sharing. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 1236–1248, 2024.
- M2ftrans: Modality-masked fusion transformer for incomplete multi-modality brain tumor segmentation. IEEE Journal of Biomedical and Health Informatics, 2023.
- mmformer: Multimodal medical transformer for incomplete multimodal learning of brain tumor segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 107–117. Springer, 2022.
- Multi-domain translation between single-cell imaging and sequencing data using autoencoders. Nature Communications, 12(1):31, 2021.
- Learning unseen modality interaction. arXiv preprint arXiv:2306.12795, 2023.
- Understanding contrastive representation learning through alignment and uniformity on the hypersphere. In International Conference on Machine Learning, pages 9929–9939. PMLR, 2020.
- Training transitive and commutative multimodal transformers with LoReTTa. arXiv preprint arXiv:2305.14243, 2023.
- One-stage modality distillation for incomplete multimodal learning. arXiv preprint arXiv:2309.08204, 2023.
- Understanding multimodal contrastive learning and incorporating unpaired data. In International Conference on Artificial Intelligence and Statistics, pages 4348–4380. PMLR, 2023.
- A multi-sensor dataset with annotated activities of daily living recorded in a residential setting. Scientific Data, 10(1):162, 2023.
- Deep multi-modal fusion of image and non-image data in disease diagnosis and prognosis: a review. Progress in Biomedical Engineering, 2023.
- Mind the gap: Understanding the modality gap in multi-modal contrastive representation learning. Advances in Neural Information Processing Systems, 35:17612–17625, 2022.
- Multi-attention recurrent network for human communication comprehension. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
- The cancer genome atlas pan-cancer analysis project. Nature Genetics, 45(10):1113–1120, 2013.
- Pytorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems, 32, 2019.