Learning to Rebalance Multi-Modal Optimization by Adaptively Masking Subnetworks (2404.08347v1)
Abstract: Multi-modal learning aims to enhance performance by unifying models from various modalities, but real data often exhibits the "modality imbalance" problem, which biases training toward dominant modalities and neglects the others, limiting overall effectiveness. To address this challenge, the core idea is to balance the optimization of each modality so that a joint optimum can be reached. Existing approaches typically employ a modality-level control mechanism that adjusts the update of each modality's parameters as a whole. However, such a global updating mechanism ignores the varying importance of individual parameters. Inspired by subnetwork optimization, we explore a uniform sampling-based optimization strategy and find it more effective than global updating. Building on these findings, we propose a novel importance sampling-based, element-wise joint optimization method, called Adaptively Mask Subnetworks Considering Modal Significance (AMSS). Specifically, we use mutual information rates to determine modal significance and employ non-uniform adaptive sampling to select foreground subnetworks from each modality for parameter updates, thereby rebalancing multi-modal learning. We also demonstrate the reliability of the AMSS strategy through convergence analysis. Building on these theoretical insights, we further enhance the multi-modal mask subnetwork strategy with unbiased estimation, referred to as AMSS+. Extensive experiments demonstrate the superiority of our approach over competing methods.
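The sketch below is a minimal, hypothetical illustration (not the authors' implementation) of the element-wise masking idea described in the abstract: per-parameter importance scores drive non-uniform sampling of a binary mask, and only the sampled "foreground" subnetwork of each modality receives gradient updates. The function name `masked_update`, the `keep_ratio` knob, and the use of squared gradients as an importance proxy are all assumptions; in AMSS the per-modality significance is derived from mutual information rates, and AMSS+ further rescales kept gradients for an unbiased estimate.

```python
# Minimal sketch of element-wise gradient masking for one modality's encoder.
# Assumptions: squared gradient magnitude stands in for parameter importance,
# and `keep_ratio` stands in for the modality-significance-driven sampling budget.
import torch


def masked_update(params, keep_ratio=0.5, eps=1e-12):
    """Sample an element-wise binary mask per parameter tensor and zero the
    gradients of un-sampled (background) elements, so that only the sampled
    foreground subnetwork is changed by the subsequent optimizer step."""
    for p in params:
        if p.grad is None:
            continue
        # Stand-in importance score: squared gradient magnitude (assumption).
        importance = p.grad.detach() ** 2
        probs = importance / (importance.sum() + eps)            # normalize to a distribution
        probs = (probs * probs.numel() * keep_ratio).clamp(0.0, 1.0)  # expected kept fraction ~= keep_ratio
        mask = torch.bernoulli(probs)                             # non-uniform element-wise sampling
        p.grad.mul_(mask)                                         # update only the foreground subnetwork


# Hypothetical usage, after loss.backward() and before optimizer.step():
#   masked_update(audio_encoder.parameters(), keep_ratio=0.3)   # dominant modality: mask more of its update
#   masked_update(visual_encoder.parameters(), keep_ratio=0.8)  # weaker modality: keep more of its update
```

One design note: because masking biases the stochastic gradient, an unbiased variant (in the spirit of AMSS+) would divide each kept gradient element by its sampling probability, which is omitted here for brevity.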