Learning to Rebalance Multi-Modal Optimization by Adaptively Masking Subnetworks (2404.08347v1)

Published 12 Apr 2024 in cs.CV and cs.LG

Abstract: Multi-modal learning aims to enhance performance by unifying models from various modalities, but it often faces the "modality imbalance" problem on real data, which biases training towards dominant modalities while neglecting others and thereby limits overall effectiveness. To address this challenge, the core idea is to balance the optimization of each modality so as to achieve a joint optimum. Existing approaches often employ a modality-level control mechanism to adjust the updates of each modality's parameters. However, such a global updating mechanism ignores the differing importance of individual parameters. Inspired by subnetwork optimization, we explore a uniform sampling-based optimization strategy and find it more effective than global updating. Based on these findings, we further propose a novel importance sampling-based, element-wise joint optimization method, called Adaptively Mask Subnetworks Considering Modal Significance (AMSS). Specifically, we incorporate mutual information rates to determine modal significance and employ non-uniform adaptive sampling to select foreground subnetworks from each modality for parameter updates, thereby rebalancing multi-modal learning. Additionally, we demonstrate the reliability of the AMSS strategy through a convergence analysis. Building upon these theoretical insights, we further enhance the multi-modal mask subnetwork strategy with unbiased estimation, referred to as AMSS+. Extensive experiments demonstrate the superiority of our approach over comparison methods.
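
The masking idea can be illustrated with a short sketch. The snippet below is a minimal, assumption-laden PyTorch illustration of element-wise, importance-sampled parameter updates with a per-modality update budget. It uses gradient magnitude as a stand-in importance score and hand-picked keep ratios, whereas the paper derives modal significance from mutual information rates; the function name masked_step, the learning rate, and the placeholder encoders are illustrative only, not the authors' implementation.

```python
# Minimal sketch (not the authors' code): importance-sampled, element-wise
# parameter masking with a per-modality update budget. Importance here is
# approximated by gradient magnitude; the paper instead derives modal
# significance from mutual information rates.
import torch

def masked_step(params, grads, keep_ratio, lr=1e-3):
    """Update only a sampled 'foreground' subnetwork of the parameters.

    Each element is kept with probability proportional to its importance
    score, scaled so that roughly a keep_ratio fraction of elements is
    updated; kept gradients are rescaled by 1/prob for an unbiased estimate
    (in the spirit of the AMSS+ variant).
    """
    for p, g in zip(params, grads):
        score = g.abs()                          # stand-in importance score
        probs = score / (score.sum() + 1e-12) * score.numel() * keep_ratio
        probs = probs.clamp(min=1e-12, max=1.0)  # valid Bernoulli probabilities
        mask = torch.bernoulli(probs)            # sample the subnetwork
        g_hat = mask * g / probs                 # unbiased masked gradient
        with torch.no_grad():
            p -= lr * g_hat

# Toy usage: placeholder encoders for two modalities; the modality assumed
# dominant gets a smaller update budget so the weaker one catches up.
audio_encoder = torch.nn.Linear(128, 64)
visual_encoder = torch.nn.Linear(256, 64)

x_a, x_v = torch.randn(8, 128), torch.randn(8, 256)
loss = audio_encoder(x_a).pow(2).mean() + visual_encoder(x_v).pow(2).mean()
loss.backward()

for enc, keep in [(visual_encoder, 0.3), (audio_encoder, 0.8)]:
    ps = [p for p in enc.parameters() if p.grad is not None]
    masked_step(ps, [p.grad for p in ps], keep_ratio=keep)
```

In the actual method, the per-modality budgets would be driven by the estimated modal significance rather than fixed by hand, and adapted as training progresses.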
