ReconBoost: Boosting Can Achieve Modality Reconcilement (2405.09321v1)

Published 15 May 2024 in cs.CV, cs.AI, cs.LG, and cs.MM

Abstract: This paper explores a novel multi-modal alternating learning paradigm that pursues a reconciliation between the exploitation of uni-modal features and the exploration of cross-modal interactions. This is motivated by the fact that current paradigms of multi-modal learning tend to explore multi-modal features simultaneously. The resulting gradient prohibits further exploitation of the features in the weak modality, leading to modality competition, where the dominant modality overpowers the learning process. To address this issue, we study the modality-alternating learning paradigm to achieve reconcilement. Specifically, we propose a new method called ReconBoost that updates one fixed modality at a time. Herein, the learning objective is dynamically adjusted with a reconcilement regularization against competition with the historical models. By choosing a KL-based reconcilement, we show that the proposed method resembles Friedman's Gradient-Boosting (GB) algorithm, where the updated learner can correct errors made by others and help enhance the overall performance. The major difference from classic GB is that we preserve only the newest model for each modality, to avoid overfitting caused by ensembling strong learners. Furthermore, we propose a memory consolidation scheme and a global rectification scheme to make this strategy more effective. Experiments on six multi-modal benchmarks speak to the efficacy of the method. We release the code at https://github.com/huacong/ReconBoost.
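
The alternating update at the heart of this paradigm is easy to picture in code. Below is a minimal PyTorch sketch, not the authors' implementation (that is in the linked repository): the function and variable names, the additive combination of logits, and the sign and weight of the KL term are all illustrative assumptions about how a KL-based reconcilement regularizer against the frozen historical models might look.

```python
# Illustrative sketch of a modality-alternating update in the spirit of
# ReconBoost. Names (nets, reconcile_weight, ...) are hypothetical; see
# https://github.com/huacong/ReconBoost for the official implementation.
import torch
import torch.nn.functional as F

def alternating_step(nets, optimizers, batch, labels, active, reconcile_weight=1.0):
    """Update only the learner for the `active` modality, keeping the others fixed.

    The loss combines the usual task loss with a KL-based term against the
    frozen ensemble of the other modalities' newest models, so the active
    learner is pushed to contribute complementary evidence rather than
    compete for the same evidence -- the gradient-boosting flavor.
    """
    logits_active = nets[active](batch[active])

    # Frozen "historical" ensemble: the newest model of every other modality.
    with torch.no_grad():
        others = [nets[m](batch[m]) for m in nets if m != active]
        ensemble = torch.stack(others).mean(dim=0)

    # Boosting-style additive combination: the active learner corrects
    # what the frozen ensemble already predicts. (Assumed form.)
    task_loss = F.cross_entropy(logits_active + ensemble, labels)

    # KL reconcilement term: discourage the active learner from merely
    # mimicking the historical models. The sign and weighting here are
    # illustrative assumptions, not the paper's exact objective.
    kl = F.kl_div(F.log_softmax(logits_active, dim=-1),
                  F.softmax(ensemble, dim=-1), reduction="batchmean")
    loss = task_loss - reconcile_weight * kl

    optimizers[active].zero_grad()
    loss.backward()
    optimizers[active].step()
    return loss.item()
```

Cycling `active` over the modalities step by step reproduces the fixed-modality alternation described above, and because `nets[active]` is updated in place, only the newest model per modality is ever kept, mirroring the stated departure from classic GB.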
