MMPareto: Boosting Multimodal Learning with Innocent Unimodal Assistance (2405.17730v1)
Abstract: Multimodal learning methods with targeted unimodal learning objectives have proven effective in alleviating the imbalanced multimodal learning problem. However, in this paper, we identify a previously ignored gradient conflict between the multimodal and unimodal learning objectives, which can mislead the optimization of the unimodal encoders. To diminish these conflicts, we first observe a discrepancy between the multimodal and unimodal losses: both the gradient magnitude and the gradient covariance of the easier-to-learn multimodal loss are smaller than those of the unimodal loss. Building on this property, we analyze Pareto integration in the multimodal scenario and propose the MMPareto algorithm, which ensures a final gradient whose direction is common to all learning objectives and whose magnitude is enhanced to improve generalization, thereby providing innocent unimodal assistance. Experiments across multiple modality types and frameworks with dense cross-modal interaction demonstrate the superior and extensible performance of our method. Our method is also expected to benefit multi-task settings with a clear discrepancy in task difficulty, demonstrating its ideal scalability. The source code and dataset are available at https://github.com/GeWu-Lab/MMPareto_ICML2024.
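To make the idea of the abstract concrete, the sketch below illustrates Pareto-style integration of two gradients (a multimodal-loss gradient and a unimodal-loss gradient): a two-objective min-norm combination yields a direction that does not conflict with either objective, and the result is then rescaled so its magnitude is not shrunk. The closed-form weighting and the specific rescaling rule (averaging the two raw norms) are illustrative assumptions, not the authors' exact MMPareto implementation.

```python
# Minimal sketch of two-objective Pareto gradient integration, assuming
# flattened gradients for a shared unimodal encoder. The magnitude rule
# below is a placeholder assumption, not the paper's exact procedure.
import torch


def pareto_combine(g_mm: torch.Tensor, g_uni: torch.Tensor) -> torch.Tensor:
    """Combine a multimodal-loss gradient and a unimodal-loss gradient.

    1. Solve min_{a in [0,1]} || a * g_mm + (1 - a) * g_uni ||^2
       (the two-task closed form used in MGDA-style methods), giving a
       direction with non-negative inner product with both gradients.
    2. Rescale the combined direction to the mean of the two raw norms,
       so the update magnitude is not diminished by the combination.
    """
    diff = g_mm - g_uni
    denom = diff.dot(diff).clamp_min(1e-12)
    alpha = ((g_uni - g_mm).dot(g_uni) / denom).clamp(0.0, 1.0)

    combined = alpha * g_mm + (1.0 - alpha) * g_uni

    # Keep an "enhanced" magnitude instead of the (possibly tiny) min-norm one.
    target_norm = 0.5 * (g_mm.norm() + g_uni.norm())
    return combined * (target_norm / combined.norm().clamp_min(1e-12))


# Toy usage with two conflicting gradient directions.
g_multimodal = torch.tensor([0.2, -0.1, 0.05])
g_unimodal = torch.tensor([-0.4, 0.6, 0.3])
print(pareto_combine(g_multimodal, g_unimodal))
```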