Attribution Regularization for Multimodal Paradigms (2404.02359v1)

Published 2 Apr 2024 in cs.LG

Abstract: Multimodal machine learning has gained significant attention in recent years because it can integrate information from multiple modalities to improve learning and decision-making. However, unimodal models are often observed to outperform multimodal models, even though the latter have access to richer information. In addition, a single modality often dominates the decision-making process, which leads to suboptimal performance. This research project addresses these challenges by proposing a novel regularization term that encourages multimodal models to use information from all modalities when making decisions. The project focuses on the video-audio domain, although the proposed regularization technique holds promise for broader applications in embodied AI research, where multiple modalities are involved. The regularization term is intended to mitigate unimodal dominance and improve the performance of multimodal machine learning systems. The effectiveness and generalizability of the proposed technique will be assessed through extensive experimentation and evaluation. The findings could contribute to the advancement of multimodal machine learning and facilitate its application in domains including multimedia analysis, human-computer interaction, and embodied AI research.
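
The abstract does not state the form of the regularization term, so the following is only an illustrative sketch of the general idea, not the paper's formulation. It assumes a linear late-fusion score, uses gradient*input as the attribution (for a linear score this is simply weight times input), and penalizes the squared deviation of each modality's attribution share from a uniform split. All function names and the weighting parameter `lam` are hypothetical.

```python
def modality_attributions(weights, inputs):
    """Per-modality attribution mass for a linear late-fusion score.

    For score = sum_i w_i * x_i, the gradient w.r.t. x_i is w_i, so the
    gradient*input attribution of feature i is w_i * x_i. The mass of a
    modality is the sum of absolute attributions over its features.
    """
    return {
        name: sum(abs(w * x) for w, x in zip(weights[name], inputs[name]))
        for name in inputs
    }


def attribution_balance_penalty(masses, eps=1e-12):
    """Squared deviation of each modality's attribution share from uniform.

    Assumed penalty for illustration: it is zero when every modality
    carries the same share and grows as one modality dominates.
    """
    total = sum(masses.values()) + eps  # eps avoids division by zero
    uniform = 1.0 / len(masses)
    return sum((m / total - uniform) ** 2 for m in masses.values())


def regularized_loss(task_loss, masses, lam=0.1):
    """Task loss plus the weighted balance penalty (lam is hypothetical)."""
    return task_loss + lam * attribution_balance_penalty(masses)


# Toy video-audio example: the video weights dominate the score.
inputs = {"video": [1.0, 1.0], "audio": [1.0, 1.0]}
dominated = modality_attributions(
    {"video": [2.0, 1.0], "audio": [0.1, 0.1]}, inputs
)
balanced = modality_attributions(
    {"video": [1.0, 1.0], "audio": [1.0, 1.0]}, inputs
)

print(attribution_balance_penalty(dominated))  # larger: video dominates
print(attribution_balance_penalty(balanced))   # close to zero
```

In a deep network the attributions would come from automatic differentiation rather than a closed form, and the penalty would be added to the training loss so that its gradient discourages reliance on a single modality. The choice of attribution method and penalty shape here is an assumption.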

