Exploring Missing Modality in Multimodal Egocentric Datasets (2401.11470v2)

Published 21 Jan 2024 in cs.CV

Abstract: Multimodal video understanding is crucial for analyzing egocentric videos, where integrating multiple sensory signals significantly enhances action recognition and moment localization. However, practical applications often grapple with incomplete modalities due to factors like privacy concerns, efficiency demands, or hardware malfunctions. Addressing this, our study examines the impact of missing modalities on egocentric action recognition, particularly within transformer-based models. We introduce a novel concept, the Missing Modality Token (MMT), to maintain performance even when modalities are absent, a strategy that proves effective on the Ego4D, Epic-Kitchens, and Epic-Sounds datasets. Our method mitigates the performance loss, reducing it from the original $\sim 30\%$ drop to only $\sim 10\%$ when half of the test set is modal-incomplete. Through extensive experimentation, we demonstrate the adaptability of MMT to different training scenarios and its superiority over current methods in handling missing modalities. Our research contributes a comprehensive analysis and an innovative approach, opening avenues for more resilient multimodal systems in real-world settings.
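The core idea behind the Missing Modality Token can be illustrated with a minimal NumPy sketch: when one modality's features are unavailable at inference time, a learnable token vector is substituted in their place before fusion. This is only a hedged illustration of the concept, not the authors' implementation; in the paper the token is trained end-to-end inside a transformer-based fusion model, whereas here the token is a fixed vector, the function names are invented, and fusion is simplified to concatenation.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 8  # feature dimension (illustrative only)

# Stand-in for the learned Missing Modality Token (MMT). In the paper this
# vector is a trainable parameter optimized jointly with the model; here it
# is a fixed random vector purely for demonstration.
mmt = rng.standard_normal(DIM)

def fuse(video_feat, audio_feat):
    """Substitute the MMT for any absent modality, then fuse.

    `video_feat` / `audio_feat` are per-clip feature vectors, or None when
    that modality is missing. Fusion here is plain concatenation; the paper
    fuses inside a transformer.
    """
    v = video_feat if video_feat is not None else mmt
    a = audio_feat if audio_feat is not None else mmt
    return np.concatenate([v, a])

# Both modal-complete and modal-incomplete clips yield inputs of the same
# shape, so the downstream classifier needs no architectural change.
full = fuse(rng.standard_normal(DIM), rng.standard_normal(DIM))
audio_missing = fuse(rng.standard_normal(DIM), None)
assert full.shape == audio_missing.shape == (2 * DIM,)
```

The key design point this sketch captures is that the MMT keeps the fused input shape identical whether or not a modality is present, avoiding separate model branches for modal-incomplete samples.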

Authors (4)
  1. Merey Ramazanova
  2. Alejandro Pardo
  3. Humam Alwassel
  4. Bernard Ghanem
Citations (3)