X-MIC: Cross-Modal Instance Conditioning for Egocentric Action Generalization (2403.19811v1)
Abstract: Lately, there has been growing interest in adapting vision-language models (VLMs) to image and third-person video classification due to their success in zero-shot recognition. However, the adaptation of these models to egocentric videos has been largely unexplored. To address this gap, we propose a simple yet effective cross-modal adaptation framework, which we call X-MIC. Using a video adapter, our pipeline learns to align frozen text embeddings to each egocentric video directly in the shared embedding space. Our novel adapter architecture retains and improves the generalization of the pre-trained VLMs by disentangling the learnable temporal modeling from the frozen visual encoder. This results in an enhanced alignment of text embeddings to each egocentric video, leading to a significant improvement in cross-dataset generalization. We evaluate our approach on the Epic-Kitchens, Ego4D, and EGTEA datasets for fine-grained cross-dataset action generalization, demonstrating the effectiveness of our method. Code is available at https://github.com/annusha/xmic
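To make the abstract's mechanism concrete, below is a minimal sketch of cross-modal instance conditioning in the spirit of X-MIC: frozen per-frame visual features pass through a small learnable temporal module that produces a per-video conditioning vector in the shared embedding space, which then shifts the frozen text embeddings before cosine-similarity scoring. The additive conditioning, the pooling choices, and the names (`VideoConditioner`, `xmic_logits`) are assumptions for illustration, not the authors' released implementation.

```python
# Sketch of instance-conditioned text adaptation over frozen VLM embeddings.
# Assumes frame and text features already come from a frozen encoder (e.g. CLIP).
import torch
import torch.nn as nn
import torch.nn.functional as F


class VideoConditioner(nn.Module):
    """Learnable temporal module over frozen per-frame features.

    Aggregates T frame embeddings into one conditioning vector that lives
    in the shared vision-language embedding space.
    """

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.temporal_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.proj = nn.Linear(dim, dim)

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (B, T, D) frozen per-frame visual embeddings
        attended, _ = self.temporal_attn(frame_feats, frame_feats, frame_feats)
        video_vec = attended.mean(dim=1)   # temporal pooling -> (B, D)
        return self.proj(video_vec)        # per-video conditioning vector (B, D)


def xmic_logits(frame_feats, text_embeds, conditioner, scale=100.0):
    """Score each video against every class prompt.

    frame_feats: (B, T, D) frozen visual features; text_embeds: (C, D) frozen
    text embeddings of the class prompts. The conditioning vector shifts the
    text embeddings per video instance (additive shift is an assumption of
    this sketch) before cosine-similarity classification.
    """
    cond = conditioner(frame_feats)                              # (B, D)
    video_embed = F.normalize(frame_feats.mean(dim=1), dim=-1)   # (B, D)
    # Instance-adapted text embeddings: (B, C, D)
    adapted = F.normalize(text_embeds.unsqueeze(0) + cond.unsqueeze(1), dim=-1)
    return scale * torch.einsum("bd,bcd->bc", video_embed, adapted)


if __name__ == "__main__":
    B, T, D, C = 2, 8, 512, 10
    conditioner = VideoConditioner(dim=D)
    frames = torch.randn(B, T, D)   # stand-in for frozen per-frame features
    texts = torch.randn(C, D)       # stand-in for frozen class-text embeddings
    print(xmic_logits(frames, texts, conditioner).shape)   # torch.Size([2, 10])
```

Only the conditioner is trained; the visual and text encoders stay frozen, which is how the design keeps the pre-trained VLM's zero-shot generalization while adding video-specific temporal modeling.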