On the Efficacy of Text-Based Input Modalities for Action Anticipation (2401.12972v3)

Published 23 Jan 2024 in cs.CV, cs.AI, cs.LG, and eess.IV

Abstract: Anticipating future actions is a highly challenging task due to the diversity and scale of potential future actions; yet, information from different modalities helps narrow down plausible action choices. Each modality can provide diverse and often complementary context for the model to learn from. While previous multi-modal methods leverage information from modalities such as video and audio, we primarily explore how text descriptions of actions and objects can also lead to more accurate action anticipation by providing additional contextual cues, e.g., about the environment and its contents. We propose a Multi-modal Contrastive Anticipative Transformer (M-CAT), a video transformer architecture that jointly learns from multi-modal features and text descriptions of actions and objects. We train our model in two stages: the model first learns to align video clips with descriptions of future actions, and is subsequently fine-tuned to predict future actions. Compared to existing methods, M-CAT has the advantage of learning additional context from two types of text inputs: rich descriptions of future actions during pre-training, and text descriptions of detected objects and actions during modality feature fusion. Through extensive experimental evaluation, we demonstrate that our model outperforms previous methods on the EpicKitchens datasets, and show that using simple text descriptions of actions and objects aids in more effective action anticipation. In addition, we examine the impact of object and action information obtained via text, and perform extensive ablations.
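
The abstract describes a two-stage recipe: contrastive alignment of video clips with text descriptions of future actions, followed by fine-tuning to predict the future action. The sketch below illustrates that recipe in PyTorch under stated assumptions; it is not the authors' implementation. The module names, feature dimensions, CLIP-style symmetric InfoNCE loss, placeholder class count, and the omission of the object/action text fusion step are all assumptions made here for brevity.

```python
# Illustrative sketch only (assumed details, not the authors' released code):
# stage 1 contrastively aligns video-clip features with text descriptions of
# future actions; stage 2 adds a classification head over future actions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TwoStageAnticipationSketch(nn.Module):
    def __init__(self, video_dim=1024, text_dim=768, embed_dim=512, num_actions=1000):
        super().__init__()
        # Project pre-extracted video and text features into a shared embedding space.
        self.video_proj = nn.Linear(video_dim, embed_dim)
        self.text_proj = nn.Linear(text_dim, embed_dim)
        # Learnable temperature, initialized CLIP-style at ~log(1/0.07).
        self.logit_scale = nn.Parameter(torch.tensor(2.659))
        # Stage-2 head: predict the future action class (class count is a placeholder).
        self.action_head = nn.Linear(embed_dim, num_actions)

    def alignment_loss(self, video_feats, text_feats):
        """Stage 1: symmetric InfoNCE between clips and future-action descriptions."""
        v = F.normalize(self.video_proj(video_feats), dim=-1)
        t = F.normalize(self.text_proj(text_feats), dim=-1)
        logits = self.logit_scale.exp() * v @ t.T           # (B, B) similarity matrix
        targets = torch.arange(v.size(0), device=v.device)  # matched pairs on the diagonal
        return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))

    def anticipation_logits(self, video_feats):
        """Stage 2: fine-tune with a standard classification loss over future actions."""
        return self.action_head(self.video_proj(video_feats))


# Random tensors stand in for pre-extracted clip and text features.
model = TwoStageAnticipationSketch()
video = torch.randn(8, 1024)
text = torch.randn(8, 768)
stage1_loss = model.alignment_loss(video, text)
stage2_loss = F.cross_entropy(model.anticipation_logits(video),
                              torch.randint(0, 1000, (8,)))
```

In the paper's full pipeline, text descriptions of detected objects and actions are additionally fused with the other modality features; the sketch omits that fusion step and uses only clip-level and description-level embeddings.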

Authors (4)
  1. Apoorva Beedu (10 papers)
  2. Karan Samel (6 papers)
  3. Irfan Essa (91 papers)
  4. Harish Haresamudram (12 papers)
Citations (1)
