Semantically Guided Representation Learning For Action Anticipation (2407.02309v1)
Abstract: Action anticipation is the task of forecasting future activity from a partially observed sequence of events. However, this task is challenged by the intrinsic uncertainty of the future and the difficulty of reasoning over interconnected actions. Unlike previous works that focus on extrapolating better visual and temporal information, we concentrate on learning action representations that are aware of their semantic interconnectivity based on prototypical action patterns and contextual co-occurrences. To this end, we propose the novel Semantically Guided Representation Learning (S-GEAR) framework. S-GEAR learns visual action prototypes and leverages LLMs to structure their relationships, inducing semanticity. To assess S-GEAR's effectiveness, we evaluate it on four action anticipation benchmarks, obtaining improved results compared to previous works: +3.5, +2.7, and +3.5 absolute points on Top-1 Accuracy on EPIC-Kitchens 55, EGTEA Gaze+ and 50 Salads, respectively, and +0.8 on Top-5 Recall on EPIC-Kitchens 100. We further observe that S-GEAR effectively transfers the geometric associations between actions from language to visual prototypes. Finally, S-GEAR opens new research frontiers in anticipation tasks by demonstrating the intricate impact of action semantic interconnectivity.
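The following minimal sketch illustrates the kind of semantic guidance the abstract describes: embed the action labels with a language model, then train learnable visual prototypes so that their pairwise cosine-similarity geometry matches the language-side geometry. This is our illustrative assumption rather than the authors' released implementation; the dimensions, the random stand-in text embeddings, and the MSE objective are all hypothetical.

```python
import torch
import torch.nn.functional as F

def pairwise_cosine(x: torch.Tensor) -> torch.Tensor:
    # Cosine-similarity matrix between all pairs of row vectors.
    x = F.normalize(x, dim=-1)
    return x @ x.t()

# Hypothetical sizes: 10 action classes, 256-d learnable visual prototypes,
# 384-d frozen sentence embeddings of the action labels (stand-in random
# tensors here; in practice these would come from a language model).
num_actions = 10
visual_prototypes = torch.nn.Parameter(torch.randn(num_actions, 256))
text_embeddings = torch.randn(num_actions, 384)

# Language-side similarity structure acts as the fixed target geometry.
sem_target = pairwise_cosine(text_embeddings)
# Visual-side similarity structure is what training reshapes.
sem_pred = pairwise_cosine(visual_prototypes)

# Semantic-guidance loss: pull the prototypes' pairwise geometry toward the
# language geometry, so semantically related actions end up with related
# visual prototypes.
loss_sem = F.mse_loss(sem_pred, sem_target)
loss_sem.backward()  # gradients flow only into visual_prototypes
```

Matching the two similarity matrices is one simple way to realize the abstract's observation that geometric associations between actions can transfer from language to visual prototypes; in a full model this term would be combined with the usual anticipation objective.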
- Anxhelo Diko
- Danilo Avola
- Bardh Prenkaj
- Federico Fontana
- Luigi Cinque