Reasoning over the Behaviour of Objects in Video-Clips for Adverb-Type Recognition (2307.04132v3)
Abstract: In this work, following the intuition that adverbs describing scene sequences are best identified by reasoning over high-level concepts of object behaviour, we propose a new framework that reasons over object behaviours extracted from raw video clips to recognize each clip's corresponding adverb types. Importantly, while previous works on general scene adverb recognition assume knowledge of a clip's underlying action type, our method applies directly in the more general setting where the action type of a video clip is unknown. Specifically, we propose a novel pipeline that extracts human-interpretable object-behaviour facts from raw video clips, together with novel symbolic and transformer-based reasoning methods that operate over these extracted facts to identify adverb types. Experimental results demonstrate that our proposed methods perform favourably against the previous state-of-the-art. Additionally, to support efforts in symbolic video processing, we release two new datasets of object-behaviour facts extracted from raw video clips: the MSR-VTT-ASP and ActivityNet-ASP datasets.
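As a rough illustration of the pipeline described above, the sketch below encodes object-behaviour observations as ASP-style facts and applies a toy symbolic rule to infer an adverb type. The predicate names (`moves`, `holds`) and the rule itself are illustrative assumptions for this sketch, not the paper's actual fact schema or reasoning method.

```python
# Hypothetical sketch: object-behaviour facts serialized in ASP-style
# syntax, plus a toy symbolic rule for adverb-type inference.
# All predicate names and the rule are assumptions, not the paper's schema.
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    predicate: str   # e.g. "moves", "holds"
    args: tuple      # the objects involved
    frame: int       # frame index in the clip

    def to_asp(self) -> str:
        # Render as an ASP fact, e.g. moves(person, 1).
        inner = ", ".join(map(str, self.args + (self.frame,)))
        return f"{self.predicate}({inner})."


def infer_adverb(facts: list) -> str:
    """Toy rule: if an object is moving in most observed facts,
    label the clip 'quickly'; otherwise 'slowly'."""
    moving = sum(1 for f in facts if f.predicate == "moves")
    return "quickly" if moving > len(facts) / 2 else "slowly"


facts = [
    Fact("moves", ("person",), 1),
    Fact("moves", ("person",), 2),
    Fact("holds", ("person", "cup"), 3),
]
print([f.to_asp() for f in facts])
print(infer_adverb(facts))  # 'quickly'
```

In the paper's released MSR-VTT-ASP and ActivityNet-ASP datasets, such facts are the symbolic representation over which both the symbolic and the transformer-based reasoners operate; this sketch only mimics that interface at toy scale.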
- Amrit Diggavi Seshadri
- Alessandra Russo