Reasoning over the Behaviour of Objects in Video-Clips for Adverb-Type Recognition (2307.04132v3)

Published 9 Jul 2023 in cs.CV, cs.AI, and cs.SC

Abstract: In this work, following the intuition that adverbs describing scene-sequences are best identified by reasoning over high-level concepts of object-behaviour, we propose a new framework that reasons over object-behaviours extracted from raw video clips to recognize the clips' corresponding adverb-types. Importantly, while previous works for general scene adverb-recognition assume knowledge of a clip's underlying action-type, our method is directly applicable in the more general setting where the action-type of a video clip is unknown. Specifically, we propose a novel pipeline that extracts human-interpretable object-behaviour facts from raw video clips, together with novel symbolic and transformer-based reasoning methods that operate over these extracted facts to identify adverb-types. Experimental results demonstrate that our proposed methods perform favourably against the previous state-of-the-art. Additionally, to support efforts in symbolic video processing, we release two new datasets of object-behaviour facts extracted from raw video clips: the MSR-VTT-ASP and ActivityNet-ASP datasets.
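To make the two-stage idea in the abstract concrete, here is a minimal sketch of one plausible shape for such a pipeline: qualitative object-behaviour facts are extracted from per-frame detections, serialized, and fed to a small transformer classifier over adverb-types. Everything here is an illustrative assumption, not the authors' implementation: the predicate vocabulary, the displacement threshold, the fact serialization, and the `BehaviourFact`, `clip_to_facts`, and `FactTransformer` names are all hypothetical, and the paper's symbolic (ASP-based) reasoning variant is not reproduced.

```python
import torch
import torch.nn as nn
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class BehaviourFact:
    """One human-interpretable object-behaviour fact, e.g. the
    ASP-style atom moving(person, fast, t3)."""
    predicate: str  # qualitative behaviour, e.g. "moving"
    obj: str        # detected object label, e.g. "person"
    value: str      # qualitative attribute, e.g. "fast"
    t: int          # frame / segment index

    def to_asp(self) -> str:
        return f"{self.predicate}({self.obj},{self.value},t{self.t})."

def clip_to_facts(detections: List[List[Tuple[str, float]]]) -> List[BehaviourFact]:
    """Hypothetical extractor: maps per-frame (label, displacement)
    detections to qualitative motion facts. The 5.0-pixel threshold
    is an arbitrary illustrative choice."""
    facts = []
    for t, frame in enumerate(detections):
        for label, speed in frame:
            facts.append(BehaviourFact(
                "moving", label, "fast" if speed > 5.0 else "slow", t))
    return facts

class FactTransformer(nn.Module):
    """Toy transformer that pools a serialized fact-token sequence into
    clip-level adverb-type logits (architecture is illustrative only)."""
    def __init__(self, vocab_size: int, n_adverb_types: int, d_model: int = 64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, n_adverb_types)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        h = self.encoder(self.embed(token_ids))  # (batch, seq, d_model)
        return self.head(h.mean(dim=1))          # mean-pool over facts

# Toy end-to-end run: two frames, one tracked object each.
facts = clip_to_facts([[("person", 7.2)], [("person", 3.1)]])
print([f.to_asp() for f in facts])
# ['moving(person,fast,t0).', 'moving(person,slow,t1).']

vocab = {tok: i for i, tok in enumerate(
    sorted({f"{f.predicate}:{f.obj}:{f.value}" for f in facts}))}
ids = torch.tensor([[vocab[f"{f.predicate}:{f.obj}:{f.value}"] for f in facts]])
logits = FactTransformer(vocab_size=len(vocab), n_adverb_types=6)(ids)
print(logits.shape)  # torch.Size([1, 6]) -> scores over 6 adverb types
```

In a real system the (label, displacement) inputs would come from an object detector and tracker run over the raw frames, and the same fact base could instead be handed to an ASP solver for the symbolic variant the abstract describes.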

Authors (2)
  1. Amrit Diggavi Seshadri
  2. Alessandra Russo