ChimpVLM: Ethogram-Enhanced Chimpanzee Behaviour Recognition

Published 13 Apr 2024 in cs.CV, cs.AI, and cs.LG | (2404.08937v1)

Abstract: We show that chimpanzee behaviour understanding from camera traps can be enhanced by giving visual architectures access to an embedding of text descriptions that detail species behaviours. In particular, we present a vision-language model (VLM) that uses multi-modal decoding of visual features extracted directly from camera trap videos to process query tokens representing behaviours and output class predictions. Query tokens are initialised using a standardised ethogram of chimpanzee behaviour, rather than random or name-based initialisations. In addition, we explore the effect of initialising query tokens using a masked language model fine-tuned on a text corpus of known behavioural patterns. We evaluate our system on the PanAf500 and PanAf20K datasets and demonstrate the performance benefits of our multi-modal decoding approach and query initialisation strategy on multi-class and multi-label recognition tasks, respectively. Results and ablations corroborate these improvements. We achieve state-of-the-art performance over vision and vision-language models in top-1 accuracy (+6.34%) on PanAf500 and in overall (+1.1%) and tail-class (+2.26%) mean average precision on PanAf20K. We share complete source code and network weights for full reproducibility of results and easy utilisation.
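To make the idea of text-initialised query tokens concrete, the following is a minimal sketch, not the authors' implementation. It assumes one query token per behaviour class, a single cross-attention step over visual tokens, and a linear read-out producing one logit per behaviour. The ethogram descriptions, the `embed_text` stand-in (a hash-seeded random vector in place of a real text encoder), the dimensions, and the random visual features are all hypothetical placeholders.

```python
import zlib
import numpy as np

DIM = 16  # hypothetical embedding width shared by text and visual tokens

# Hypothetical ethogram-style descriptions (illustrative, not taken from the paper).
ETHOGRAM = {
    "walking": "moves along the ground on all four limbs",
    "climbing": "ascends a tree or vine using hands and feet",
    "sitting": "rests upright with weight on the hindquarters",
}

def embed_text(description, dim=DIM):
    """Stand-in for a text encoder: a deterministic pseudo-embedding per string."""
    seed = zlib.crc32(description.encode("utf-8"))
    return np.random.default_rng(seed).standard_normal(dim)

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def decode(queries, visual_tokens, w_out):
    """One cross-attention step: each behaviour query attends over visual tokens."""
    scale = np.sqrt(queries.shape[-1])
    attn = softmax(queries @ visual_tokens.T / scale, axis=-1)  # (classes, tokens)
    attended = queries + attn @ visual_tokens                   # residual update
    return attended @ w_out                                     # one logit per class

rng = np.random.default_rng(0)
visual_tokens = rng.standard_normal((8, DIM))  # placeholder for video features
w_out = rng.standard_normal(DIM)               # placeholder read-out weights

# Queries start from description embeddings instead of random or name-based vectors.
queries = np.stack([embed_text(text) for text in ETHOGRAM.values()])
logits = decode(queries, visual_tokens, w_out)

multi_class_probs = softmax(logits)                # one behaviour per clip
multi_label_probs = 1.0 / (1.0 + np.exp(-logits))  # independent score per behaviour
```

In this sketch the softmax output corresponds to a multi-class setting and the per-class sigmoid to a multi-label setting. In a trained system the queries, attention projections, and read-out would all be learned, with the text embeddings serving only as the starting point for the queries.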
