
Hierarchical Compositional Representations for Few-shot Action Recognition (2208.09424v3)

Published 19 Aug 2022 in cs.CV

Abstract: Action recognition has recently attracted growing attention because of its broad practical applications in intelligent surveillance and human-computer interaction. However, few-shot action recognition has not been well explored and remains challenging because of data scarcity. In this paper, we propose a novel hierarchical compositional representations (HCR) learning approach for few-shot action recognition. Specifically, we divide a complicated action into several sub-actions by carefully designed hierarchical clustering and further decompose the sub-actions into more fine-grained spatially attentional sub-actions (SAS-actions). Although base classes and novel classes differ substantially, they can share similar patterns in sub-actions or SAS-actions. Furthermore, we adopt the Earth Mover's Distance from the transportation problem to measure the similarity between video samples in terms of sub-action representations. It computes the optimal matching flows between sub-actions as the distance metric, which is favorable for comparing fine-grained patterns. Extensive experiments show that our method achieves state-of-the-art results on the HMDB51, UCF101 and Kinetics datasets.
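
To make the matching idea in the abstract concrete, the following is a minimal sketch of an Earth Mover's Distance between two sets of sub-action features. It is not the paper's implementation. It assumes that both videos have the same number of sub-actions, that every sub-action carries equal weight, and that cosine distance is the ground cost; none of these choices is stated in the abstract. Under those assumptions an optimal flow is a one-to-one matching, so a brute-force search over permutations is enough for small sets. The names `cosine_distance` and `emd_uniform` and the toy feature vectors are hypothetical.

```python
from itertools import permutations
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity between two non-zero feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def emd_uniform(sub_actions_a, sub_actions_b):
    """EMD between two equal-sized sets of sub-action features.

    Assumption (not from the paper): each of the n sub-actions on either
    side has mass 1/n. In that case an optimal transport plan is a
    one-to-one matching, so we search all permutations (small n only).
    Returns (distance, matching), where matching[i] is the index in
    sub_actions_b assigned to sub_actions_a[i].
    """
    n = len(sub_actions_a)
    if n == 0 or n != len(sub_actions_b):
        raise ValueError("this sketch needs two non-empty sets of equal size")

    # Ground cost between every pair of sub-action features.
    cost = [[cosine_distance(a, b) for b in sub_actions_b]
            for a in sub_actions_a]

    # Pick the assignment with the lowest total transport cost.
    best = min(permutations(range(n)),
               key=lambda p: sum(cost[i][p[i]] for i in range(n)))
    total = sum(cost[i][best[i]] for i in range(n))
    return total / n, list(best)


# Toy sub-action features for a query video and a support video.
query = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
support = [[0.0, 2.0], [3.0, 3.0], [2.0, 0.0]]
distance, matching = emd_uniform(query, support)
# matching == [2, 0, 1]; distance is approximately 0 because every query
# vector has a parallel counterpart in the support set.
```

In a few-shot setting, a query video would be assigned to the class whose support video gives the smallest such distance. Unequal set sizes or learned weights would require a general transportation solver instead of the permutation search used here.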
