
Hierarchical Compositional Representations for Few-shot Action Recognition (2208.09424v3)

Published 19 Aug 2022 in cs.CV

Abstract: Action recognition has recently received increasing attention for its broad practical applications in intelligent surveillance and human-computer interaction. However, few-shot action recognition has not been well explored and remains challenging because of data scarcity. In this paper, we propose a novel hierarchical compositional representations (HCR) learning approach for few-shot action recognition. Specifically, we divide a complicated action into several sub-actions by carefully designed hierarchical clustering and further decompose the sub-actions into more fine-grained spatially attentional sub-actions (SAS-actions). Although there are large differences between base classes and novel classes, they can share similar patterns in sub-actions or SAS-actions. Furthermore, we adopt the Earth Mover's Distance from the transportation problem to measure the similarity between video samples in terms of sub-action representations. It computes the optimal matching flows between sub-actions as the distance metric, which is favorable for comparing fine-grained patterns. Extensive experiments show that our method achieves state-of-the-art results on the HMDB51, UCF101 and Kinetics datasets.
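
To make the matching idea in the abstract concrete, here is a minimal sketch of an Earth Mover's Distance between two sets of sub-action feature vectors. It is not the paper's implementation. It assumes that both videos have the same number of sub-actions, that every sub-action carries equal weight, and that the ground cost is one minus cosine similarity; all three are illustrative assumptions, and the names `cosine_distance` and `emd_uniform` are hypothetical. Under equal weights and equal set sizes the transportation problem has an optimal solution that is a one-to-one assignment, so a brute-force search over permutations is enough for a handful of sub-actions.

```python
import itertools
import math


def cosine_distance(u, v):
    """Ground cost between two feature vectors: 1 - cosine similarity (assumed)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def emd_uniform(query, support):
    """EMD between two equally sized sets of vectors with uniform weights.

    Returns the average matching cost and the optimal assignment, where
    assignment[i] is the index of the support sub-action matched to
    query sub-action i. Brute force, so only suitable for small sets.
    """
    n = len(query)
    if n == 0 or n != len(support):
        raise ValueError("both sets must be non-empty and of equal size")
    cost = [[cosine_distance(q, s) for s in support] for q in query]
    best_total, best_perm = math.inf, None
    for perm in itertools.permutations(range(n)):
        total = sum(cost[i][perm[i]] for i in range(n))
        if total < best_total:
            best_total, best_perm = total, perm
    # Each sub-action holds mass 1/n, so the transported cost is the mean.
    return best_total / n, best_perm


# Hypothetical toy features: the same two sub-actions in swapped order.
query_video = [[1.0, 0.0], [0.0, 1.0]]
support_video = [[0.0, 1.0], [1.0, 0.0]]
distance, assignment = emd_uniform(query_video, support_video)
print(distance, assignment)
```

In this toy case the two videos contain the same sub-actions in a different order, so the optimal flows pair them up and the distance is zero, whereas an order-sensitive comparison would report a large distance. Non-uniform weights or unequal set sizes would need a general linear-programming solver instead of a permutation search.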

Authors (5)
  1. Changzhen Li
  2. Jie Zhang
  3. Shuzhe Wu
  4. Xin Jin
  5. Shiguang Shan