Hierarchical Compositional Representations for Few-shot Action Recognition (2208.09424v3)
Abstract: Action recognition has attracted growing attention because of its practical applications in intelligent surveillance and human-computer interaction. Few-shot action recognition, however, remains underexplored and challenging because labeled data are scarce. In this paper, we propose a novel hierarchical compositional representations (HCR) learning approach for few-shot action recognition. Specifically, we divide a complex action into several sub-actions using a carefully designed hierarchical clustering, and further decompose the sub-actions into finer-grained spatially attentional sub-actions (SAS-actions). Although base classes and novel classes differ substantially, they can share similar patterns at the level of sub-actions or SAS-actions. Furthermore, we adopt the Earth Mover's Distance from the transportation problem to measure the similarity between video samples in terms of their sub-action representations. It computes the optimal matching flows between sub-actions and uses them as the distance metric, which is well suited to comparing fine-grained patterns. Extensive experiments show that our method achieves state-of-the-art results on the HMDB51, UCF101 and Kinetics datasets.
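The two steps the abstract describes, grouping frames into sub-actions and then comparing videos by an optimal matching between sub-actions, can be illustrated with a small sketch. This is not the authors' implementation. The abstract does not specify the clustering rule, the ground cost, or the flow weights, so everything below is an assumption made for illustration: the function names (`cluster_into_sub_actions`, `emd_uniform`), the rule of merging the closest pair of temporally adjacent segments, the cosine ground cost, and uniform weights with the same number of sub-actions per video. Under the uniform, equal-size assumption the optimal transport plan is a one-to-one matching, so a brute-force search over permutations is enough for small `k`.

```python
from itertools import permutations
from math import sqrt


def mean_vector(frames):
    """Average a list of equal-length feature vectors."""
    n = len(frames)
    return [sum(col) / n for col in zip(*frames)]


def euclidean(a, b):
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cluster_into_sub_actions(frames, k):
    """Bottom-up clustering that only merges temporally adjacent segments.

    ASSUMPTION: merging the closest pair of neighbouring segment centroids
    is an illustrative rule, not the clustering used in the paper.
    Returns k centroid vectors, one per sub-action, in temporal order.
    """
    segments = [[f] for f in frames]
    while len(segments) > k:
        centroids = [mean_vector(s) for s in segments]
        i = min(range(len(segments) - 1),
                key=lambda j: euclidean(centroids[j], centroids[j + 1]))
        segments[i:i + 2] = [segments[i] + segments[i + 1]]
    return [mean_vector(s) for s in segments]


def cosine_distance(a, b):
    """Ground cost between two non-zero vectors (assumed choice)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def emd_uniform(subs_a, subs_b):
    """Earth Mover's Distance between two equal-sized sets of sub-actions.

    ASSUMPTION: every sub-action carries weight 1/k. In that case the optimal
    flow is a permutation, found here by exhaustive search (small k only).
    Returns (distance, matching) where matching[i] is the index in subs_b
    assigned to sub-action i of subs_a.
    """
    k = len(subs_a)
    if k != len(subs_b):
        raise ValueError("this sketch needs the same number of sub-actions")
    cost = [[cosine_distance(a, b) for b in subs_b] for a in subs_a]
    best = min(permutations(range(k)),
               key=lambda p: sum(cost[i][p[i]] for i in range(k)))
    return sum(cost[i][best[i]] for i in range(k)) / k, best


# Toy per-frame features (hypothetical): video_b shows the same two
# sub-actions as video_a, but in the opposite temporal order.
video_a = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
video_b = [[0.0, 1.0], [0.1, 0.9], [1.0, 0.0], [0.9, 0.1]]

subs_a = cluster_into_sub_actions(video_a, k=2)
subs_b = cluster_into_sub_actions(video_b, k=2)
distance, flow = emd_uniform(subs_a, subs_b)
print(distance, flow)
```

In this toy case the matching pairs each sub-action of `video_a` with its counterpart in `video_b` even though the temporal order differs, which is the kind of fine-grained, part-level comparison the abstract attributes to the flow-based metric. A practical version would replace the brute-force search with a linear-programming or optimal-transport solver and would need non-uniform weights to handle unequal numbers of sub-actions.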
- Changzhen Li
- Jie Zhang
- Shuzhe Wu
- Xin Jin
- Shiguang Shan