Improving Zero-Shot Action Recognition using Human Instruction with Text Description (2301.08874v2)
Abstract: Zero-shot action recognition, which recognizes actions in videos without having seen any training examples of those actions, is gaining wide attention because it saves labeling effort and training time. Nevertheless, zero-shot performance remains unsatisfactory, which limits its practical application. To address this problem, this study proposes a framework that improves zero-shot action recognition using human instruction with text descriptions. The framework relies on manually written descriptions of video content, which incurs some labor cost; in many situations, that cost is worthwhile. Concretely, we manually annotate a text feature for each action class, which can be a word, a phrase, or a sentence. By computing the matching degree between a video and every text feature, we predict the class of the video. The proposed model can also be combined with other models to improve their accuracy, and it can be continuously refined by repeating the human instruction. Experiments on UCF101 and HMDB51 show that our model achieves the best accuracy and improves the accuracy of other models.
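The matching step the abstract describes lends itself to a short sketch. The snippet below is a minimal illustration, not the paper's released code: `encode_text` and `encode_video` are hypothetical stand-ins (random vectors here, so the script runs) for whatever pretrained text and video encoders map into a shared embedding space, and the equal 0.5/0.5 weight used to fuse with another model's scores is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in encoders: random vectors keep the sketch runnable. In practice these
# would be pretrained text/video models embedding into a shared space (e.g.,
# CLIP-style encoders); the names and signatures are placeholders.
def encode_text(description: str, dim: int = 512) -> np.ndarray:
    return rng.standard_normal(dim)

def encode_video(path: str, dim: int = 512) -> np.ndarray:
    return rng.standard_normal(dim)

def matching_degrees(video_emb: np.ndarray, text_embs: np.ndarray) -> np.ndarray:
    """Cosine similarity between one video and every annotated text feature."""
    v = video_emb / np.linalg.norm(video_emb)
    t = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    return t @ v

# One human-annotated text feature per action class: a word, phrase, or sentence.
descriptions = {
    "archery": "a person draws a bow and shoots an arrow at a target",
    "diving": "a person jumps from a board and plunges into the water",
}
class_names = list(descriptions)
text_embs = np.stack([encode_text(d) for d in descriptions.values()])

# Predict the class whose description best matches the video.
scores = matching_degrees(encode_video("video.mp4"), text_embs)
print("predicted class:", class_names[int(np.argmax(scores))])

# Combining with another model, as the abstract mentions: simple late fusion of
# matching degrees with that model's class scores. The 0.5/0.5 weighting is an
# assumption, not taken from the paper.
other_scores = rng.random(len(class_names))  # placeholder for another model's scores
fused = 0.5 * scores + 0.5 * other_scores
print("fused prediction:", class_names[int(np.argmax(fused))])
```

In this sketch, repeating the human instruction amounts to rewriting the descriptions of poorly matched classes and re-running the matching; only the text side changes, so no retraining is needed.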
- Learning to detect unseen object classes by between-class attribute transfer. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 951–958. IEEE, 2009.
- Attribute-based classification for zero-shot visual object categorization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(3):453–465, 2013.
- Zero-shot action recognition with error-correcting output codes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2833–2842, 2017.
- Towards universal representation for unseen action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9436–9445, 2018.
- I know the relationships: Zero-shot action recognition via two-stream graph convolutional networks and knowledge graphs. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 8303–8311, 2019.
- UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012.
- Action recognition with improved trajectories. In Proceedings of the IEEE International Conference on Computer Vision, pages 3551–3558, 2013.
- Dense trajectories and motion boundary descriptors for action recognition. International Journal of Computer Vision, 103(1):60–79, 2013.
- Two-stream convolutional networks for action recognition in videos. In Advances in Neural Information Processing Systems, pages 568–576, 2014.
- Temporal pyramid network for action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 591–600, 2020.
- TEA: Temporal excitation and aggregation for action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 909–918, 2020.
- A motion-aware ConvLSTM network for action recognition. Applied Intelligence, 49(7):2515–2521, 2019.
- A combined multiple action recognition and summarization for surveillance video sequences. Applied Intelligence, 51(2):690–712, 2021.
- Listen to look: Action recognition by previewing audio. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10457–10467, 2020.
- Skeleton-based action recognition with directed graph neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7912–7921, 2019.
- Skeleton-based action recognition with shift graph convolutional network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 183–192, 2020.
- Actional-structural graph convolutional networks for skeleton-based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3595–3603, 2019.
- Disentangling and unifying graph convolutions for skeleton-based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 143–152, 2020.
- A multimodal approach for human activity recognition based on skeleton and RGB data. Pattern Recognition Letters, 131:293–299, 2020.
- Zero-shot action recognition with three-stream graph convolutional networks. Sensors, 21(11):3793, 2021.
- DeViSE: A deep visual-semantic embedding model. In Advances in Neural Information Processing Systems, pages 2121–2129, 2013.
- Semantic autoencoder for zero-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3174–3183, 2017.
- Learning visual-and-semantic knowledge embedding for zero-shot image classification. Applied Intelligence, pages 1–15, 2022.
- Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
- GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, 2014.
- Progressive ensemble networks for zero-shot recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11728–11736, 2019.
- VGSE: Visually-grounded semantic embeddings for zero-shot learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9316–9325, 2022.
- Robust region feature synthesizer for zero-shot object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7622–7631, 2022.
- Action2Vec: A crossmodal embedding approach to action learning. arXiv preprint arXiv:1901.00484, 2019.
- 3D convolutional neural networks for human action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(1):221–231, 2012.
- Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
- Generalized zero-shot learning for action recognition with web-scale video data. World Wide Web, 22(2):807–824, 2019.
- Reformulating zero-shot action recognition for multi-label actions. Advances in Neural Information Processing Systems, 34:25566–25577, 2021.
- Ventral & dorsal stream theory based zero-shot action recognition. Pattern Recognition, 116:107953, 2021.
- Natural language question answering: the view from here. Natural Language Engineering, 7(4):275–300, 2001.
- QuAC: Question answering in context. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2174–2184, Brussels, Belgium, October-November 2018. Association for Computational Linguistics.
- VQA: Visual question answering. In Proceedings of the IEEE International Conference on Computer Vision, pages 2425–2433, 2015.
- Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations, 2015.
- Beyond question-based biases: Assessing multimodal shortcut learning in visual question answering. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1574–1583, 2021.
- Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6077–6086, 2018.
- Quo vadis, action recognition? A new model and the Kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6299–6308, 2017.
- The ImageNet shuffle: Reorganized pre-training for video event detection. In Proceedings of the 2016 ACM on International Conference on Multimedia Retrieval, pages 175–182, 2016.
- ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009.
- OpenPose: Realtime multi-person 2D pose estimation using part affinity fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 1–1, 2019.
- Spatial temporal graph convolutional networks for skeleton-based action recognition. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
- Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
- VaTeX: A large-scale, high-quality multilingual dataset for video-and-language research. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4581–4591, 2019.
- BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics.
- HMDB: A large video database for human motion recognition. In 2011 International Conference on Computer Vision, pages 2556–2563. IEEE, 2011.
- A short note about Kinetics-600. arXiv preprint arXiv:1808.01340, 2018.
- The Kinetics human action video dataset. arXiv preprint arXiv:1705.06950, 2017.
- An embarrassingly simple approach to zero-shot learning. In International Conference on Machine Learning, pages 2152–2161. PMLR, 2015.
- Out-of-distribution detection for generalized zero-shot action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9985–9993, 2019.
- Zero-shot learning for action recognition using synthesized features. Neurocomputing, 390:117–130, 2020.
- VDARN: Video disentangling attentive relation network for few-shot and zero-shot action recognition. Ad Hoc Networks, 113:102380, 2021.
- Elaborative rehearsal for zero-shot action recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 13638–13647, 2021.
- Learning using privileged information for zero-shot action recognition. In Proceedings of the Asian Conference on Computer Vision, pages 773–788, 2022.
- Cross-modal representation learning for zero-shot action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19978–19988, 2022.
- Nan Wu
- Hiroshi Kera
- Kazuhiko Kawamoto