ActionHub: A Large-scale Action Video Description Dataset for Zero-shot Action Recognition (2401.11654v1)
Abstract: Zero-shot action recognition (ZSAR) aims to learn an alignment model between videos and class descriptions of seen actions that is transferable to unseen actions. The text queries (class descriptions) used in existing ZSAR works, however, are often short action names that fail to capture the rich semantics in the videos, leading to misalignment. With the intuition that video content descriptions (e.g., video captions) can provide rich contextual information about the visual concepts in videos, we propose to utilize human-annotated video descriptions to enrich the class descriptions of each action. However, all existing action video description datasets are limited in the number of actions they cover and in the semantic richness of their descriptions. To this end, we collect a large-scale action video description dataset named ActionHub, which covers 1,211 common actions and provides 3.6 million action video descriptions. With the proposed ActionHub dataset, we further propose a novel Cross-modality and Cross-action Modeling (CoCo) framework for ZSAR, which consists of a Dual Cross-modality Alignment module and a Cross-action Invariance Mining module. Specifically, the Dual Cross-modality Alignment module utilizes both action labels and video descriptions from ActionHub to obtain rich class semantic features for feature alignment. The Cross-action Invariance Mining module exploits a cycle-reconstruction process between the class semantic feature spaces of seen and unseen actions, guiding the model to learn cross-action invariant representations. Extensive experimental results demonstrate that our CoCo framework significantly outperforms the state of the art on three popular ZSAR benchmarks (i.e., Kinetics-ZSAR, UCF101, and HMDB51) under two different ZSAR learning protocols. We will release our code, models, and the proposed ActionHub dataset.
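To make the framework description above more concrete, the sketch below illustrates, in PyTorch, how the two modules could fit together: a dual cross-modality alignment that fuses action-label and video-description text features into class semantics, and a cycle-reconstruction loss between seen and unseen class semantic spaces. This is a minimal sketch under our own assumptions; the class names, fusion by summation, soft-attention reconstruction, dimensions, and loss weight are illustrative choices, not the authors' released implementation.

```python
# Illustrative sketch only; names, dimensions, and weighting are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DualCrossModalityAlignment(nn.Module):
    """Align video features with class semantics built from both
    action-label text features and video-description text features."""

    def __init__(self, dim: int = 512):
        super().__init__()
        self.video_proj = nn.Linear(dim, dim)  # projects video features
        self.label_proj = nn.Linear(dim, dim)  # projects action-label features
        self.desc_proj = nn.Linear(dim, dim)   # projects description features

    def forward(self, video_feat, label_feat, desc_feat):
        v = F.normalize(self.video_proj(video_feat), dim=-1)
        # Fuse label and description semantics (simple additive fusion here).
        c = F.normalize(self.label_proj(label_feat) + self.desc_proj(desc_feat), dim=-1)
        return v @ c.t()  # video-to-class similarity logits


def cross_action_invariance_loss(seen_sem, unseen_sem, video_feat):
    """Cycle reconstruction: express a video via unseen-class semantics,
    re-express that reconstruction via seen-class semantics, and penalize
    the deviation from the original video feature."""
    v = F.normalize(video_feat, dim=-1)
    s = F.normalize(seen_sem, dim=-1)
    u = F.normalize(unseen_sem, dim=-1)
    w_unseen = F.softmax(v @ u.t(), dim=-1)        # soft assignment to unseen classes
    v_unseen = w_unseen @ u                        # reconstruction in unseen space
    w_seen = F.softmax(v_unseen @ s.t(), dim=-1)   # soft assignment back to seen classes
    v_cycle = w_seen @ s                           # cycle reconstruction in seen space
    return F.mse_loss(v_cycle, v)


if __name__ == "__main__":
    B, C_seen, C_unseen, D = 4, 10, 5, 512
    align = DualCrossModalityAlignment(D)
    video = torch.randn(B, D)          # pooled video features
    labels = torch.randn(C_seen, D)    # action-label text features
    descs = torch.randn(C_seen, D)     # ActionHub description text features
    logits = align(video, labels, descs)                      # (B, C_seen)
    ce = F.cross_entropy(logits, torch.randint(0, C_seen, (B,)))
    inv = cross_action_invariance_loss(labels, torch.randn(C_unseen, D), video)
    loss = ce + 0.1 * inv              # 0.1 is an assumed loss weight
    print(loss.item())
```

The intent of the cycle term is that a video representation which can be faithfully reconstructed through both seen- and unseen-class semantic spaces carries information that transfers across actions; the exact reconstruction and alignment mechanics in the paper may differ from this simplified version.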
Authors: Jiaming Zhou, Junwei Liang, Kun-Yu Lin, Jinrui Yang, Wei-Shi Zheng