Multi-view Action Recognition via Directed Gromov-Wasserstein Discrepancy (2405.01337v1)
Abstract: Action recognition has become one of the popular research topics in computer vision. There are various methods based on Convolutional Networks and self-attention mechanisms as Transformers to solve both spatial and temporal dimensions problems of action recognition tasks that achieve competitive performances. However, these methods lack a guarantee of the correctness of the action subject that the models give attention to, i.e., how to ensure an action recognition model focuses on the proper action subject to make a reasonable action prediction. In this paper, we propose a multi-view attention consistency method that computes the similarity between two attentions from two different views of the action videos using Directed Gromov-Wasserstein Discrepancy. Furthermore, our approach applies the idea of Neural Radiance Field to implicitly render the features from novel views when training on single-view datasets. Therefore, the contributions in this work are three-fold. Firstly, we introduce the multi-view attention consistency to solve the problem of reasonable prediction in action recognition. Secondly, we define a new metric for multi-view consistent attention using Directed Gromov-Wasserstein Discrepancy. Thirdly, we built an action recognition model based on Video Transformers and Neural Radiance Fields. Compared to the recent action recognition methods, the proposed approach achieves state-of-the-art results on three large-scale datasets, i.e., Jester, Something-Something V2, and Kinetics-400.
- Vivit: A video vision transformer. In Proceedings of the IEEE/CVF international conference on computer vision, pages 6836–6846, 2021.
- Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
- Is space-time attention all you need for video understanding? In ICML, volume 2, page 4, 2021.
- A short note about kinetics-600. arXiv preprint arXiv:1808.01340, 2018.
- A short note on the kinetics-700 human action dataset. arXiv preprint arXiv:1907.06987, 2019.
- Quo vadis, action recognition? a new model and the kinetics dataset. In proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6299–6308, 2017.
- React: Recognize every action everywhere all at once. arXiv preprint arXiv:2312.00188, 2023.
- Sogar: Self-supervised spatiotemporal attention-based social group activity recognition. arXiv preprint arXiv:2305.06310, 2023.
- Spartan: Self-supervised spatiotemporal transformers approach to group activity recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5157–5167, 2023.
- Group equivariant convolutional networks. In International conference on machine learning, pages 2990–2999. PMLR, 2016.
- Attention consistency on visual corruptions for single-source domain generalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4165–4174, 2022.
- Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. Advances in neural information processing systems, 26, 2013.
- Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. Ieee, 2009.
- Exploiting cyclic symmetry in convolutional neural networks. In International conference on machine learning, pages 1889–1898. PMLR, 2016.
- Long-term recurrent convolutional networks for visual recognition and description. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2625–2634, 2015.
- Bernard Ghanem Fabian Caba Heilbron, Victor Escorcia and Juan Carlos Niebles. Activitynet: A large-scale video benchmark for human activity understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 961–970, 2015.
- Multiscale vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6824–6835, 2021.
- More is less: Learning efficient video representations by big-little network and depthwise temporal aggregation. Advances in Neural Information Processing Systems, 32, 2019.
- Christoph Feichtenhofer. X3d: Expanding architectures for efficient video recognition. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 203–213, 2020.
- Slowfast networks for video recognition. In Proceedings of the IEEE/CVF international conference on computer vision, pages 6202–6211, 2019.
- The” something something” video database for learning and evaluating visual common sense. In Proceedings of the IEEE international conference on computer vision, pages 5842–5850, 2017.
- Ava: A video dataset of spatio-temporally localized atomic visual actions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6047–6056, 2018.
- Stylenerf: A style-based 3d aware generator for high-resolution image synthesis. In International Conference on Learning Representations, 2022.
- Visual attention consistency under image transforms for multi-label image classification. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 729–739, 2019.
- Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
- Determining optical flow. Artificial intelligence, 17(1-3):185–203, 1981.
- Squeezenet: Alexnet-level accuracy with 50x fewer parameters and¡ 0.5 mb model size. arXiv preprint arXiv:1602.07360, 2016.
- 3d convolutional neural networks for human action recognition. IEEE transactions on pattern analysis and machine intelligence, 35(1):221–231, 2012.
- Large-scale video classification with convolutional neural networks. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pages 1725–1732, 2014.
- Analyzing and improving the image quality of stylegan. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8110–8119, 2020.
- The kinetics human action video dataset. arXiv preprint arXiv:1705.06950, 2017.
- Resource efficient 3d convolutional neural networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, pages 0–0, 2019.
- Motionsqueeze: Neural motion feature learning for video understanding. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XVI 16, pages 345–362. Springer, 2020.
- Beyond gaussian pyramid: Multi-skip feature stacking for action recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 204–212, 2015.
- Unsupervised deep metric learning with transformed attention consistency and contrastive clustering loss. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XI 16, pages 141–157. Springer, 2020.
- Barf: Bundle-adjusting neural radiance fields. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5741–5751, 2021.
- Tsm: Temporal shift module for efficient video understanding. In Proceedings of the IEEE/CVF international conference on computer vision, pages 7083–7093, 2019.
- Bmn: Boundary-matching network for temporal action proposal generation. In Proceedings of the IEEE International Conference on Computer Vision, pages 3889–3898, 2019.
- Bsn: Boundary sensitive network for temporal action proposal generation. In Proceedings of the European Conference on Computer Vision (ECCV), pages 3–19, 2018.
- Ntu rgb+d 120: A large-scale benchmark for 3d human activity understanding. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(10):2684–2701, 2020.
- Video swin transformer. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3202–3211, 2022.
- Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
- Rectifier nonlinearities improve neural network acoustic models. In Proc. icml, volume 30, page 3. Atlanta, Georgia, USA, 2013.
- Rotation equivariant vector field networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 5048–5057, 2017.
- The jester dataset: A large-scale video dataset of human gestures. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, pages 0–0, 2019.
- Nelson Max. Optical models for direct volume rendering. IEEE Transactions on Visualization and Computer Graphics, 1(2):99–108, 1995.
- Facundo Mémoli. Gromov–wasserstein distances and the metric approach to object matching. Foundations of computational mathematics, 11:417–487, 2011.
- Nerf: Representing scenes as neural radiance fields for view synthesis. In ECCV, 2020.
- Atcon: Attention consistency for vision models. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 1880–1889, 2023.
- Multi-label co-regularization for semi-supervised facial action unit recognition. Advances in neural information processing systems, 32, 2019.
- Nerfies: Deformable neural radiance fields. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5865–5874, 2021.
- Action recognition with stacked fisher vectors. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pages 581–595. Springer, 2014.
- Gromov-wasserstein averaging of kernel and distance matrices. In International conference on machine learning, pages 2664–2672. PMLR, 2016.
- Adapting self-supervised vision transformers by probing attention-conditioned masking consistency. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, Advances in Neural Information Processing Systems, 2022.
- D-nerf: Neural radiance fields for dynamic scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10318–10327, 2021.
- Deep co-training for semi-supervised image recognition. In Proceedings of the european conference on computer vision (eccv), pages 135–152, 2018.
- Dyglip: A dynamic graph model with link prediction for accurate multi-camera multiple object tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13784–13793, 2021.
- A review on applications of activity recognition systems with regard to performance and evaluation. International Journal of Distributed Sensor Networks, 12(8):1550147716665520, 2016.
- Grad-cam: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE international conference on computer vision, pages 618–626, 2017.
- Ntu rgb+d: A large scale dataset for 3d human activity analysis. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1010–1019, 2016.
- Two-stream convolutional networks for action recognition in videos. Advances in neural information processing systems, 27, 2014.
- Entropic metric alignment for correspondence problems. ACM Transactions on Graphics (ToG), 35(4):1–13, 2016.
- Actor-centric relation network. In Proceedings of the European Conference on Computer Vision (ECCV), pages 318–334, 2018.
- Human action recognition from various data modalities: A review. arXiv preprint arXiv:2012.11866, 2020.
- Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. Advances in neural information processing systems, 30, 2017.
- Learning spatiotemporal features with 3d convolutional networks. In Proceedings of the IEEE international conference on computer vision, pages 4489–4497, 2015.
- Non-rigid neural radiance fields: Reconstruction and novel view synthesis of a dynamic scene from monocular video. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 12959–12970, 2021.
- Grf: Learning a general radiance field for 3d representation and rendering. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15182–15192, 2021.
- Direcformer: A directed attention in transformer approach to robust action recognition. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 20030–20040, 2022.
- Cross-view action recognition understanding from exocentric to egocentric perspective. arXiv preprint arXiv:2305.15699, 2023.
- Falcon: Fairness learning via contrastive attention approach to continual semantic scene understanding in open world. arXiv preprint arXiv:2311.15965, 2023.
- Attention is all you need. Advances in neural information processing systems, 30, 2017.
- Action recognition by dense trajectories. In CVPR 2011, pages 3169–3176, 2011.
- Action recognition with improved trajectories. In Proceedings of the IEEE international conference on computer vision, pages 3551–3558, 2013.
- Cross-view action modeling, learning and recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2649–2656, 2014.
- Sharpen focus: Learning with attention separability and consistency. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 512–521, 2019.
- Temporal segment networks for action recognition in videos. IEEE transactions on pattern analysis and machine intelligence, 41(11):2740–2755, 2018.
- NeRF−−--- -: Neural radiance fields without known camera parameters. arXiv preprint arXiv:2102.07064, 2021.
- Cubenet: Equivariance to 3d rotation and translation. In Proceedings of the European Conference on Computer Vision (ECCV), pages 567–584, 2018.
- Harmonic networks: Deep translation and rotation equivariance. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5028–5037, 2017.
- Long-term feature banks for detailed video understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 284–293, 2019.
- Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1492–1500, 2017.
- Neural fields in visual computing and beyond. In Computer Graphics Forum, volume 41, pages 641–676. Wiley Online Library, 2022.
- Exploiting attention-consistency loss for spatial-temporal stream action recognition. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), 18(2s):1–15, 2022.
- inerf: Inverting neural radiance fields for pose estimation. In 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 1323–1330. IEEE, 2021.
- pixelnerf: Neural radiance fields from one or few images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4578–4587, 2021.
- Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer. arXiv preprint arXiv:1612.03928, 2016.
- Actionformer: Localizing moments of actions with transformers. arXiv preprint arXiv:2202.07925, 2022.
- Deepacg: Co-saliency detection via semantic-aware contrast gromov-wasserstein distance. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13703–13712, 2021.
- Nerf++: Analyzing and improving neural radiance fields. arXiv preprint arXiv:2010.07492, 2020.
- Temporal action detection with structured segment networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 2914–2923, 2017.
- Learning deep features for discriminative localization. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2921–2929, 2016.