Action-slot: Visual Action-centric Representations for Multi-label Atomic Activity Recognition in Traffic Scenes (2311.17948v2)
Abstract: We study multi-label atomic activity recognition. Despite notable progress in action recognition, recognizing atomic activities remains challenging because existing methods lack a holistic understanding of both the motions of multiple road users and their contextual information. We introduce Action-slot, a slot attention-based approach that learns visual action-centric representations capturing both motion and contextual information. Our key idea is to design action slots that attend to regions where atomic activities occur, without explicit perception guidance. To further enhance slot attention, we introduce a background slot that competes with the action slots, which helps training avoid unnecessary focus on background regions devoid of activities. However, the imbalanced class distribution of the existing dataset hampers the assessment of rare activities. To address this limitation, we collect a synthetic dataset called TACO, which is four times larger than OATS and features a balanced distribution of atomic activities. To validate the effectiveness of our method, we conduct comprehensive experiments and ablation studies against various action recognition baselines. We also show that pretraining representations on TACO improves multi-label atomic activity recognition on real-world datasets. We will release our source code and dataset. Videos of the visualizations are available on the project page: https://hcis-lab.github.io/Action-slot/
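The competition between action slots and a background slot can be illustrated with a minimal sketch. This is not the authors' implementation: it is a single generic slot-attention step in NumPy, and the function name `slot_competition`, the feature sizes, and the convention of storing the background slot as the last row are all assumptions made for illustration. The point it shows is that the softmax is taken over slots, so each spatio-temporal location distributes its attention among the action slots and the background slot.

```python
import numpy as np

def slot_competition(features, slots):
    """One illustrative attention step (hypothetical helper, not from the paper).

    features: (N, D) flattened spatio-temporal features.
    slots:    (K + 1, D) slot vectors; by assumption the last row is the background slot.
    Returns the attention map (K + 1, N) and the updated slots (K + 1, D).
    """
    d = features.shape[1]
    logits = slots @ features.T / np.sqrt(d)        # (K + 1, N) similarity scores
    logits -= logits.max(axis=0, keepdims=True)     # numerical stability
    attn = np.exp(logits)
    attn /= attn.sum(axis=0, keepdims=True)         # softmax over slots: slots compete per location
    # Each slot becomes a weighted mean of the features it attends to.
    weights = attn / (attn.sum(axis=1, keepdims=True) + 1e-8)
    updated = weights @ features                    # (K + 1, D)
    return attn, updated

rng = np.random.default_rng(0)
num_actions = 4                                     # assumed number of action slots
features = rng.normal(size=(32, 16))                # assumed: 32 locations, 16 channels
slots = rng.normal(size=(num_actions + 1, 16))      # action slots plus one background slot
attn, updated = slot_competition(features, slots)
action_slots, background_slot = updated[:-1], updated[-1]
```

Because every column of `attn` sums to one, attention mass absorbed by the background slot is mass the action slots do not receive, which is the sense in which the background slot competes with them.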