Text-Driven Traffic Anomaly Detection with Temporal High-Frequency Modeling in Driving Videos (2401.03522v2)
Abstract: Traffic anomaly detection (TAD) in driving videos is critical for ensuring the safety of autonomous driving and advanced driver assistance systems. Previous single-stage TAD methods rely primarily on frame prediction, making them vulnerable to interference from the dynamic background induced by the rapid motion of the dashboard camera. Two-stage TAD methods appear to be a natural remedy, mitigating such interference by pre-extracting background-independent features (such as bounding boxes and optical flow) with perception algorithms, but they are sensitive to the performance of those first-stage algorithms and prone to error propagation. In this paper, we introduce TTHF, a novel single-stage method that aligns video clips with text prompts, offering a new perspective on traffic anomaly detection. Unlike previous approaches, the supervisory signal of our method is derived from language rather than orthogonal one-hot vectors, providing a more comprehensive representation. Furthermore, for the visual representation, we propose to model the temporal high frequency of driving videos, which captures the dynamic changes of driving scenes, enhances the perception of driving behavior, and significantly improves the detection of traffic anomalies. In addition, to better perceive various types of traffic anomalies, we carefully design an attentive anomaly focusing mechanism that visually and linguistically guides the model to adaptively focus on the visual context of interest, thereby facilitating the detection of traffic anomalies. Our proposed TTHF achieves promising performance, outperforming state-of-the-art competitors by +5.4% AUC on the DoTA dataset and generalizing well to the DADA dataset.
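
The abstract does not detail the implementation, but its two core ingredients, temporal high-frequency modeling and text-driven alignment, can be illustrated with a minimal sketch. Below, temporal high frequency is approximated by adjacent-frame differencing, and the anomaly score is a CLIP-style cosine similarity between a clip embedding and prompt embeddings; all function names, tensor shapes, and the temperature value are illustrative assumptions, not the authors' code.

```python
# Minimal sketch (assumptions, not the TTHF implementation) of two ideas from the
# abstract: (1) a crude temporal high-pass via frame differencing, and
# (2) CLIP-style alignment of clip embeddings with text-prompt embeddings.
import torch
import torch.nn.functional as F


def temporal_high_frequency(clip: torch.Tensor) -> torch.Tensor:
    """clip: (T, C, H, W). Adjacent-frame differences emphasize fast temporal
    changes while suppressing the (largely static) scene appearance."""
    return clip[1:] - clip[:-1]  # (T-1, C, H, W)


def anomaly_scores(video_emb: torch.Tensor, text_embs: torch.Tensor,
                   temperature: float = 0.07) -> torch.Tensor:
    """video_emb: (B, D) clip features; text_embs: (K, D) prompt features,
    e.g. one prompt per anomaly category plus a 'normal driving' prompt.
    Returns a (B, K) distribution over prompts via cosine similarity."""
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_embs, dim=-1)
    return (v @ t.T / temperature).softmax(dim=-1)


# Toy usage with random tensors standing in for encoder outputs.
frames = torch.randn(8, 3, 224, 224)      # one 8-frame driving clip
hf = temporal_high_frequency(frames)       # dynamic-change cue for the visual branch
scores = anomaly_scores(torch.randn(2, 512), torch.randn(5, 512))
print(hf.shape, scores.shape)              # (7, 3, 224, 224), (2, 5)
```

Using language embeddings as the target space (rather than one-hot labels) is what lets a single similarity computation score arbitrary anomaly descriptions, which is the property the abstract attributes to text-driven supervision.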