Abductive Ego-View Accident Video Understanding for Safe Driving Perception (2403.00436v1)
Abstract: We present MM-AU, a novel dataset for Multi-Modal Accident video Understanding. MM-AU contains 11,727 in-the-wild ego-view accident videos, each with temporally aligned text descriptions. We annotate over 2.23 million object boxes and 58,650 pairs of video-based accident reasons, covering 58 accident categories. MM-AU supports various accident understanding tasks, particularly multimodal video diffusion to understand accident cause-effect chains for safe driving. With MM-AU, we present an Abductive accident Video understanding framework for Safe Driving perception (AdVersa-SD). AdVersa-SD performs video diffusion via an Object-Centric Video Diffusion (OAVD) method, which is driven by an abductive CLIP model. This model uses a contrastive interaction loss to learn the pair co-occurrence of normal, near-accident, and accident frames with the corresponding text descriptions, such as accident reasons, prevention advice, and accident categories. OAVD enforces causal region learning while fixing the content of the original frame background in video generation, in order to find the dominant cause-effect chain for a given accident. Extensive experiments verify the abductive ability of AdVersa-SD and the superiority of OAVD over state-of-the-art diffusion models. Additionally, we provide careful benchmark evaluations for object detection and accident reason answering, because AdVersa-SD relies on precise object and accident reason information.
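The abstract does not spell out the form of the contrastive interaction loss. As a rough illustration of the general idea only, the sketch below implements the generic CLIP-style symmetric contrastive loss over matched frame and text embeddings. It is not the paper's loss: the function names, the temperature of 0.07, and the toy two-dimensional embeddings are all assumptions made for this example.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def similarity_logits(frame_embs, text_embs, temperature=0.07):
    """Cosine similarity between every frame and every text, divided by a temperature."""
    frames = [l2_normalize(v) for v in frame_embs]
    texts = [l2_normalize(v) for v in text_embs]
    return [
        [sum(a * b for a, b in zip(f, t)) / temperature for t in texts]
        for f in frames
    ]

def diagonal_cross_entropy(logits):
    """Mean cross-entropy where row i's correct class is column i (the matched pair)."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_z = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_z - row[i]
    return total / len(logits)

def symmetric_contrastive_loss(frame_embs, text_embs, temperature=0.07):
    """Average of the frame-to-text and text-to-frame cross-entropy terms."""
    logits = similarity_logits(frame_embs, text_embs, temperature)
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(transposed))

# Hypothetical toy embeddings: frame i is described by text i.
frames = [[1.0, 0.0], [0.0, 1.0]]
matched_texts = [[1.0, 0.0], [0.0, 1.0]]
swapped_texts = [[0.0, 1.0], [1.0, 0.0]]

print(symmetric_contrastive_loss(frames, matched_texts))
print(symmetric_contrastive_loss(frames, swapped_texts))
```

With these toy inputs, the loss for correctly paired frames and texts is far smaller than for the swapped pairing, which is the behavior a co-occurrence objective of this kind rewards.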