The 8th AI City Challenge (2404.09432v1)
Abstract: The eighth AI City Challenge highlighted the convergence of computer vision and artificial intelligence in areas such as retail, warehouse settings, and Intelligent Traffic Systems (ITS), presenting significant research opportunities. The 2024 edition featured five tracks and attracted unprecedented interest from 726 teams in 47 countries and regions. Track 1 dealt with multi-target multi-camera (MTMC) people tracking, with significant enhancements in the number of cameras, the number of characters, 3D annotation, and camera matrices, alongside new rules for 3D tracking and for encouraging online tracking algorithms. Track 2 introduced dense video captioning for traffic safety, focusing on pedestrian accidents and using multi-camera feeds to improve insights for insurance and prevention. Track 3 required teams to classify driver actions in naturalistic driving analysis. Track 4 explored fish-eye camera analytics using the FishEye8K dataset. Track 5 focused on detecting motorcycle helmet rule violations. The challenge used two leaderboards to showcase methods, and participants set new benchmarks, some surpassing the existing state of the art.