AIDE: An Automatic Data Engine for Object Detection in Autonomous Driving (2403.17373v1)
Abstract: Autonomous vehicle (AV) systems rely on robust perception models as a cornerstone of safety assurance. However, objects encountered on the road exhibit a long-tailed distribution, with rare or unseen categories posing challenges to a deployed perception model. This necessitates an expensive process of continuously curating and annotating data with significant human effort. We propose to leverage recent advances in vision-language models and large language models (LLMs) to design an Automatic Data Engine (AIDE) that automatically identifies issues, efficiently curates data, improves the model through auto-labeling, and verifies the model through generation of diverse scenarios. This process operates iteratively, allowing for continuous self-improvement of the model. We further establish a benchmark for open-world detection on AV datasets to comprehensively evaluate various learning paradigms, demonstrating our method's superior performance at a reduced cost.