
OpenAnnotate3D: Open-Vocabulary Auto-Labeling System for Multi-modal 3D Data (2310.13398v1)

Published 20 Oct 2023 in cs.CV

Abstract: In the era of big data and large models, automatic annotation capabilities for multi-modal data are of great significance for real-world AI-driven applications, such as autonomous driving and embodied AI. Unlike traditional closed-set annotation, open-vocabulary annotation is essential to achieve human-level cognition capability. However, there are few open-vocabulary auto-labeling systems for multi-modal 3D data. In this paper, we introduce OpenAnnotate3D, an open-source, open-vocabulary auto-labeling system that can automatically generate 2D masks, 3D masks, and 3D bounding box annotations for vision and point cloud data. Our system integrates the chain-of-thought capabilities of large language models (LLMs) and the cross-modality capabilities of vision-language models (VLMs). To the best of our knowledge, OpenAnnotate3D is one of the pioneering works for open-vocabulary multi-modal 3D auto-labeling. We conduct comprehensive evaluations on both public and in-house real-world datasets, which demonstrate that the system significantly improves annotation efficiency compared to manual annotation while providing accurate open-vocabulary auto-annotation results.
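The abstract describes a pipeline that turns open-vocabulary prompts into 2D masks, 3D masks, and 3D bounding boxes by fusing image and point cloud data. The sketch below illustrates one plausible version of the geometric core of such a system: lifting a VLM-produced 2D instance mask into a 3D point-cloud mask via camera-LiDAR calibration and fitting a box around the selected points. This is not the authors' implementation; the function names, calibration inputs (T_cam_lidar, K), and the axis-aligned box fit are assumptions made for illustration.

```python
import numpy as np

def project_points(points_lidar, T_cam_lidar, K):
    """Project (N, 3) LiDAR points into the image plane; returns (N, 2) pixels and depths."""
    pts_h = np.hstack([points_lidar, np.ones((points_lidar.shape[0], 1))])
    pts_cam = (T_cam_lidar @ pts_h.T).T[:, :3]            # points in the camera frame
    depth = pts_cam[:, 2]
    uv = (K @ pts_cam.T).T[:, :2] / np.clip(depth[:, None], 1e-6, None)
    return uv, depth

def lift_mask_to_3d(points_lidar, mask_2d, T_cam_lidar, K):
    """Select the LiDAR points whose projections fall inside a boolean 2D instance mask."""
    uv, depth = project_points(points_lidar, T_cam_lidar, K)
    h, w = mask_2d.shape
    u = np.round(uv[:, 0]).astype(int)
    v = np.round(uv[:, 1]).astype(int)
    valid = (depth > 0) & (u >= 0) & (u < w) & (v >= 0) & (v < h)
    selected = np.zeros(points_lidar.shape[0], dtype=bool)
    selected[valid] = mask_2d[v[valid], u[valid]]
    return selected

def fit_axis_aligned_box(points):
    """Axis-aligned 3D box (min corner, max corner) around the selected points."""
    return points.min(axis=0), points.max(axis=0)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    points = rng.uniform(-10, 10, size=(5000, 3))          # stand-in for a LiDAR sweep
    K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
    T_cam_lidar = np.eye(4)                                # assume co-located, aligned sensors
    mask_2d = np.zeros((480, 640), dtype=bool)
    mask_2d[200:280, 300:340] = True                       # stand-in for a VLM-produced 2D mask
    in_mask = lift_mask_to_3d(points, mask_2d, T_cam_lidar, K)
    if in_mask.any():
        lo, hi = fit_axis_aligned_box(points[in_mask])
        print("3D mask points:", int(in_mask.sum()), "box corners:", lo, hi)
```

In a full system, the 2D mask would come from an open-vocabulary detector/segmenter driven by an LLM-interpreted prompt, and the box fit would typically be oriented and refined; the snippet only shows the 2D-to-3D lifting step under the stated assumptions.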

