Enhancing 3D Object Detection with 2D Detection-Guided Query Anchors (2403.06093v1)

Published 10 Mar 2024 in cs.CV

Abstract: Multi-camera-based 3D object detection has made notable progress in the past several years. However, we observe that there are cases (e.g. faraway regions) in which popular 2D object detectors are more reliable than state-of-the-art 3D detectors. In this paper, to improve the performance of query-based 3D object detectors, we present a novel query generating approach termed QAF2D, which infers 3D query anchors from 2D detection results. A 2D bounding box of an object in an image is lifted to a set of 3D anchors by associating each sampled point within the box with depth, yaw angle, and size candidates. Then, the validity of each 3D anchor is verified by comparing its projection in the image with its corresponding 2D box, and only valid anchors are kept and used to construct queries. The class information of the 2D bounding box associated with each query is also utilized to match the predicted boxes with ground truth for the set-based loss. The image feature extraction backbone is shared between the 3D detector and 2D detector by adding a small number of prompt parameters. We integrate QAF2D into three popular query-based 3D object detectors and carry out comprehensive evaluations on the nuScenes dataset. The largest improvement that QAF2D can bring about on the nuScenes validation subset is $2.3\%$ NDS and $2.7\%$ mAP. Code is available at https://github.com/nuLLMax-vision/QAF2D.
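The anchor-generation step described in the abstract can be pictured with a short sketch. The NumPy code below is an illustrative reconstruction, not the authors' implementation (which is in the linked repository): the sampling grid, the depth/yaw/size candidate sets, the IoU threshold, and every function name here are assumptions made for clarity. It lifts one 2D detection to candidate 3D anchors and keeps only the anchors whose image projection agrees with the original 2D box.

```python
import numpy as np

def iou_2d(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) pixel boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def project_3d_box_to_image(anchor, intrinsics, cam_to_world):
    """Project an (x, y, z, l, w, h, yaw) box to its enclosing 2D rectangle.

    Behind-camera handling is omitted in this sketch.
    """
    x, y, z, l, w, h, yaw = anchor
    # 8 corners in the object frame, rotated about the vertical (world z) axis
    # and translated to the box center.
    dx, dy, dz = np.meshgrid([-l / 2, l / 2], [-w / 2, w / 2], [-h / 2, h / 2])
    corners = np.stack([dx.ravel(), dy.ravel(), dz.ravel()])          # (3, 8)
    rot = np.array([[np.cos(yaw), -np.sin(yaw), 0.0],
                    [np.sin(yaw),  np.cos(yaw), 0.0],
                    [0.0,          0.0,         1.0]])
    corners = rot @ corners + np.array([[x], [y], [z]])
    # World -> camera -> pixel coordinates.
    corners_hom = np.vstack([corners, np.ones((1, 8))])
    cam = (np.linalg.inv(cam_to_world) @ corners_hom)[:3]
    pix = intrinsics @ cam
    pix = pix[:2] / np.clip(pix[2:], 1e-6, None)
    return np.array([pix[0].min(), pix[1].min(), pix[0].max(), pix[1].max()])

def lift_2d_box_to_anchors(box_2d, intrinsics, cam_to_world,
                           depth_candidates, yaw_candidates, size_candidates,
                           grid=4, iou_thr=0.5):
    """Lift one 2D box to validated 3D query anchors (x, y, z, l, w, h, yaw).

    Illustrative sketch of the idea only; candidate sets, grid size, and the
    IoU threshold are placeholders, not the paper's settings.
    """
    x1, y1, x2, y2 = box_2d
    anchors = []
    # Sample a grid of points inside the 2D box; pair each point with every
    # depth, yaw, and size candidate.
    for u in np.linspace(x1, x2, grid):
        for v in np.linspace(y1, y2, grid):
            for d in depth_candidates:
                # Back-project the pixel (u, v) at candidate depth d to a 3D center.
                c_cam = np.linalg.inv(intrinsics) @ np.array([u * d, v * d, d])
                c_world = (cam_to_world @ np.append(c_cam, 1.0))[:3]
                for yaw in yaw_candidates:
                    for (l, w, h) in size_candidates:
                        anchor = np.array([*c_world, l, w, h, yaw])
                        # Keep the anchor only if its image projection overlaps
                        # the original 2D detection sufficiently.
                        proj = project_3d_box_to_image(anchor, intrinsics, cam_to_world)
                        if iou_2d(proj, box_2d) >= iou_thr:
                            anchors.append(anchor)
    return np.stack(anchors) if anchors else np.empty((0, 7))
```

In the method as the abstract describes it, the surviving anchors are used to construct the decoder queries, and the class label of the originating 2D box is carried along so that predicted and ground-truth boxes of the same class are matched in the set-based loss.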
