Papers
Topics
Authors
Recent
Gemini 2.5 Flash
Gemini 2.5 Flash
167 tokens/sec
GPT-4o
7 tokens/sec
Gemini 2.5 Pro Pro
42 tokens/sec
o3 Pro
4 tokens/sec
GPT-4.1 Pro
38 tokens/sec
DeepSeek R1 via Azure Pro
28 tokens/sec
2000 character limit reached

RGBD Objects in the Wild: Scaling Real-World 3D Object Learning from RGB-D Videos (2401.12592v3)

Published 23 Jan 2024 in cs.CV

Abstract: We introduce a new RGB-D object dataset captured in the wild called WildRGB-D. Unlike most existing real-world object-centric datasets which only come with RGB capturing, the direct capture of the depth channel allows better 3D annotations and broader downstream applications. WildRGB-D comprises large-scale category-level RGB-D object videos, which are taken using an iPhone to go around the objects in 360 degrees. It contains around 8500 recorded objects and nearly 20000 RGB-D videos across 46 common object categories. These videos are taken with diverse cluttered backgrounds with three setups to cover as many real-world scenarios as possible: (i) a single object in one video; (ii) multiple objects in one video; and (iii) an object with a static hand in one video. The dataset is annotated with object masks, real-world scale camera poses, and reconstructed aggregated point clouds from RGBD videos. We benchmark four tasks with WildRGB-D including novel view synthesis, camera pose estimation, object 6d pose estimation, and object surface reconstruction. Our experiments show that the large-scale capture of RGB-D objects provides a large potential to advance 3D object learning. Our project page is https://wildrgbd.github.io/.

Definition Search Book Streamline Icon: https://streamlinehq.com
References (87)
  1. Large-scale data for multiple-view stereopsis. International Journal of Computer Vision, 120:153–168, 2016.
  2. Objectron: A large scale dataset of object-centric videos in the wild with pose annotations. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 7822–7831, 2021.
  3. Mip-nerf: A multiscale representation for anti-aliasing neural radiance fields. In ICCV, 2021.
  4. Mip-nerf 360: Unbounded anti-aliased neural radiance fields. In CVPR, 2022.
  5. Shapenet: An information-rich 3d model repository. arXiv preprint arXiv:1512.03012, 2015.
  6. Mvsnerf: Fast generalizable radiance field reconstruction from multi-view stereo. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 14124–14133, 2021a.
  7. Tensorf: Tensorial radiance fields. In European Conference on Computer Vision, pages 333–350. Springer, 2022.
  8. Learning canonical shape space for category-level 6d object pose and size estimation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11973–11982, 2020a.
  9. Fs-net: Fast shape-based network for category-level 6d object pose estimation with decoupled rotation mechanism. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1581–1590, 2021b.
  10. Category level object pose estimation via neural analysis-by-synthesis. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXVI 16, pages 139–156. Springer, 2020b.
  11. Xmem: Long-term video object segmentation with an atkinson-shiffrin memory model. In European Conference on Computer Vision, pages 640–658. Springer, 2022.
  12. Segment and track anything, 2023.
  13. A large dataset of object scans. arXiv preprint arXiv:1602.02481, 2016.
  14. Abo: Dataset and benchmarks for real-world 3d object understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21126–21136, 2022.
  15. Improving neural implicit surfaces geometry with patch warping. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6260–6269, 2022.
  16. Objaverse: A universe of annotated 3d objects, 2022.
  17. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. Ieee, 2009.
  18. Google scanned objects: A high-quality dataset of 3d scanned household items. In 2022 International Conference on Robotics and Automation (ICRA), pages 2553–2560. IEEE, 2022.
  19. Simultaneous localization and mapping: part i. IEEE robotics & automation magazine, 13(2):99–110, 2006.
  20. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 24(6):381–395, 1981.
  21. Plenoxels: Radiance fields without neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5501–5510, 2022.
  22. 3d-future: 3d furniture shape with texture. International Journal of Computer Vision, 129:3313–3337, 2021.
  23. Category-level 6d object pose estimation in the wild: A semi-supervised learning approach and a new dataset, 2022.
  24. 6d object pose regression via supervised learning on point clouds. In 2020 IEEE International Conference on Robotics and Automation (ICRA), pages 3643–3649. IEEE, 2020.
  25. Get3d: A generative model of high quality 3d textured shapes learned from images. In Advances In Neural Information Processing Systems, 2022.
  26. Deep residual learning for image recognition, 2015.
  27. Towards self-supervised category-level object pose and size estimation. arXiv preprint arXiv:2203.02884, 2022.
  28. Unsupervised learning of 3d object categories from videos in the wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4700–4709, 2021.
  29. Few-view object reconstruction with unknown categories and camera poses. arXiv preprint arXiv:2212.04492, 2022.
  30. Housecat6d – a large-scale multi-modal category level 6d object pose dataset with household objects in realistic scenarios, 2023.
  31. Pointrend: Image segmentation as rendering. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9799–9808, 2020.
  32. Segment anything. arXiv preprint arXiv:2304.02643, 2023.
  33. Parsing ikea objects: Fine pose estimation. In Proceedings of the IEEE international conference on computer vision, pages 2992–2999, 2013.
  34. Relpose++: Recovering 6d poses from sparse-view observations. arXiv preprint arXiv:2305.04926, 2023.
  35. Dualposenet: Category-level 6d object pose and size estimation using dual pose network with refined learning of pose consistency. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3560–3569, 2021.
  36. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pages 740–755. Springer, 2014.
  37. Single-stage keypoint-based category-level object pose estimation from an rgb image. In 2022 International Conference on Robotics and Automation (ICRA), pages 1547–1553. IEEE, 2022a.
  38. Neurmips: Neural mixture of planar experts for view synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 15702–15712, 2022b.
  39. Zero-1-to-3: Zero-shot one image to 3d object, 2023a.
  40. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. arXiv preprint arXiv:2303.05499, 2023b.
  41. Stereobj-1m: Large-scale stereo image dataset for 6d object pose estimation, 2022a.
  42. Neural rays for occlusion-aware image-based rendering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7824–7833, 2022b.
  43. An iterative image registration technique with an application to stereo vision. In IJCAI’81: 7th international joint conference on Artificial intelligence, pages 674–679, 1981.
  44. Cps++: Improving class-level 6d pose and shape estimation from monocular images with self-supervised learning. arXiv preprint arXiv:2003.05848, 2020.
  45. Relative camera pose estimation using convolutional neural networks. In Advanced Concepts for Intelligent Vision Systems: 18th International Conference, ACIVS 2017, Antwerp, Belgium, September 18-21, 2017, Proceedings 18, pages 675–687. Springer, 2017.
  46. Nerf: Representing scenes as neural radiance fields for view synthesis. In ECCV, 2020.
  47. Nerf in the dark: High dynamic range view synthesis from noisy raw images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16190–16199, 2022.
  48. Instant neural graphics primitives with a multiresolution hash encoding. ACM TOG, 2022.
  49. Unisurf: Unifying neural implicit surfaces and radiance fields for multi-view reconstruction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5589–5599, 2021.
  50. Self-supervised category-level 6d object pose estimation with deep implicit shape representation. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 2082–2090, 2022.
  51. Pointnet: Deep learning on point sets for 3d classification and segmentation, 2017.
  52. Learning transferable visual models from natural language supervision, 2021.
  53. Common objects in 3d: Large-scale learning and evaluation of real-life 3d category reconstruction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10901–10911, 2021.
  54. The 8-point algorithm as an inductive bias for relative pose prediction by vits. In 2022 International Conference on 3D Vision (3DV), pages 1–11. IEEE, 2022.
  55. Structure-from-motion revisited. In CVPR, 2016.
  56. Structure-from-motion revisited. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4104–4113, 2016.
  57. Bad slam: Bundle adjusted direct rgb-d slam. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 134–144, 2019.
  58. Sparsepose: Sparse-view camera pose regression and refinement. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21349–21359, 2023.
  59. Direct voxel grid optimization: Super-fast convergence for radiance fields reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5459–5469, 2022.
  60. Droid-slam: Deep visual slam for monocular, stereo, and rgb-d cameras. Advances in neural information processing systems, 34:16558–16569, 2021.
  61. Shape prior deformation for categorical 6d object pose and size estimation. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXI 16, pages 530–546. Springer, 2020.
  62. Bundle adjustment—a modern synthesis. In Vision Algorithms: Theory and Practice: International Workshop on Vision Algorithms Corfu, Greece, September 21–22, 1999 Proceedings, pages 298–372. Springer, 2000.
  63. Learning category-specific deformable 3d models for object reconstruction. IEEE transactions on pattern analysis and machine intelligence, 39(4):719–731, 2016.
  64. Shinji Umeyama. Least-squares estimation of transformation parameters between two point patterns. IEEE Transactions on Pattern Analysis & Machine Intelligence, 13(04):376–380, 1991.
  65. Ref-nerf: Structured view-dependent appearance for neural radiance fields. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 5481–5490. IEEE, 2022.
  66. Normalized object coordinate space for category-level 6d object pose and size estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2642–2651, 2019.
  67. Neus: Learning neural implicit surfaces by volume rendering for multi-view reconstruction. arXiv preprint arXiv:2106.10689, 2021a.
  68. Phocal: A multi-modal dataset for category-level object pose estimation with photometrically challenging objects, 2022.
  69. Ibrnet: Learning multi-view image-based rendering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4690–4699, 2021b.
  70. Deepvo: Towards end-to-end visual odometry with deep recurrent convolutional neural networks. In 2017 IEEE international conference on robotics and automation (ICRA), pages 2043–2050. IEEE, 2017.
  71. Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing, 13(4):600–612, 2004.
  72. Voxurf: Voxel-based efficient and accurate neural surface reconstruction. arXiv preprint arXiv:2208.12697, 2022.
  73. Omniobject3d: Large-vocabulary 3d object dataset for realistic perception, reconstruction and generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 803–814, 2023.
  74. 3d shapenets: A deep representation for volumetric shapes. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1912–1920, 2015.
  75. Beyond pascal: A benchmark for 3d object detection in the wild. In IEEE winter conference on applications of computer vision, pages 75–82. IEEE, 2014.
  76. Track anything: Segment anything meets videos, 2023.
  77. D3vo: Deep depth, deep pose and deep uncertainty for monocular visual odometry. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1281–1292, 2020.
  78. Blendedmvs: A large-scale dataset for generalized multi-view stereo networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1790–1799, 2020.
  79. Volume rendering of neural implicit surfaces. Advances in Neural Information Processing Systems, 34:4805–4815, 2021.
  80. Cppf: Towards robust category-level 9d pose estimation in the wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6866–6875, 2022.
  81. pixelnerf: Neural radiance fields from one or few images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4578–4587, 2021.
  82. Mvimgnet: A large-scale dataset of multi-view images, 2023.
  83. Sdfstudio: A unified framework for surface reconstruction, 2022.
  84. Relpose: Predicting probabilistic relative rotation for single objects in the wild. In European Conference on Computer Vision, pages 592–611. Springer, 2022a.
  85. Self-supervised geometric correspondence for category-level 6d object pose estimation in the wild. arXiv preprint arXiv:2210.07199, 2022b.
  86. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 586–595, 2018.
  87. Open3D: A modern library for 3D data processing. arXiv:1801.09847, 2018.
Citations (6)

Summary

We haven't generated a summary for this paper yet.

Github Logo Streamline Icon: https://streamlinehq.com