Rethinking Inductive Biases for Surface Normal Estimation (2403.00712v1)
Abstract: Despite the growing demand for accurate surface normal estimation models, existing methods use general-purpose dense prediction models, adopting the same inductive biases as other tasks. In this paper, we discuss the inductive biases needed for surface normal estimation and propose to (1) utilize the per-pixel ray direction and (2) encode the relationship between neighboring surface normals by learning their relative rotation. The proposed method can generate crisp, yet piecewise smooth, predictions for challenging in-the-wild images of arbitrary resolution and aspect ratio. Compared to a recent ViT-based state-of-the-art model, our method shows a stronger generalization ability, despite being trained on a dataset that is orders of magnitude smaller. The code is available at https://github.com/baegwangbin/DSINE.
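The two geometric quantities named in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation (see the linked repository for that). It assumes a pinhole camera with known intrinsics `fx`, `fy`, `cx`, `cy` and integer pixel coordinates, and the function names `pixel_ray_directions` and `rotate_about_axis` are hypothetical. The first function back-projects each pixel into a unit ray direction. The second applies Rodrigues' rotation formula, one standard way to express a relative rotation between two normals as an axis and an angle.

```python
import numpy as np


def pixel_ray_directions(height, width, fx, fy, cx, cy):
    """Return an (H, W, 3) array of unit ray directions for a pinhole camera.

    Assumption: pixel (row v, column u) sits at integer coordinates (u, v).
    """
    # u indexes columns, v indexes rows.
    u, v = np.meshgrid(np.arange(width, dtype=float),
                       np.arange(height, dtype=float))
    # Back-projection through the inverse intrinsics: K^{-1} [u, v, 1]^T.
    rays = np.stack([(u - cx) / fx, (v - cy) / fy, np.ones_like(u)], axis=-1)
    # Normalize so every ray has unit length.
    return rays / np.linalg.norm(rays, axis=-1, keepdims=True)


def rotate_about_axis(normal, axis, angle):
    """Rotate a 3-vector about an axis by `angle` radians (Rodrigues' formula)."""
    normal = np.asarray(normal, dtype=float)
    axis = np.asarray(axis, dtype=float)
    axis = axis / np.linalg.norm(axis)
    return (normal * np.cos(angle)
            + np.cross(axis, normal) * np.sin(angle)
            + axis * np.dot(axis, normal) * (1.0 - np.cos(angle)))


if __name__ == "__main__":
    # Hypothetical intrinsics for a 640x480 image.
    rays = pixel_ray_directions(480, 640, fx=500.0, fy=500.0, cx=320.0, cy=240.0)
    print(rays.shape)
    # Rotate a camera-facing normal by 90 degrees about the x-axis.
    print(rotate_about_axis([0.0, 0.0, 1.0], [1.0, 0.0, 0.0], np.pi / 2))
```

Because the ray directions depend only on the intrinsics and the pixel grid, they can be computed for any resolution or aspect ratio, which is consistent with the abstract's claim about arbitrary image sizes.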