
Rethinking Inductive Biases for Surface Normal Estimation (2403.00712v1)

Published 1 Mar 2024 in cs.CV

Abstract: Despite the growing demand for accurate surface normal estimation models, existing methods use general-purpose dense prediction models, adopting the same inductive biases as other tasks. In this paper, we discuss the inductive biases needed for surface normal estimation and propose to (1) utilize the per-pixel ray direction and (2) encode the relationship between neighboring surface normals by learning their relative rotation. The proposed method can generate crisp - yet, piecewise smooth - predictions for challenging in-the-wild images of arbitrary resolution and aspect ratio. Compared to a recent ViT-based state-of-the-art model, our method shows a stronger generalization ability, despite being trained on an orders of magnitude smaller dataset. The code is available at https://github.com/baegwangbin/DSINE.


Summary

  • The paper introduces pixel-wise ray direction inputs and rotational constraints to improve the architectural design for surface normal estimation.
  • The method enforces piecewise smoothness while preserving sharp boundaries, yielding robust performance in unconstrained, in-the-wild settings.
  • Empirical results show the approach outperforms a recent ViT-based state-of-the-art model despite being trained on an orders-of-magnitude smaller dataset, demonstrating efficiency and strong generalization.

Enhancing Surface Normal Estimation via Inductive Biases and Rotational Constraints

Introduction

Surface normal estimation is a key task in computer vision, underpinning applications from 3D reconstruction to robotic manipulation. Despite its importance, the task has traditionally been approached with models carrying general-purpose inductive biases, which, as this paper argues, can limit performance and generalization, especially in unconstrained, in-the-wild scenarios. The paper rethinks the inductive biases needed for accurate surface normal estimation, proposing a method that incorporates the per-pixel ray direction as input and models the relative rotation between neighboring pixels' normals. These architectural choices enable detailed, crisp yet piecewise-smooth surface normal predictions for images of arbitrary resolution and aspect ratio.

The landscape of surface normal estimation has been shaped significantly by deep learning, with early efforts relying on handcrafted features and discretized output spaces. The field has since moved toward convolutional neural networks (CNNs) and, more recently, transformer models, capitalizing on their capacity for modeling complex spatial hierarchies and relationships. However, state-of-the-art approaches often borrow inductive biases from related tasks such as depth estimation and semantic segmentation, a practice that, while beneficial in some contexts, may not align with the unique characteristics and requirements of surface normal estimation.

Inductive Biases for Surface Normal Estimation

The essence of this work is the identification and integration of task-specific inductive biases into a deep learning framework for improved surface normal estimation. Key to this approach are two architectural novelties:

  1. Encoding per-pixel ray direction as input to the network facilitates camera intrinsics-aware inference, enhancing the model's ability to generalize across varying camera configurations and viewing conditions.
  2. A rotation estimation component that models the relative rotation between neighboring pixels' normals using an axis-angle representation. This enables the model to generate predictions that are smooth within surfaces yet sharply delineated at their boundaries.
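The first bias, camera-intrinsics-aware input, amounts to back-projecting each pixel through the intrinsic matrix to obtain a unit viewing ray. A minimal NumPy sketch (not the paper's PyTorch implementation; the function name and pixel-center convention are illustrative assumptions):

```python
import numpy as np

def pixel_ray_directions(K, H, W):
    """Unit ray direction for every pixel of an H x W image, given intrinsics K (3x3)."""
    # Pixel centers in homogeneous image coordinates.
    u, v = np.meshgrid(np.arange(W) + 0.5, np.arange(H) + 0.5)
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)        # (H, W, 3)
    # Back-project to camera space: d ~ K^{-1} [u, v, 1]^T.
    rays = pix @ np.linalg.inv(K).T
    # Normalize so each ray is a unit direction vector.
    return rays / np.linalg.norm(rays, axis=-1, keepdims=True)

K = np.array([[100.0, 0.0, 2.0],
              [0.0, 100.0, 2.0],
              [0.0, 0.0, 1.0]])
rays = pixel_ray_directions(K, 4, 4)
```

In practice such a ray map would be concatenated with the image features, letting the network reason about viewing direction at every pixel regardless of the camera used.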

Methodology and Approach

The proposed method combines convolutional layers for an initial prediction with recurrent units for iterative refinement. Feeding the per-pixel ray direction into the network directly addresses the variability of camera intrinsics across images. Furthermore, the use of rotational constraints between pixels offers a structured way to enforce piecewise smoothness in the estimated normals, an attribute often desired but hard to ensure in practice.
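The rotational constraint can be made concrete with Rodrigues' formula: given two unit normals, the axis-angle rotation relating them is recoverable, and conversely a predicted axis-angle can propagate one normal to its neighbor. The sketch below illustrates this relationship in NumPy; it is a toy demonstration of the geometry, not the paper's learned formulation, and the function names are assumptions:

```python
import numpy as np

def rotate_normal(n, axis, angle):
    """Rotate unit vector n about unit axis by angle, via Rodrigues' formula."""
    n, axis = np.asarray(n, float), np.asarray(axis, float)
    return (n * np.cos(angle)
            + np.cross(axis, n) * np.sin(angle)
            + axis * np.dot(axis, n) * (1.0 - np.cos(angle)))

def relative_axis_angle(n1, n2):
    """Axis-angle rotation taking unit normal n1 to neighboring unit normal n2."""
    axis = np.cross(n1, n2)
    s = np.linalg.norm(axis)
    if s < 1e-8:
        # Parallel normals (e.g. a flat surface): the rotation is zero.
        return np.array([0.0, 0.0, 1.0]), 0.0
    angle = np.arctan2(s, np.dot(n1, n2))
    return axis / s, angle
```

A zero angle corresponds to a locally flat surface (piecewise smoothness), while a large angle marks a crease or occlusion boundary, which is why parameterizing neighboring normals by their relative rotation lets a network be smooth and sharp in the right places.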

Empirically, the approach is validated against a recent state-of-the-art model based on the Vision Transformer (ViT) and is found to deliver superior generalization. This is particularly evident in challenging in-the-wild scenarios, where the method predicts highly detailed and accurate surface normals. Notably, it achieves these results despite being trained on a significantly smaller dataset, highlighting its efficiency and robustness.

Implications and Future Directions

The research presents a compelling case for a more nuanced consideration of inductive biases in the design of models for surface normal estimation. By aligning the architectural features closely with the task-specific demands, it demonstrates the possibility of achieving high generalization capability and robust performance across diverse imaging conditions. Looking ahead, this work opens several avenues for exploration, including the potential for camera calibration using the model itself and extending the approach to other vision tasks where geometric understanding is critical.

Conclusion

In summary, this paper marks a significant step forward in the quest for accurate and robust surface normal estimation, offering a methodology that is both practically effective and theoretically grounded. The proposed approach underscores the importance of task-specific inductive biases and opens up new possibilities for advancing state-of-the-art surface normal estimation and its applications in computer vision and beyond.
