- The paper introduces an unsupervised method for implicit 3D shape learning using topologically-aware deformation fields.
- It achieves dense correspondence modeling across instances, facilitating high-fidelity texture transfer and detailed reconstruction.
- The approach demonstrates state-of-the-art results on benchmarks like ShapeNet and Pix3D, highlighting robust performance on complex topologies.
Overview of Topologically-Aware Deformation Fields for Single-View 3D Reconstruction
This paper introduces TARS, a framework for single-view 3D reconstruction and dense correspondence modeling that learns 3D geometry from unaligned image collections. The approach uses topologically-aware deformation fields to recover per-instance 3D shape and category-level correspondences without any 3D supervision.
Core Contributions
The paper's key contributions are:
- Implicit 3D Shape Learning: A method that implicitly learns 3D object shapes as unsupervised deformations of a category-level signed distance field (SDF), trained only on unaligned image collections with known camera poses.
- Topologically-Aware Deformation Fields: A novel deformation field that captures both geometric and topological variation within an object category, enabling reconstruction of objects whose topology varies across instances (e.g., chairs with and without holes between parts), a case that traditional mesh-deformation methods handle poorly.
- Dense Correspondence Modeling: The deformation field induces dense 3D correspondences across instances of a category, enabling detailed texture transfer between objects by exploiting the category's shared structure.
- Single-Image 3D Reconstruction Framework: Given a single input image, the method deforms points from the instance's 3D space into a higher-dimensional canonical space and queries a category-level signed distance field there, yielding an implicit surface for the instance.
- End-to-End Differentiable Renderer: A learned recurrent ray marcher in the style of Scene Representation Networks (SRN) renders the implicit surface differentiably, allowing the whole pipeline to be trained from image reconstruction losses.
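As a toy illustration of the canonical-space idea above (not the authors' implementation), the sketch below deforms a 3D query point into a 4D canonical space, where the extra coordinate lets the level set change topology across instances, and then evaluates a stand-in canonical SDF. The learned networks are replaced by simple closed-form functions, and all names are hypothetical:

```python
import numpy as np

def canonical_sdf(p):
    """Stand-in canonical SDF: signed distance to a unit sphere in R^4.
    (TARS learns this category-level template; a sphere is a toy proxy.)"""
    return np.linalg.norm(p) - 1.0

def deformation_field(x, latent):
    """Toy deformation: map a 3D instance-space point to a 4D canonical
    point. The extra coordinate is what allows the zero level set to
    change topology across instances; here it depends on the latent code."""
    offset = 0.1 * latent[:3]            # instance-specific warp (hypothetical)
    topo = np.array([latent[3] * x[0]])  # extra "topology" coordinate
    return np.concatenate([x + offset, topo])

def instance_sdf(x, latent):
    """SDF of a specific instance: deform into canonical space, then query."""
    return canonical_sdf(deformation_field(x, latent))

latent = np.array([0.2, -0.1, 0.05, 0.3])
print(instance_sdf(np.array([0.0, 0.0, 0.0]), latent))  # negative: inside
print(instance_sdf(np.array([2.0, 0.0, 0.0]), latent))  # positive: outside
```

Because the same canonical space is shared across the category, two instance points that deform to the same canonical point are in correspondence, which is what enables texture transfer.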
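TARS renders its implicit surface with a learned LSTM-based ray marcher. As a simplified, non-learned analogue of that loop, classical sphere tracing steps along each ray by the current SDF value until it reaches the surface:

```python
import numpy as np

def sphere_trace(sdf, origin, direction, max_steps=64, eps=1e-4):
    """Classical sphere tracing: advance along the ray by the SDF value.
    TARS replaces this fixed step rule with a learned recurrent update
    (SRN-style); this version only illustrates the rendering loop."""
    t = 0.0
    for _ in range(max_steps):
        p = origin + t * direction
        d = sdf(p)
        if d < eps:        # close enough to the zero level set: surface hit
            return t, p
        t += d             # safe step: no surface closer than distance d
    return None, None      # ray missed the surface within max_steps

# Unit-sphere SDF as a stand-in for the learned instance SDF.
unit_sphere = lambda p: np.linalg.norm(p) - 1.0
t, hit = sphere_trace(unit_sphere,
                      np.array([0.0, 0.0, -3.0]),   # camera origin
                      np.array([0.0, 0.0, 1.0]))    # viewing direction
print(t)  # ray from z=-3 toward the unit sphere hits at t = 2.0
```

Because every step is a differentiable function of the SDF, gradients from an image reconstruction loss can flow back through the marcher into the shape representation.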
Experimentation and Results
The authors evaluate TARS on several datasets: ShapeNet, Pascal3D+, CUB-200-2011, and Pix3D chairs. Across these benchmarks, TARS achieves state-of-the-art reconstruction fidelity compared to baselines such as SDF-SRN and SoftRas, preserving geometric detail and structural correctness. In particular, its topologically-aware deformation fields let it handle topological variation that mesh-based baselines cannot.
Implications and Future Directions
Theoretically, TARS shows that category-level semantic structure can be modeled without explicit 3D supervision, suggesting that models can learn to generalize 3D understanding from 2D data alone, a key capability for self-supervised 3D vision.
Practically, TARS enables applications in computer graphics and vision that require high-fidelity 3D reconstruction from minimal input views, such as virtual reality, simulation, and augmented reality systems.
Future work could explore:
- Reducing the dependency on known camera poses during training,
- Integration with multi-object category learning systems,
- Enhancements using adversarial learning techniques for complex texture synthesis.
Conclusion
The TARS framework advances single-view 3D reconstruction by combining an implicit deformation model with differentiable rendering. This work contributes a meaningful step toward detailed, realistic 3D reconstruction from sparse visual data, and toward modeling dense correspondences without 3D supervision.