
Lift3D: Zero-Shot Lifting of Any 2D Vision Model to 3D (2403.18922v1)

Published 27 Mar 2024 in cs.CV

Abstract: In recent years, there has been an explosion of 2D vision models for numerous tasks such as semantic segmentation, style transfer or scene editing, enabled by large-scale 2D image datasets. At the same time, there has been renewed interest in 3D scene representations such as neural radiance fields from multi-view images. However, the availability of 3D or multiview data is still substantially limited compared to 2D image datasets, making extending 2D vision models to 3D data highly desirable but also very challenging. Indeed, extending a single 2D vision operator like scene editing to 3D typically requires a highly creative method specialized to that task and often requires per-scene optimization. In this paper, we ask the question of whether any 2D vision model can be lifted to make 3D consistent predictions. We answer this question in the affirmative; our new Lift3D method trains to predict unseen views on feature spaces generated by a few visual models (i.e. DINO and CLIP), but then generalizes to novel vision operators and tasks, such as style transfer, super-resolution, open vocabulary segmentation and image colorization; for some of these tasks, there is no comparable previous 3D method. In many cases, we even outperform state-of-the-art methods specialized for the task in question. Moreover, Lift3D is a zero-shot method, in the sense that it requires no task-specific training, nor scene-specific optimization.

Authors (6)
  1. Mukund Varma T
  2. Peihao Wang
  3. Zhiwen Fan
  4. Zhangyang Wang
  5. Hao Su
  6. Ravi Ramamoorthi

Summary

Advanced 3D Predictions with Lift3D: Transforming 2D Vision Models to 3D

Introduction

Progress in 2D image understanding has been remarkable, driven by large-scale image datasets and advances in neural network architectures. This has enabled strong results on diverse tasks such as semantic segmentation, style transfer, and scene editing. Extending these advances to 3D understanding, however, has been held back by the scarcity of large, well-labeled multi-view image datasets. This limitation raises a central question: can 2D vision models be extended to interpret and manipulate 3D data consistently across multiple views?

In response, the paper introduces Lift3D, a framework that lifts any pre-trained 2D vision model into the 3D domain, enabling it to produce view-consistent predictions. Because the approach is both scene- and operator-agnostic, Lift3D adapts to new downstream tasks and scenes without additional training or per-scene optimization; the sketch below outlines this workflow. Its ability to resolve inconsistencies across multi-view predictions sets it apart, with notable contributions to open-vocabulary segmentation and text-driven scene editing.
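To make the workflow concrete, here is a minimal Python sketch of the zero-shot lifting loop. The names (`lift_2d_operator`, `encode`, `decode`, `render_features`) are hypothetical stand-ins for the operator's frozen encoder and decoder and for Lift3D's learned feature renderer; they do not reflect the authors' actual API.

```python
from typing import Callable, Sequence
import numpy as np

def lift_2d_operator(
    encode: Callable[[np.ndarray], np.ndarray],     # frozen 2D backbone (e.g. DINO, CLIP)
    decode: Callable[[np.ndarray], np.ndarray],     # the 2D operator's own task head
    render_features: Callable[..., np.ndarray],     # learned novel-view feature renderer
    source_views: Sequence[np.ndarray],
    source_poses: Sequence[np.ndarray],
    target_pose: np.ndarray,
) -> np.ndarray:
    # Encode every source view with the unmodified 2D operator,
    # render a feature map for the novel viewpoint, then decode it
    # with the operator's own head -- no task-specific training.
    feats = [encode(view) for view in source_views]
    novel_feat = render_features(feats, source_poses, target_pose)
    return decode(novel_feat)
```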

Method Overview

Lift3D operates on the intermediate feature maps produced by a 2D operator, refining and propagating them so that predictions remain smooth and consistent across views. Given multi-view images and their corresponding 2D predictions, its pipeline synthesizes feature maps for novel views. This process builds on image-based rendering and volume rendering, allowing Lift3D to interpolate novel views directly in the feature spaces of pre-trained 2D visual models.
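To illustrate the volume rendering step, the NumPy sketch below applies the standard NeRF-style compositing equation to feature vectors instead of RGB colors. The densities and per-sample features would come from the learned model; the variable names here are purely illustrative.

```python
import numpy as np

def render_feature_ray(sigmas: np.ndarray,   # (N,) volume density at each ray sample
                       feats: np.ndarray,    # (N, C) feature vector at each ray sample
                       deltas: np.ndarray) -> np.ndarray:  # (N,) spacing between samples
    """Alpha-composite per-sample features along one ray, exactly as
    volume rendering composites colors."""
    alphas = 1.0 - np.exp(-sigmas * deltas)                          # per-sample opacity
    trans = np.cumprod(np.concatenate(([1.0], 1.0 - alphas[:-1])))   # accumulated transmittance
    weights = alphas * trans                                         # (N,) rendering weights
    return (weights[:, None] * feats).sum(axis=0)                    # (C,) rendered feature
```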

The architecture follows existing novel-view-synthesis techniques that learn to aggregate pixels under epipolar constraints. By treating dense features as if they were colors, Lift3D interpolates features across multiple views and then applies the decoder of the pre-trained 2D model to the rendered feature map to recover the final prediction; a minimal aggregation sketch follows.
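Below is a minimal PyTorch sketch of the per-ray epipolar aggregation step. It assumes the projection of ray samples into each source view has already been computed, and `weights` stands in for learned attention scores, so this illustrates the mechanism rather than the authors' exact architecture.

```python
import torch
import torch.nn.functional as F

def aggregate_epipolar_features(src_feats: torch.Tensor,  # (V, C, H, W) per-view feature maps
                                src_xy: torch.Tensor,     # (V, S, 2) projected coords in [-1, 1]
                                weights: torch.Tensor     # (V, S) per-view weights, sum to 1 over V
                                ) -> torch.Tensor:
    """Bilinearly sample each source view's feature map where the ray
    samples project onto it, then blend the samples across views."""
    V = src_feats.shape[0]
    grid = src_xy.view(V, -1, 1, 2)                                   # (V, S, 1, 2)
    sampled = F.grid_sample(src_feats, grid,
                            mode="bilinear", align_corners=True)      # (V, C, S, 1)
    sampled = sampled.squeeze(-1).permute(0, 2, 1)                    # (V, S, C)
    return (weights.unsqueeze(-1) * sampled).sum(dim=0)               # (S, C) blended features
```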

Experiments and Results

Lift3D was evaluated on a range of 3D vision tasks, including semantic segmentation, style transfer, and scene editing, producing results comparable to, and at times surpassing, methods specialized for each task. Notably, Lift3D is zero-shot: it requires no scene-specific or operator-specific training, which highlights its potential for applying 2D vision models to 3D problems.

The experiments support the method's central premise, showing that it generalizes across different feature backbones and tasks. Its competitive performance against state-of-the-art semantic segmentation methods, and its pioneering 3D extensions of 2D operations such as image colorization and open-vocabulary segmentation, underscore its practical significance.

Conclusions and Implications

Lift3D offers a general approach to bridging 2D and 3D vision models. Its ability to produce view-consistent 3D predictions from any 2D vision model, without additional training or per-scene optimization, marks a significant step for 3D scene understanding.

The implications are substantial: Lift3D offers a scalable answer to data scarcity in 3D vision. By making it feasible to extend the capabilities of 2D models to 3D contexts, it opens new avenues for research and applications in autonomous driving, robotics, and beyond.

Future research might extend Lift3D to more complex 3D representations and interactions, moving toward more comprehensive 3D understanding systems. Frameworks like Lift3D are likely to play a central role in the ongoing convergence of 2D and 3D vision technologies.