Connecting NeRFs, Images, and Text
Abstract: Neural Radiance Fields (NeRFs) have emerged as a standard framework for representing 3D scenes and objects, introducing a new data type for information exchange and storage. Concurrently, significant progress has been made in multimodal representation learning for text and image data. This paper explores a research direction that aims to connect the NeRF modality with other modalities, as has already been done for images and text. To this end, we propose a simple framework that combines pre-trained models for NeRF representations with multimodal models for text and image processing. Our framework learns a bidirectional mapping between NeRF embeddings and the embeddings obtained from the corresponding images and text. This mapping enables several useful applications, including zero-shot NeRF classification and NeRF retrieval from images or text.
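To make the two applications concrete, here is a minimal sketch of how zero-shot classification and retrieval could work once a NeRF embedding has been mapped into the text or image embedding space. It rests on several assumptions that the abstract does not state: the learned mapping and the pre-trained encoders are taken as given and are not shown, cosine similarity is assumed as the matching score, and the function names (`cosine_similarity`, `zero_shot_classify`, `retrieve`) are hypothetical.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def zero_shot_classify(
    mapped_nerf_embedding: Sequence[float],
    class_text_embeddings: Sequence[Sequence[float]],
) -> int:
    """Return the index of the class prompt closest to the mapped NeRF embedding.

    Assumption: `mapped_nerf_embedding` is the output of the learned
    NeRF-to-text mapping, so it lives in the same space as the text embeddings.
    """
    scores = [cosine_similarity(mapped_nerf_embedding, t) for t in class_text_embeddings]
    return max(range(len(scores)), key=scores.__getitem__)


def retrieve(
    query_embedding: Sequence[float],
    gallery_embeddings: Sequence[Sequence[float]],
    k: int,
) -> List[int]:
    """Return indices of the k gallery items most similar to the query.

    Assumption: the query (an image or text embedding mapped to NeRF space)
    and the gallery (NeRF embeddings) share one embedding space.
    """
    scores = [cosine_similarity(query_embedding, g) for g in gallery_embeddings]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


if __name__ == "__main__":
    # Toy 2-D embeddings, purely illustrative.
    class_prompts = [[1.0, 0.0], [0.0, 1.0]]
    print(zero_shot_classify([0.9, 0.1], class_prompts))
    print(retrieve([1.0, 0.2], [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]], k=2))
```

The same nearest-neighbour scoring serves both tasks. Classification compares one mapped NeRF embedding against a set of class prompts, while retrieval ranks a gallery of NeRF embeddings against a mapped image or text query.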