When LLMs step into the 3D World: A Survey and Meta-Analysis of 3D Tasks via Multi-modal Large Language Models (2405.10255v1)
Abstract: As LLMs evolve, their integration with 3D spatial data (3D-LLMs) has seen rapid progress, offering unprecedented capabilities for understanding and interacting with physical spaces. This survey provides a comprehensive overview of the methodologies enabling LLMs to process, understand, and generate 3D data. Highlighting the unique advantages of LLMs, such as in-context learning, step-by-step reasoning, open-vocabulary capabilities, and extensive world knowledge, we underscore their potential to significantly advance spatial comprehension and interaction within embodied AI systems. Our investigation spans various 3D data representations, from point clouds to Neural Radiance Fields (NeRFs). It examines their integration with LLMs for tasks such as 3D scene understanding, captioning, question-answering, and dialogue, as well as LLM-based agents for spatial reasoning, planning, and navigation. The paper also includes a brief review of other methods that integrate 3D and language. The meta-analysis presented in this paper reveals significant progress yet underscores the necessity for novel approaches to harness the full potential of 3D-LLMs. Hence, with this paper, we aim to chart a course for future research that explores and expands the capabilities of 3D-LLMs in understanding and interacting with the complex 3D world. To support this survey, we have established a project page where papers related to our topic are organized and listed: https://github.com/ActiveVisionLab/Awesome-LLM-3D.
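To make the recipe described in the abstract concrete, below is a minimal, hedged sketch (not taken from the survey or any specific method it covers) of the common 3D-LLM pattern: encode a point cloud, project its features into the LLM's token-embedding space, and prepend them to the text prompt as "3D prefix tokens". The module names, dimensions, and toy PointNet-style encoder are illustrative assumptions only.

```python
# Illustrative sketch of a point-cloud-to-LLM prefix encoder (assumptions, not the survey's method).
import torch
import torch.nn as nn


class PointCloudPrefixEncoder(nn.Module):
    """Toy PointNet-style encoder plus a linear projector into an LLM's embedding space."""

    def __init__(self, llm_dim: int = 4096, num_prefix_tokens: int = 8):
        super().__init__()
        # Per-point feature extraction: (x, y, z) -> 256-d features.
        self.point_mlp = nn.Sequential(
            nn.Linear(3, 128), nn.ReLU(),
            nn.Linear(128, 256), nn.ReLU(),
        )
        self.num_prefix_tokens = num_prefix_tokens
        # Align the pooled 3D feature with K pseudo-tokens in the LLM embedding space.
        self.projector = nn.Linear(256, llm_dim * num_prefix_tokens)

    def forward(self, points: torch.Tensor) -> torch.Tensor:
        # points: (B, N, 3) raw point-cloud coordinates.
        feats = self.point_mlp(points)            # (B, N, 256)
        global_feat = feats.max(dim=1).values     # (B, 256) permutation-invariant pooling
        prefix = self.projector(global_feat)      # (B, llm_dim * K)
        return prefix.view(points.size(0), self.num_prefix_tokens, -1)  # (B, K, llm_dim)


if __name__ == "__main__":
    encoder = PointCloudPrefixEncoder()
    cloud = torch.randn(1, 2048, 3)               # one scene with 2048 points
    prefix_tokens = encoder(cloud)                # would be concatenated with embedded text tokens
    print(prefix_tokens.shape)                    # torch.Size([1, 8, 4096])
```

In practice, the projected tokens would be concatenated with the embedded text prompt before a frozen or fine-tuned LLM; the choice of 3D encoder, number of prefix tokens, and alignment objective varies widely across the methods the survey reviews.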
Authors: Xianzheng Ma, Yash Bhalgat, Brandon Smart, Shuai Chen, Xinghui Li, Jian Ding, Jindong Gu, Dave Zhenyu Chen, Songyou Peng, Jia-Wang Bian, Philip H. Torr, Marc Pollefeys, Matthias Nießner, Angel X. Chang, Iro Laina, Victor Adrian Prisacariu, Ian D. Reid