View Selection for 3D Captioning via Diffusion Ranking (2404.07984v1)
Abstract: Scalable annotation approaches are crucial for constructing extensive 3D-text datasets, facilitating a broader range of applications. However, existing methods sometimes produce hallucinated captions, compromising caption quality. This paper explores the issue of hallucination in 3D object captioning, with a focus on the Cap3D method, which renders 3D objects into 2D views for captioning with pre-trained models. We pinpoint a major challenge: certain rendered views of 3D objects are atypical, deviating from the training data of standard image captioning models and causing hallucinations. To tackle this, we present DiffuRank, a method that leverages a pre-trained text-to-3D model to assess the alignment between 3D objects and their 2D rendered views, where views with high alignment closely represent the object's characteristics. By ranking all rendered views and feeding the top-ranked ones into GPT4-Vision, we enhance the accuracy and detail of captions, enabling the correction of 200k captions in the Cap3D dataset and extending it to 1 million captions across the Objaverse and Objaverse-XL datasets. Additionally, we showcase the adaptability of DiffuRank by applying it to pre-trained text-to-image models for a Visual Question Answering task, where it outperforms the CLIP model.
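The abstract compresses the ranking step into one sentence, so it is worth spelling out the loop. One plausible reading, consistent with diffusion-classifier-style scoring: caption each rendered view, then score that caption by how well it conditions the text-to-3D diffusion model to denoise the object's latent, averaging over random timesteps and noise draws; views whose captions give lower loss are better aligned with the object. The sketch below illustrates this reading; `caption_view`, `denoise_loss`, and the sampling details are hypothetical placeholders for a pre-trained image captioner and the conditional denoising loss of a text-to-3D model such as Shap-E, not the paper's exact recipe.

```python
import torch

def diffurank_select_views(object_latent, views, caption_view,
                           denoise_loss, num_samples=8, top_k=6):
    """Rank rendered views of one 3D object by caption/object alignment.

    object_latent : latent encoding of the 3D object (e.g. a Shap-E latent)
    views         : list of 2D renderings of the object
    caption_view  : fn(image) -> str, a pre-trained captioner (hypothetical)
    denoise_loss  : fn(latent, text, t, noise) -> scalar conditional
                    diffusion loss of the text-to-3D model (hypothetical)
    """
    scores = []
    for view in views:
        caption = caption_view(view)
        # Average the conditional denoising error over several random
        # timestep/noise draws to reduce the variance of the score.
        losses = []
        for _ in range(num_samples):
            t = torch.randint(1, 1000, (1,))
            noise = torch.randn_like(object_latent)
            losses.append(denoise_loss(object_latent, caption, t, noise))
        scores.append(torch.stack(losses).mean())
    # A lower average loss means the view's caption explains the object
    # better, so sort ascending and keep the top-k view indices.
    order = torch.argsort(torch.stack(scores))
    return order[:top_k].tolist()
```

The top-k views selected this way would then be passed to GPT4-Vision for the final caption. The same scoring rule carries over to the VQA experiment mentioned in the abstract: swap the text-to-3D model for a text-to-image diffusion model and rank candidate answers by the same conditional denoising loss, the mechanism by which diffusion models have been used as zero-shot classifiers.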
- Scalable 3d captioning with pretrained models. arXiv preprint arXiv:2306.07279, 2023a.
- Photorealistic text-to-image diffusion models with deep language understanding. 2022.
- Improving image generation with better captions. Computer Science. https://cdn.openai.com/papers/dall-e-3.pdf, 2(3):8, 2023.
- Video generation models as world simulators. 2024. URL https://openai.com/research/video-generation-models-as-world-simulators.
- Objaverse: A universe of annotated 3d objects. 2023a.
- Mosaic-sdf for 3d generative models. arXiv preprint arXiv:2312.09222, 2023.
- Instant3d: Fast text-to-3d with sparse-view generation and large reconstruction model. arXiv preprint arXiv:2311.06214, 2023a.
- Dmv3d: Denoising multi-view diffusion using 3d large reconstruction model. arXiv preprint arXiv:2311.09217, 2023a.
- Efficientdreamer: High-fidelity and robust 3d creation via orthogonal-view diffusion prior. arXiv preprint arXiv:2308.13223, 2023a.
- Robogen: Towards unleashing infinite data for automated robot learning via generative simulation. arXiv preprint arXiv:2311.01455, 2023a.
- Shapellm: Universal 3d object understanding for embodied interaction. arXiv preprint arXiv:2402.17766, 2024.
- Pointllm: Empowering large language models to understand point clouds. arXiv preprint arXiv:2308.16911, 2023b.
- Regionblip: A unified multi-modal pre-training framework for holistic and regional comprehension. arXiv preprint arXiv:2308.02299, 2023.
- X-instructblip: A framework for aligning x-modal instruction-aware representations to llms and emergent cross-modal reasoning. arXiv preprint arXiv:2311.18799, 2023.
- Volumediffusion: Flexible text-to-3d generation with efficient volumetric encoder. arXiv preprint arXiv:2312.11459, 2023a.
- Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv, 2023b.
- Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
- Shap-e: Generating conditional 3d implicit functions. arXiv preprint arXiv:2305.02463, 2023.
- Eyes wide shut? exploring the visual shortcomings of multimodal llms. arXiv preprint arXiv:2401.06209, 2024.
- High-resolution image synthesis with latent diffusion models. In CVPR, 2022.
- Learning transferable visual models from natural language supervision. In ICML, 2021.
- Objaverse-xl: A universe of 10m+ 3d objects. arXiv preprint arXiv:2307.05663, 2023b.
- T3Bench: Benchmarking current progress in text-to-3d generation. arXiv preprint arXiv:2310.02977, 2023.
- Sweetdreamer: Aligning geometric priors in 2d diffusion for consistent text-to-3d. arXiv preprint arXiv:2310.02596, 2023c.
- Hexagen3d: Stablediffusion is just one step away from fast and diverse text-to-3d generation. arXiv preprint arXiv:2401.07727, 2024.
- Gpt4point: A unified framework for point-language understanding and generation. arXiv preprint arXiv:2312.02980, 2023.
- Uni3d-llm: Unifying point cloud perception, generation and editing with large language models. arXiv preprint arXiv:2402.03327, 2024a.
- Ll3da: Visual interactive instruction tuning for omni-3d understanding, reasoning, and planning. arXiv preprint arXiv:2311.18651, 2023a.
- Onellm: One framework to align all modalities with language. arXiv preprint arXiv:2312.03700, 2023.
- Model composition for multimodal large language models. arXiv preprint arXiv:2402.12750, 2024.
- Pi3d: Efficient text-to-3d generation with pseudo-image diffusion. arXiv preprint arXiv:2312.09069, 2023a.
- Evaluating vlms for score-based, multi-probe annotation of 3d objects. arXiv preprint arXiv:2311.17851, 2023.
- Zero-shot text-guided object generation with dream fields. arXiv preprint arXiv:2112.01455, 2021.
- Dreamfusion: Text-to-3d using 2d diffusion. arXiv, 2022.
- Magic3d: High-resolution text-to-3d content creation. arXiv preprint arXiv:2211.10440, 2022.
- Clip-forge: Towards zero-shot text-to-shape generation. In CVPR, 2022.
- Hifa: High-fidelity text-to-3d with advanced diffusion guidance. arXiv preprint arXiv:2305.18766, 2023.
- Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In CVPR, 2023b.
- Fantasia3d: Disentangling geometry and appearance for high-quality text-to-3d content creation. arXiv preprint arXiv:2303.13873, 2023b.
- Att3d: Amortized text-to-3d object synthesis. arXiv preprint arXiv:2306.07349, 2023.
- Gaussiandreamer: Fast generation from text to 3d gaussian splatting with point cloud priors. arXiv preprint arXiv:2310.08529, 2023.
- Neural shape compiler: A unified framework for transforming between text, point cloud, and program. Transactions on Machine Learning Research, 2023b. ISSN 2835-8856. URL https://openreview.net/forum?id=gR9UVgH8PZ.
- Text-to-3d generation with bidirectional diffusion using both 2d and 3d priors. arXiv preprint arXiv:2312.04963, 2023.
- Control3d: Towards controllable text-to-3d generation. In ACM Multimedia, 2023c.
- Text2mesh: Text-driven neural stylization for meshes. In CVPR, 2022.
- Taps3d: Text-guided 3d textured shape generation from pseudo supervision. In CVPR, 2023.
- Text2tex: Text-driven texture synthesis via diffusion models. arXiv, 2023d.
- Point-e: A system for generating 3d point clouds from complex prompts. arXiv, 2022.
- Zero-1-to-3: Zero-shot one image to 3d object. arXiv, 2023b.
- One-2-3-45: Any single image to 3d mesh in 45 seconds without per-shape optimization. arXiv preprint arXiv:2306.16928, 2023c.
- Realfusion: 360° reconstruction of any object from a single image. In CVPR, 2023.
- Make-it-3d: High-fidelity 3d creation from a single image with diffusion prior. arXiv preprint arXiv:2303.14184, 2023b.
- Mvdream: Multi-view diffusion for 3d generation. arXiv preprint arXiv:2308.16512, 2023.
- Cascade-zero123: One image to highly consistent 3d with self-prompted nearby views. arXiv preprint arXiv:2312.04424, 2023e.
- Deep unsupervised learning using nonequilibrium thermodynamics. In ICML, 2015.
- Generative modeling by estimating gradients of the data distribution. NeurIPS, 2019.
- Denoising diffusion probabilistic models. NeurIPS, 33, 2020.
- Scalable diffusion models with transformers. In ICCV, 2023.
- Diffusion models beat gans on image classification. arXiv preprint arXiv:2307.08702, 2023.
- Your diffusion model is secretly a zero-shot classifier. arXiv preprint arXiv:2303.16203, 2023d.
- Unleashing text-to-image diffusion models for visual perception. arXiv preprint arXiv:2303.02153, 2023b.
- Vgdiffzero: Text-to-image diffusion models can be zero-shot visual grounders. arXiv preprint arXiv:2309.01141, 2023d.
- Monocular depth estimation using diffusion models. arXiv preprint arXiv:2302.14816, 2023.
- A tale of two features: Stable diffusion complements dino for zero-shot semantic correspondence. NeurIPS, 36, 2024.
- Generative models: What do they know? do they know things? let’s find out! arXiv preprint arXiv:2311.17137, 2023.
- Natural language processing with Python: analyzing text with the natural language toolkit. O'Reilly Media, Inc., 2009.
- Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.
- Visual instruction tuning. NeurIPS, 36, 2024b.
- Instructblip: Towards general-purpose vision-language models with instruction tuning, 2023.
- Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023.
Authors: Tiange Luo, Justin Johnson, Honglak Lee