Know3D: Prompting 3D Generation with Knowledge from Vision-Language Models
Abstract: Recent advances in 3D generation have improved the fidelity and geometric detail of synthesized 3D assets. However, because single-view observations are inherently ambiguous and limited 3D training data provides only weak global structural priors, the unseen regions generated by existing models are often stochastic and difficult to control, sometimes failing to align with user intent or producing implausible geometry. In this paper, we propose Know3D, a novel framework that incorporates rich knowledge from multimodal LLMs into the 3D generative process via latent hidden-state injection, enabling language-controllable generation of the back view of 3D assets. We utilize a VLM-diffusion-based model, where the VLM is responsible for semantic understanding and guidance, and the diffusion model acts as a bridge that transfers semantic knowledge from the VLM to the 3D generation model. In this way, we bridge the gap between abstract textual instructions and the geometric reconstruction of unobserved regions, transforming the traditionally stochastic back-view hallucination into a semantically controllable process and demonstrating a promising direction for future 3D generation models.
Explain it Like I'm 14
Overview: What this paper is about
This paper introduces Know3D, a new way to build 3D models from just one picture. The tricky part of making a 3D model from a single photo is that you can’t see the back or hidden sides. Most current tools “guess” those missing parts randomly, which can lead to weird shapes that don’t make sense. Know3D solves this by using knowledge from powerful vision-LLMs (smart systems that understand both images and text) so the hidden parts can be filled in logically—and even controlled by simple text instructions like “add a small balcony on the back.”
The main questions the researchers asked
- How can we make the unseen back side of a 3D object more realistic and less random when starting from just one photo?
- Can we let users control what goes on the hidden side using plain language (for example, “a window on the back wall”)?
- What is the best way to pass the “understanding” from a vision-LLM into a 3D generator so it uses both common sense and good geometry?
How they approached the problem
Turning one picture into a full 3D object
Think of this like building a 3D sculpture from a single front photo. Current systems are good at the front because it’s visible, but they often guess the back. Know3D connects two parts:
- A vision-LLM (VLM) that understands the photo and your text instructions.
- A 3D generation model that builds the actual 3D shape.
The goal is to pass useful “knowledge” from the smart interpreter (the VLM) into the 3D builder so the hidden sides make sense.
Teaching the model what “back view” means
The team fine-tuned an existing image-editing system (Qwen-Image-Edit) that can follow instructions and generate images. They taught it to:
- Take a front view photo of an object (like a cup or building),
- Read a text prompt describing what the back should have,
- Produce a believable back-view image that matches both the front and the text.
To do this, they created training data by rendering many 3D objects from the front and directly opposite (back) angles. They also added short, simple descriptions of the back parts (like “a door on the back wall”). During training, they mixed prompts that just say “show the back” with prompts that add extra details, so the system learned both general back views and controlled ones.
In simple terms: they taught the model exactly what “turn the object around 180 degrees” looks like, and how to add user-requested details on that hidden side.
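The training recipe above — pairing a front render with a back render and mixing generic "show the back" prompts with detail-carrying ones — can be sketched as follows. This is a minimal illustration, not the paper's actual pipeline: the function name, file-path scheme, and prompt templates are all hypothetical stand-ins for the real rendering and annotation steps.

```python
import random

def make_training_record(asset_id, back_caption, detail_prob=0.5, rng=None):
    """Build one (front image, prompt, back image) training record.

    Hypothetical sketch: the real pipeline renders each 3D asset from
    the front and from the directly opposite (back) camera; here the
    renders are stood in for by file paths.
    """
    rng = rng or random.Random(0)
    front = f"{asset_id}_front.png"   # rendered front view
    back = f"{asset_id}_back.png"     # rendered view, rotated 180 degrees
    # Mix generic prompts with detail-carrying ones so the model learns
    # both unconditional back views and text-controlled ones.
    if rng.random() < detail_prob and back_caption:
        prompt = f"Show the back view. {back_caption}"
    else:
        prompt = "Show the back view of this object."
    return {"input": front, "prompt": prompt, "target": back}

record = make_training_record("house_042", "a door on the back wall",
                              detail_prob=1.0)
print(record["prompt"])  # Show the back view. a door on the back wall
```

Varying `detail_prob` during training is one plausible way to get a single model that handles both plain and instruction-augmented back-view requests.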
Passing knowledge from the VLM to the 3D model
The big question: what’s the best kind of “signal” to send from the image-and-text model to the 3D generator?
They tried three options:
- VAE latents: compressed image features focused on pixels.
- DINOv3 features: image features extracted from a generated back-view image.
- Hidden “in-between” states from a diffusion model’s middle layers.
What’s a diffusion model? Imagine starting with TV static and gradually cleaning it up into a clear image, step by step. At different steps, the model holds “in-between” information. The authors found these middle-stage hidden states capture both structure (where parts should go) and meaning (what those parts are). These turned out to be the most reliable guide for 3D building.
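The idea of tapping "in-between" features can be sketched with a toy stack of transformer-like blocks: run the network once at the current denoising step and keep the hidden state after some intermediate layer. This is a conceptual sketch only — the block class, layer count, and tap point are invented for illustration and bear no relation to the real MMDiT architecture.

```python
import numpy as np

class ToyDiffusionBlock:
    """Stand-in for one transformer block of a diffusion model (hypothetical)."""
    def __init__(self, dim, seed):
        rng = np.random.default_rng(seed)
        self.w = rng.standard_normal((dim, dim)) / np.sqrt(dim)

    def __call__(self, x):
        # Residual update, loosely mimicking a transformer block.
        return x + np.tanh(x @ self.w)

def hidden_states_at(x_noisy, blocks, tap_layer):
    """Run the blocks once at the current denoising step and return the
    hidden state after `tap_layer` blocks — the 'in-between' features
    that carry both structure and meaning."""
    h = x_noisy
    taps = None
    for i, blk in enumerate(blocks):
        h = blk(h)
        if i + 1 == tap_layer:
            taps = h.copy()
    return taps

blocks = [ToyDiffusionBlock(dim=8, seed=i) for i in range(4)]
x = np.random.default_rng(1).standard_normal((3, 8))  # 3 latent tokens, dim 8
feats = hidden_states_at(x, blocks, tap_layer=2)
print(feats.shape)  # (3, 8)
```

The paper's finding is that features captured partway through denoising (around t = 0.25) work better as 3D guidance than either the final clean latents or features from a decoded image.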
They then injected these hidden states into a strong 3D generator (based on TRELLIS2) using an attention mechanism (you can think of this like giving the 3D model a focused set of hints about the layout and semantics of the back).
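The injection step can be sketched as a parallel cross-attention branch in plain NumPy: the 3D generator's tokens act as queries and the VLM-side hidden states act as keys and values. Per the glossary, the branch's output is scaled by a zero-initialized linear layer so it starts as a no-op; everything else here (shapes, weight setup, function name) is an illustrative assumption, not the paper's implementation.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_inject(x3d, hints, w_q, w_k, w_v, w_out):
    """Parallel cross-attention branch: 3D tokens (queries) attend to
    hidden-state 'hints' (keys/values). With w_out initialized to zero
    the branch contributes nothing at first, and its influence grows
    during training — the stabilization trick the paper describes."""
    q = x3d @ w_q
    k = hints @ w_k
    v = hints @ w_v
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return x3d + (attn @ v) @ w_out  # residual injection of the hints

rng = np.random.default_rng(0)
d = 8
x3d = rng.standard_normal((5, d))    # 5 tokens of the 3D generator
hints = rng.standard_normal((7, d))  # 7 hidden-state tokens from the VLM side
w_q, w_k, w_v = (rng.standard_normal((d, d)) for _ in range(3))
w_out = np.zeros((d, d))             # zero-init: branch starts inert
out = cross_attention_inject(x3d, hints, w_q, w_k, w_v, w_out)
print(np.allclose(out, x3d))  # True — at initialization the branch is a no-op
```

The residual form means the 3D generator's original behavior is preserved at the start of fine-tuning, and the hints are blended in only as `w_out` learns non-zero values.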
What they found and why it matters
- The back is no longer a random guess: Using knowledge from the vision-LLM, Know3D makes the hidden back sides more believable and consistent with the front.
- You can control the back with words: Users can type instructions like “a small balcony on the back” or “an outward-opening window,” and the system adjusts the 3D model accordingly.
- Better features = better 3D: The best results came from using the diffusion model’s middle-layer hidden states (not raw image features). These carried the right balance of “what it is” and “where it goes.”
- Competitive or better quality: On standard tests (ULIP and Uni3D), Know3D performed as well as or better than leading methods at matching the generated 3D shape to the input image’s meaning.
- More stable than a simple multi-view trick: Just feeding a generated back image into a multi-view 3D method didn’t work as well. Know3D’s knowledge-passing method produced cleaner, more plausible shapes from new angles.
Why this is important
- More control for creators: Game designers, animators, and hobbyists can quickly create full 3D objects from a single photo and steer the hidden parts with short text prompts.
- Fewer weird shapes: The system uses world knowledge and common sense learned from lots of image-text data, so it avoids impossible or silly geometries on the unseen sides.
- A new bridge between AI systems: The paper shows a practical way to combine the “understanding” of a vision-LLM with the “building skills” of a 3D generator—by passing the right kind of features at the right time.
Limitations and what’s next
- Still depends on understanding: If the vision-LLM misunderstands your instruction, the 3D shape can still be wrong.
- Could improve with stronger models: As multimodal models get better, the 3D results should improve too.
- Future work: Explore better ways to inject knowledge into the 3D model and train on more varied 3D data for even stronger, more reliable control.
In short, Know3D turns a hard guessing game (filling in the back of an object) into a guided process informed by language and vision. It makes 3D generation from a single image more realistic and responsive to what you ask for.
Knowledge Gaps
Below is a single, consolidated list of concrete knowledge gaps, limitations, and open questions left unresolved by the paper. Each point is phrased to be actionable for future research.
- Lack of in-the-wild validation: The approach is trained/evaluated primarily on TexVerse/HY3D-Bench assets; its robustness on real, cluttered, object-in-scene photographs with unknown intrinsics and complex backgrounds is untested.
- Domain shift to realistic assets: Generalization from stylized/CG meshes to photorealistic objects, materials, and complex textures is not established.
- Back-view quality metrics missing: No quantitative metric directly evaluates correctness/plausibility of synthesized back-side geometry relative to text prompts; reliance on ULIP/Uni3D overlooks back-view fidelity.
- Orientation correctness not measured: The fine-tuned VLM’s ability to consistently produce “true back views” (correct azimuth/elevation/roll) is not quantitatively validated.
- Physical/topological validity unassessed: No metrics or checks for watertightness, self-intersections, manifoldness, or structural stability of generated meshes.
- Controllability quantification absent: No measure of how faithfully back-side instructions are followed (e.g., edit success rate, degree of compliance, unintended changes elsewhere).
- Front-view preservation guarantees lacking: The method aims not to disturb visible geometry, but there is no quantitative analysis of how back-side prompts may inadvertently alter front-side shape/appearance.
- Conflict resolution under contradictory cues: Behavior when text prompts conflict with front-view evidence (e.g., impossible instructions) is not characterized; no strategy to arbitrate constraints.
- Limited scope of control: Control is demonstrated only for back-view components; extensions to other occluded regions (top/bottom/inside) and fine-grained spatial relations are unexplored.
- Pose/articulation stability: Robustness on articulated/humanoid subjects (pose preservation, hair/clothing dynamics) is not systematically evaluated.
- Scene-level extension: The method targets single objects; handling multi-object scenes, occlusions, and context-dependent back-side semantics is unaddressed.
- Material/texture control: The framework focuses on geometry; how to control or ensure plausible backside materials/PBR consistency is open.
- Dataset bias and annotation noise: The back-view text annotations are seeded by a VLM (Fig. 2), but annotation quality, bias, and error propagation to training are not analyzed.
- Data efficiency: It is unclear how performance scales with fewer front–back pairs and weaker textual supervision; methods for weak/unsupervised learning from internet images are not explored.
- Multilingual control: Generalization of language-controllable back-view generation beyond English prompts is untested.
- Robustness to ambiguous/adversarial prompts: No stress tests for vague, long, or adversarial instructions; failure modes and safety filters are not discussed.
- Hidden-state selection is heuristic: Using MMDiT hidden states at t=0.25 is empirically best, but a principled criterion or adaptive selection across layers/timesteps is missing.
- Portability to other bridges: It is unclear whether the hidden-state injection approach transfers to other multimodal diffusion transformers or VLM-diffusion models beyond Qwen-Image-Edit.
- End-to-end training: The 3D model consumes frozen VLM features; whether joint fine-tuning (without catastrophic forgetting) yields better alignment remains open.
- Error tolerance to flawed back-view synthesis: Although hidden states are more robust than decoded images, systematic evaluation under severely incorrect or hallucinated VLM outputs is absent.
- Efficiency/latency: Runtime, memory overhead, and scalability of running a large MMDiT at inference for hidden states (plus extra cross-attention in 3D) are not reported.
- Lightweight distillation: There is no attempt to distill the MMDiT hidden-state guidance into a compact encoder to avoid heavy VLM inference at deployment.
- User-controllable strength of guidance: The trade-off between semantic control strength (AFdit) and fidelity to front-view priors is not analyzed; no exposed knob for practitioners.
- Multi-turn/editing workflows: Interactive, iterative text edits to refine back-view content and their cumulative consistency are not explored.
- Multi-view generalization: Beyond back-view, using multiple synthesized novel views or viewpoint schedules to further constrain 3D reconstruction is not investigated.
- Handling camera intrinsics/extrinsics: Sensitivity to incorrect or unknown FoV/elevation at test time is unquantified; calibration-free robustness is open.
- Comprehensive evaluation suite: Human studies for plausibility, prompt satisfaction, and creative control are missing; benchmarks for controllable unseen-region synthesis are needed.
- Safety and bias: Potential propagation of VLM bias and inappropriate content into generated 3D assets is unaddressed; no guardrails are proposed.
- Category-wise robustness: Performance differences on symmetric vs. asymmetric objects, thin structures, transparent/reflective materials, and highly cluttered accessories are not dissected.
- Integration with physics: Incorporating physical priors (symmetry, support, stability) or simulation feedback to reduce implausible structures is left for future work.
- Representational breadth: Applicability to other 3D representations (meshes, Gaussians, point sets) and to texture/material generation is not demonstrated.
- Scaling laws: The relationship between MLLM/VLM capacity, training data scale/diversity, and controllable 3D quality is not studied.
- Reproducibility artifacts: Release status and reusability of the fine-tuned VLM, back-view annotations, and training code/pipelines are unclear.
Practical Applications
Below are actionable, real-world applications derived from Know3D’s findings and methods. Each item notes relevant sectors, what could be built, and key assumptions or dependencies that affect feasibility.
Immediate Applications
- Language-controllable single-image-to-3D asset completion; Sectors: gaming, film/VFX, AR/VR, digital content
- Use Know3D’s back-view control to turn a single concept/reference image into a complete 3D asset while specifying unseen details (“add a small balcony on the back wall”).
- Potential tools/workflows: Blender/Maya/Unreal/Unity plugins (“Back-View Composer”); a cloud API that accepts an image + prompt and returns mesh.
- Dependencies/assumptions: Non-metric geometry; works best on object categories close to TexVerse and HY3D-Bench distributions; requires access to a capable VLM-diffusion backbone (e.g., Qwen-Image-Edit) and GPU inference.
- Rapid variant generation of back-side designs for props and set dressing; Sectors: gaming, film/TV, advertising
- Generate multiple backside configurations from the same front image by changing text prompts (e.g., window/door/balcony variants).
- Tools/workflows: Shot planning tools; prop libraries with “backside-presets.”
- Dependencies/assumptions: Human review remains important; geometry plausibility, not CAD accuracy.
- E-commerce product 3D previews from limited imagery; Sectors: retail, marketing
- Build plausible 3D product models from a single hero image while merchants specify back features via templates (“zipper on back,” “logo placement”).
- Tools/workflows: “3D SKU Builder” integrated with PIM/CMS platforms; viewer for 360° spins.
- Dependencies/assumptions: Plausibility over precision; IP/licensing clearance for source images; may require category-specific prompt templates.
- Dataset augmentation for 3D model training; Sectors: AI/ML (academia/industry)
- Generate controlled back views and full assets from 2D datasets to reduce occlusion bias and diversify training corpora.
- Tools/workflows: “3D Data Augmentor” pipelines; automated prompt libraries for backside attributes.
- Dependencies/assumptions: Synthetic data risks domain shift; must tag provenance; ensure class-consistent prompts.
- Language-guided 3D completion in concept design; Sectors: industrial design, creative studios
- From a frontal sketch/render, produce a 3D proxy and iterate on unseen parts using natural language.
- Tools/workflows: CAD/Concept tools that import proxy meshes for over-modeling; interactive prompt sliders.
- Dependencies/assumptions: Proxy geometry for ideation, not manufacturing; scale and tolerance unspecified.
- Improved control and QA for single-image 3D pipelines; Sectors: 3D toolchains, studios
- Insert Know3D’s MMDiT hidden-state conditioning into existing generators (e.g., TRELLIS families) to reduce stochastic backside hallucinations and enforce semantic constraints.
- Tools/workflows: “Conformance Checker” that flags implausible backside structures against prompts; batch processing scripts.
- Dependencies/assumptions: Access to model internals for cross-attention injection; compute overhead for feature extraction.
- Educational tools for 3D/spatial reasoning; Sectors: education, edtech
- Demonstrate occlusion, symmetry, and structural inference by comparing uncontrolled vs prompt-controlled backside generation.
- Tools/workflows: Classroom apps that let students specify back features and see the 3D result.
- Dependencies/assumptions: Age-appropriate assets; simplified UIs; no safety-critical use.
- Robotics and simulation asset bootstrapping (non-safety-critical); Sectors: robotics (simulation), digital twins
- Generate plausible proxy meshes from single camera views for simulation worlds where exact geometry is not required (e.g., cluttered scene randomization).
- Tools/workflows: Sim asset ingestion scripts; language-controlled occluded features to vary difficulty.
- Dependencies/assumptions: Not suitable for precise grasp planning; physics/materials need separate assignment.
- Archiving and visualization of artifacts from photos (exploratory); Sectors: cultural heritage, museums
- Create plausible 3D stand-ins from limited-view photographs while curators specify hypothesized backside features.
- Tools/workflows: “Curator Assist” prompting presets with standardized qualifiers (“hypothesized,” “uncertain”).
- Dependencies/assumptions: Clear disclaimers; not a substitute for scientific reconstruction.
- Research kit for VLM-to-3D knowledge transfer; Sectors: academia, R&D
- Use the paper’s demonstrated advantage of intermediate MMDiT hidden states (e.g., at t≈0.25) to condition other 3D tasks (e.g., 3D-aware image editing, shape completion).
- Tools/workflows: Open-source modules for hidden-state extraction and parallel cross-attention injection.
- Dependencies/assumptions: Backbone licenses; reproducibility depends on availability of multimodal diffusion backbones.
Long-Term Applications
- Consumer-grade single-photo-to-AR object creation; Sectors: consumer software, AR platforms
- One-tap creation of AR-placable 3D objects from a phone snapshot with voice prompts for unseen sides.
- Tools/products: Mobile apps; ARKit/ARCore integrations; social sharing of 3D posts.
- Dependencies/assumptions: On-device or low-latency inference; robust handling of diverse categories and lighting; safety moderation.
- Photorealistic 3D shopping with accurate dimensions; Sectors: retail, logistics
- Combine language-controllable backside synthesis with metric constraints (size/specs) to achieve reliable virtual try-before-you-buy experiences.
- Tools/workflows: Merchant-supplied dimensions + prompts; automatic PBR material recovery.
- Dependencies/assumptions: New training with metric supervision; calibrated cameras or metadata; material estimation beyond current scope.
- Scene-level single-image-to-3D with controllable occluded regions; Sectors: real estate, AEC, gaming
- Extend Know3D from objects to rooms/facades, letting users specify what exists behind walls or out-of-view zones (e.g., “window behind this partition”).
- Tools/workflows: “Scene Reconstruction Assistant” that infers plausible layouts and generates 3D shells.
- Dependencies/assumptions: Requires scene-scale datasets and stronger priors; risk of hallucinating incorrect architecture without measurements.
- Embodied AI that uses language to reason about occluded parts; Sectors: robotics, autonomy
- Robots infer likely unseen geometry from a single view plus language cues to guide exploration and manipulation planning.
- Tools/workflows: 3D uncertainty estimation + safety filters; sensor fusion to validate hypotheses.
- Dependencies/assumptions: Safety-critical validation needed; domain adaptation and real-time performance; explicit uncertainty quantification.
- Industrial design to manufacturing bridge (CAD-constrained 3D completion); Sectors: manufacturing, product design
- Marry language-controlled completion with solid/parametric CAD constraints to produce manufacturing-ready geometry from conceptual photos.
- Tools/workflows: Constraint-aware generation; snapping to CAD primitives; tolerance enforcement.
- Dependencies/assumptions: New model families trained on CAD datasets; strong geometric guarantees.
- Cultural heritage restoration assistance using textual/historical priors; Sectors: heritage, academia
- Inject domain-specific textual knowledge (catalog entries, expert notes) to guide reconstruction of missing backs or interiors.
- Tools/workflows: Provenance-aware reconstructions with traceable textual sources and confidence maps.
- Dependencies/assumptions: Curated, expert-verified VLMs; rigorous uncertainty displays; ethical oversight.
- 3D data governance and policy frameworks for occlusion inference; Sectors: policy, standards, legal
- Standards for labeling AI-inferred geometry, documenting prompts, and watermarking generated 3D assets; guidance on IP/responsible use when reconstructing from images.
- Tools/workflows: Metadata schemas (e.g., “inferred_back_view=true”), 3D provenance tags, audit logs.
- Dependencies/assumptions: Industry and standards-body coordination; interoperable metadata across engines and marketplaces.
- Physical plausibility and affordance-aware 3D generation; Sectors: robotics, simulation, gaming
- Integrate physics constraints and affordance priors into language-guided backside synthesis to ensure stability and use-case realism.
- Tools/workflows: Differentiable physics feedback during generation; affordance classifiers as constraints.
- Dependencies/assumptions: New training loops with physics simulators; runtime cost; curriculum for diverse objects.
- Cross-domain extension to specialized objects (e.g., machinery, apparel); Sectors: fashion, manufacturing, maintenance
- Domain-tuned models that complete unseen components according to standards (e.g., garment seams, machine fasteners) from a single view and spec text.
- Tools/workflows: Domain prompt libraries; QA checklists; integration with PLM systems.
- Dependencies/assumptions: Domain datasets and taxonomies; higher penalties for errors; possible IP constraints.
- Human-in-the-loop co-creation platforms with uncertainty visualization; Sectors: creative tools, enterprise
- Interactive systems that show confidence heatmaps for unseen geometry, solicit text guidance, and refine 3D iteratively.
- Tools/workflows: UI overlays for confidence; prompt recommendations; reversible edits.
- Dependencies/assumptions: Calibrated uncertainty estimation; responsive runtimes; user experience design.
Notes on feasibility across applications:
- Core dependency: availability and licensing of high-capacity VLM-diffusion backbones and compute; the paper freezes Qwen-Image-Edit and fine-tunes with LoRA.
- Domain shift: performance drops when objects differ from training distributions; prompts and annotated data for backsides are needed for new domains.
- Accuracy vs plausibility: current outputs are plausible but not metrically guaranteed; critical in retail/manufacturing.
- Legal/ethical: reconstructing 3D from copyrighted images requires rights management and clear AI provenance labeling.
- Reliability: as acknowledged, failures in the multimodal model’s understanding propagate to 3D; stronger MLLMs and better injection strategies reduce but don’t eliminate this risk.
Glossary
- Ablation study: A controlled set of experiments comparing variations of a method to assess the impact of specific design choices. Example: "For the ablation study, we randomly selected a subset of 100 3D assets in the TexVerse [68] dataset that are not in our training data."
- Azimuth: The angle of rotation around the vertical axis used to specify camera/viewpoint positions around an object. Example: "using uniform azimuth sampling with random elevation."
- Chamfer Distance (CD): A geometric distance metric measuring how far two point sets are from each other, commonly used to evaluate 3D shape reconstruction quality. Example: "we evaluate the performance using IoU and Chamfer Distance (CD)."
- Conditional Flow Matching (CFM): A training objective for generative models that learns vector fields mapping noise to data distributions under conditioning signals. Example: "optimize the model using the Conditional Flow Matching (CFM) objective [34, 49]."
- Cross-attention: An attention mechanism that conditions one set of features on another by attending to key-value pairs from an external source. Example: "we design a parallel cross-attention branch for HDiT injection."
- Denoising timestep: The time parameter in diffusion/flow models indicating the stage of the iterative denoising trajectory from noise to data. Example: "at a specific denoising timestep t [1,19,46]"
- Diffusion Transformer (DiT): A transformer architecture tailored to diffusion-style denoising, operating over latent tokens for generative processes. Example: "integrates Qwen2.5-VL with a Diffusion Transformer (DiT)."
- DINOv3: A self-supervised vision model used to extract robust image features for downstream tasks. Example: "decoding this fully denoised VAE latent into an image and then extracting features via DINOv3 [41]"
- Elevation: The vertical angle of the camera relative to the object, controlling the viewpoint’s height. Example: "using uniform azimuth sampling with random elevation."
- Field of View (FoV): The angular extent of the observable scene captured by the camera, affecting perspective and scale. Example: "the field of view (FoV) is sampled from {35°, 50°, 85°, 105°, 135°}"
- Hidden states (MMDiT): Intermediate feature representations inside a diffusion transformer that encode evolving semantic and spatial information during denoising. Example: "directly using the hidden states from the intermediate layers of MMDiT during the denoising process."
- Intersection over Union (IoU): An overlap metric for comparing predicted and ground-truth regions/voxels, defined as the ratio of intersection to union. Example: "we evaluate the performance using IoU and Chamfer Distance (CD)."
- Low-Rank Adaptation (LoRA): A parameter-efficient fine-tuning method that injects low-rank trainable matrices into pre-trained model weights. Example: "using Low-Rank Adaptation (LoRA) [18]."
- MMDiT: A multimodal diffusion transformer backbone whose intermediate layers’ features can be exploited as structural-semantic priors. Example: "the hidden states of the intermediate layers of MMDiT inherently possess strong spatial awareness and rich semantic information [19,26]"
- Multimodal diffusion model: A diffusion-based generative model that operates over multiple modalities (e.g., text and images) for conditioning and generation. Example: "we leverage a multimodal diffusion model as an intermediate bridge"
- Multimodal LLMs (MLLMs): Large models trained on text and other modalities (e.g., images) to perform reasoning and generation across modalities. Example: "incorporate rich knowledge from Multimodal LLMs (MLLMs) into 3D generation processes."
- Score distillation: A technique that transfers the guidance of a pre-trained 2D diffusion model (its score function) into optimizing 3D representations. Example: "pioneers score distillation from pre-trained 2D diffusion models to optimize 3D assets"
- Sparse voxel: A compact 3D representation storing data only at occupied voxels, enabling efficient modeling of complex shapes. Example: "and the Sparse Voxel approach [17, 30, 38, 57, 59, 60, 64]"
- ULIP: A metric/model for evaluating cross-modal alignment between language, images, and 3D point clouds. Example: "we use ULIP [62] and Uni3D [72] to measure the semantic consistency"
- Uni3D: A unified 3D representation/metric used to assess semantic consistency between images and generated meshes. Example: "we use ULIP [62] and Uni3D [72] to measure the semantic consistency"
- Variational Autoencoder (VAE): A probabilistic generative model that encodes data into a latent space and decodes it back, often used to obtain image latents. Example: "via a VAE encoder [51]"
- VecSet (Vector Set): A 3D latent representation paradigm encoding shapes as sets of vectors, emphasizing global structure and compression. Example: "the Vector Set (VecSet) [20,23,25,27-29, 66, 67, 70] approach"
- Vision-LLM (VLM): A model jointly trained on images and text to capture semantic relationships across modalities for understanding and generation. Example: "the Vision LLM (VLM) is used to provide high-level semantic understanding,"
- Zero-initialized linear layer: A stabilization technique where the weights of a newly added linear layer are initialized to zero so its effect grows gradually during training. Example: "Its output is scaled by a zero-initialized linear layer for stable training."