
LP3: Language-Prompted Planar Priors

Updated 29 October 2025
  • The paper’s main contribution is the integration of language-prompted semantic segmentation and cross-view geometric fusion to robustly extract and refine planar structures in indoor scenes.
  • It introduces a pipeline that uses explicit planar supervision within the 3D Gaussian Splatting framework, yielding significant improvements in metrics such as Chamfer Distance, F1 score, and Normal Consistency.
  • LP3 overcomes photometric-only limitations by leveraging robust semantic prompts and dense depth priors, paving the way for enhanced accuracy in low-texture and cluttered environments.

Language-Prompted Planar Priors (LP3) constitute a vision-language-guided methodology for extracting and leveraging planar structure in indoor 3D reconstruction, specifically within the PlanarGS framework for 3D Gaussian Splatting (3DGS). LP3 integrates semantic segmentation via large foundation models, cross-view geometric fusion, and explicit supervision into the 3DGS optimization. This approach addresses core limitations of photometric-only optimization, which struggles to recover correct geometry in large, low-texture regions.

1. Conceptual Overview

LP3 introduces explicit priors for planar regions based on semantic cues derived from vision-language segmentation models, enabling robust detection and refinement of surfaces such as walls, floors, and doors. The method leverages semantic prompts (e.g., "wall", "floor") processed by a pretrained vision-language model to segment planar regions across multi-view imagery. These initial priors are then geometrically verified and fused across views, ensuring physical and semantic consistency. The extracted planar and geometric priors are utilized via dedicated supervision terms during 3DGS optimization, resulting in reconstructions that maintain precise planarity and surface fidelity.

2. Vision-Language Extraction and Geometric Refinement

2.1 Vision-Language Segmentation

The pipeline begins with multi-view images and semantic prompts processed by a foundation model (GroundedSAM). For each prompt (e.g., "floor"), the model infers pixel-wise masks and bounding boxes, producing initial planar proposals.
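The following is a minimal sketch of this step, assuming a hypothetical `segment_with_prompt(image, prompt)` wrapper around a GroundedSAM-style model; the paper does not specify the exact API, so the callable and its signature are illustrative:

```python
# Hedged sketch of language-prompted planar proposal extraction.
# `segment_with_prompt` is a hypothetical wrapper, not the GroundedSAM API.

PLANAR_PROMPTS = ["wall", "floor", "door"]  # prompts named in the text

def extract_planar_proposals(images, segment_with_prompt):
    """Collect, for each view, a {prompt: boolean (H, W) mask} dict of
    candidate planar regions inferred from the text prompts."""
    proposals = []
    for img in images:
        masks = {}
        for prompt in PLANAR_PROMPTS:
            # One pixel-wise mask per prompt; all-False when nothing matches.
            masks[prompt] = segment_with_prompt(img, prompt)
        proposals.append(masks)
    return proposals
```

These per-view masks are the initial planar proposals that the cross-view fusion stage then verifies and completes.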

2.2 Cross-View Fusion

Single-image segmentation is often susceptible to occlusions and boundary artifacts. LP3 employs cross-view geometric fusion to overcome these limitations:

  • Planar pixels in a source image are back-projected to 3D using the depth prior $D_r$ and camera intrinsics $\mathbf{K}$: $\mathbf{P}_s = D_r(\mathbf{p}_s) \cdot \mathbf{K}^{-1} \widetilde{\mathbf{p}}_s$.
  • The corresponding 3D points are transformed to the coordinate frame of target images and projected back to 2D: $\mathbf{P}_t = \mathbf{R}_t \mathbf{R}_s^T \mathbf{P}_s + (\mathbf{t}_t - \mathbf{R}_t \mathbf{R}_s^T \mathbf{t}_s)$ and $z_t \widetilde{\mathbf{p}}_t = \mathbf{K}_t \mathbf{P}_t$.
  • This scheme exploits complementary evidence across views to restore missed or partially segmented planes (a code sketch of the warping step follows this list).
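A minimal sketch of the warping step, assuming a world-to-camera convention $\mathbf{P}_{cam} = \mathbf{R}\,\mathbf{P}_{world} + \mathbf{t}$ (the convention and all names are illustrative assumptions):

```python
import numpy as np

def warp_planar_mask(mask_s, depth_s, K_s, K_t, R_s, t_s, R_t, t_t, shape_t):
    """Back-project planar pixels of a source view with the depth prior D_r,
    move them into the target camera frame, and reproject to a 2D mask."""
    H_t, W_t = shape_t
    v, u = np.nonzero(mask_s)                         # planar pixel coords
    z = depth_s[v, u]                                 # D_r(p_s)
    p_s = np.stack([u, v, np.ones_like(u)]).astype(np.float64)  # homogeneous
    P_s = z * (np.linalg.inv(K_s) @ p_s)              # P_s = D_r(p_s) K^{-1} p~_s
    R_rel = R_t @ R_s.T                               # relative rotation
    P_t = R_rel @ P_s + (t_t - R_rel @ t_s)[:, None]  # into target frame
    proj = K_t @ P_t                                  # z_t p~_t = K_t P_t
    z_t = proj[2]
    ok = z_t > 1e-6                                   # keep points in front
    u_t = np.round(proj[0, ok] / z_t[ok]).astype(int)
    v_t = np.round(proj[1, ok] / z_t[ok]).astype(int)
    inb = (u_t >= 0) & (u_t < W_t) & (v_t >= 0) & (v_t < H_t)
    warped = np.zeros((H_t, W_t), dtype=bool)
    warped[v_t[inb], u_t[inb]] = True
    return warped
```

Fusing the warped masks from several source views (e.g., by majority vote) restores plane regions that any single-view segmentation missed.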

2.3 Geometric Priors

Planar proposals are further validated and refined with geometric cues derived from multi-view dense depth (from DUSt3R):

  • Normals are computed via local plane fitting: $\mathbf{N}_{dr}(\mathbf{p}) = \frac{(\mathbf{P}_1 - \mathbf{P}_0) \times (\mathbf{P}_3 - \mathbf{P}_2)}{\|(\mathbf{P}_1 - \mathbf{P}_0) \times (\mathbf{P}_3 - \mathbf{P}_2)\|}$.
  • Plane-distance maps: $\delta_r(\mathbf{p}) = \mathbf{P} \cdot \mathbf{N}_{dr}(\mathbf{p})$.
  • K-means clustering discriminates non-parallel planes, while edge detection via outliers in $\delta_r(\mathbf{p})$ delineates boundaries even between parallel structures.
  • Non-planar and cluttered areas are filtered, yielding precise and reliable planar region assignments (see the sketch after this list).
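A minimal sketch of the normal and plane-distance computation, assuming the four points $\mathbf{P}_0 \ldots \mathbf{P}_3$ are the horizontal and vertical pixel neighbors (the exact stencil is an assumption):

```python
import numpy as np

def backproject(depth, K):
    """Back-project a dense depth map into per-pixel 3D points (3, H, W)."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    rays = np.linalg.inv(K) @ np.stack([u, v, np.ones_like(u)]).reshape(3, -1)
    return (depth.reshape(1, -1) * rays).reshape(3, H, W)

def normals_from_depth(depth, K):
    """N_dr(p) = (P1 - P0) x (P3 - P2), normalized; computed on the interior."""
    P = backproject(depth, K)
    dx = P[:, 1:-1, 2:] - P[:, 1:-1, :-2]   # P1 - P0 (left/right neighbors)
    dy = P[:, 2:, 1:-1] - P[:, :-2, 1:-1]   # P3 - P2 (up/down neighbors)
    n = np.cross(dx, dy, axis=0)
    n /= np.linalg.norm(n, axis=0, keepdims=True) + 1e-8
    return n, P[:, 1:-1, 1:-1]

def plane_distance(P, n):
    """delta_r(p) = P . N_dr(p): constant on a plane, so its outliers mark
    plane boundaries."""
    return np.sum(P * n, axis=0)
```

In practice the normals would then be clustered (e.g., k-means over the unit vectors) to separate non-parallel planes, and $\delta_r$ thresholded to split parallel ones, as described above.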

3. Injection into 3D Gaussian Splatting

The planar and geometric priors are integrated with the 3DGS pipeline through multiple regularization and supervision mechanisms:

3.1 Planar Prior Supervision

  • Plane-Guided Initialization: In regions with poor structure-from-motion support, Gaussian centroids are initialized by back-projecting masked planar pixels with prior depth. Existing points are relabeled to reflect planar assignments.
  • Gaussian Flattening: For each Gaussian encoded as $\mathcal{N}(\mu, S)$, the scale along one axis is regularized toward zero, enforcing flattening: $L_s = \left\| \min(s_1, s_2, s_3) \right\|_1$.
  • Co-Planarity Constraint: Planar models are fit locally via regularized least squares, $\mathbf{A}_m = (\mathbf{Q}_n^T \mathbf{Q}_n + \epsilon \mathbf{E})^{-1} \mathbf{Q}_n^T \mathbf{Y}_m$, and planar depth supervision is imposed: $L_p = \frac{1}{N_p} \sum_{\mathbf{p} \in p} \left| D_p(\mathbf{p}) - \hat{D}(\mathbf{p}) \right|_1$ (a sketch of these terms follows this list).

3.2 Geometric Prior Supervision

  • Depth Prior: The multi-view depth prior $D_r$ supervises the rendered depth $\hat{D}$ in low-texture regions: $L_{rd} = \frac{1}{N_{lt}} \sum_{\mathbf{p} \in lt} M_{cof}(\mathbf{p}) \cdot \left\| D_r(\mathbf{p}) - \hat{D}(\mathbf{p}) \right\|^2$.
  • Normal Prior: The rendered normal $\hat{\mathbf{N}}_d$ is aligned with $\mathbf{N}_{dr}$: $L_{rn} = \left\| \mathbf{N}_{dr} - \hat{\mathbf{N}}_d \right\|_1 + (1 - \mathbf{N}_{dr} \cdot \hat{\mathbf{N}}_d)$.
  • Depth-Normal Consistency: Internal consistency within 3DGS is enforced: $L_{dn} = \frac{1}{N_{lt}} \sum_{\mathbf{p} \in lt} \left\| \hat{\mathbf{N}}_d(\mathbf{p}) - \hat{\mathbf{N}}(\mathbf{p}) \right\|_1$ (a sketch of these losses follows this list).
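A minimal PyTorch sketch of the three geometric supervision losses, assuming normals are stored as (3, H, W) unit-vector maps and `lt_mask` marks low-texture pixels:

```python
import torch

def depth_prior_loss(D_r, D_hat, M_cof, lt_mask):
    """L_rd: confidence-weighted squared depth error on low-texture pixels."""
    err = M_cof * (D_r - D_hat) ** 2
    return err[lt_mask].mean()

def normal_prior_loss(N_dr, N_hat):
    """L_rn = ||N_dr - N_hat||_1 + (1 - N_dr . N_hat), averaged over pixels."""
    l1 = (N_dr - N_hat).abs().sum(dim=0)   # per-pixel L1 distance
    cos = (N_dr * N_hat).sum(dim=0)        # per-pixel cosine similarity
    return (l1 + (1.0 - cos)).mean()

def depth_normal_consistency(N_from_depth, N_hat, lt_mask):
    """L_dn: L1 gap between normals derived from rendered depth and the
    directly rendered normals, on low-texture pixels."""
    return (N_from_depth - N_hat).abs().sum(dim=0)[lt_mask].mean()
```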

3.3 Comprehensive Loss Function

All supervision terms are incorporated into the total loss: $L_{\text{total}} = L_{RGB} + L_s + \lambda_1 L_{dn} + \lambda_2 L_p + \lambda_3 L_{rd} + \lambda_4 L_{rn}$, where $L_{RGB}$ is the photometric supervision (L1 + D-SSIM), $L_s$ is the scale regularization, and the $\lambda_i$ are experimentally selected hyperparameters $(0.05, 0.5, 0.05, 0.2)$.
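Assembled as a sketch, with the weights from the reported hyperparameters (the dictionary keys are illustrative):

```python
def total_loss(L, lambdas=(0.05, 0.5, 0.05, 0.2)):
    """L_total = L_RGB + L_s + l1*L_dn + l2*L_p + l3*L_rd + l4*L_rn."""
    l1, l2, l3, l4 = lambdas
    return (L["rgb"] + L["s"]
            + l1 * L["dn"] + l2 * L["p"]
            + l3 * L["rd"] + l4 * L["rn"])
```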

4. Evaluation and Benchmark Performance

Comprehensive experiments on Replica, ScanNet++, and MuSHRoom datasets demonstrate the substantial quantitative and qualitative benefits of LP3-enhanced PlanarGS.

Results on the MuSHRoom dataset (lower is better for Acc, Comp, CD; higher for F1, NC):

Method          Acc↓    Comp↓   CD↓     F1↑     NC↑
PlanarGS        3.95    5.02    4.49    77.14   83.35
DN-Splatter     6.25    5.29    5.77    61.86   77.13
PGSR            7.52    13.50   10.51   59.11   73.27
2DGS            9.16    10.27   9.71    51.50   73.65
3DGS (classic)  12.01   11.85   11.92   38.53   62.00

PlanarGS achieves marked improvements across Chamfer Distance, F1, and Normal Consistency metrics, particularly in highly planar, low-texture regions. Competing methods often exhibit bending artifacts or fail to maintain planarity, while PlanarGS’s LP3 pipeline secures both global and local geometric accuracy.

Ablation studies confirm the necessity of LP3’s core components. Removal of planar priors or geometric supervision consistently degrades surface quality and metric scores. The method also exhibits robustness to alternate prompt choices or segmentation foundation models.

5. Broader Implications and Limitations

The LP3 methodology explicitly addresses a key challenge in indoor 3D reconstruction: reliably recovering extensive planar surfaces that lack photometric texture. By integrating semantic vision-language segmentation, multi-view geometric fusion, and direct supervision, LP3 enables volumetric models like 3DGS to achieve superior accuracy and realism in such environments. A plausible implication is that similar strategies may generalize to other environments or object classes where prior knowledge constrains geometric structure.

Despite performance gains, the method relies on the quality of segmentation and depth priors; segmentation errors or inaccurate depth predictions may propagate through the pipeline. Additionally, the process assumes that vision-language foundation models can robustly segment relevant planar surfaces under diverse conditions.

6. Relation to Prior Work

LP3 builds upon previous advances in 3DGS, vision-language segmentation, and geometric priors. It utilizes GroundedSAM for segmentation and DUSt3R for dense multi-view depth/normal estimation. Compared to approaches relying solely on photometric or local geometric constraints, LP3 demonstrates clear improvements in reconstructing planar regions, as substantiated by its comparative evaluation.

The explicit fusion of semantic and geometric cues for surface regularization marks a significant methodological advance over prior art, which at best mitigates, rather than eliminates, bending artifacts and often fails to recover planar geometry in low-texture regimes.

7. Prospects and Generalization

LP3’s fusion of semantic and geometric priors via language-prompted supervision suggests a broader trajectory for integrating high-level vision-language reasoning with dense geometric reconstruction. The robustness of the approach under varying prompts and foundation model choices indicates its generalizability and adaptability. Future extensions might incorporate additional semantic classes, refined geometric reasoning for non-planar structures, or more sophisticated multi-view data association.

LP3 establishes a template for leveraging vision-language and geometric priors for challenging scene reconstruction tasks, with demonstrated efficacy in high-fidelity planar surface recovery.

