PLP-Encoder in Structured Wireframe Parsing
- The paper introduces PLP-Encoder, which transforms independent point and line detections into spatial prompt maps to enhance geometric alignment.
- It leverages bi-directional interactions and sparse cross-attention mechanisms to refine junction and line proposals within the Co-PLNet framework.
- Empirical results show a 2.7-point sAP improvement and a significant reduction in endpoint mismatches, supporting its applicability in real-time SLAM and 3D reconstruction.
A Point-Line Prompt Encoder (PLP-Encoder) is a neural module devised to transform early, independent detections of junctions (points) and line segments into spatially aligned prompt maps encoding both semantic and geometric attributes. Introduced within the Co-PLNet framework for structured wireframe parsing, the PLP-Encoder is architected to facilitate bi-directional interaction between junction and line proposals, improving spatial consistency and robustness critical for downstream tasks such as Simultaneous Localization and Mapping (SLAM) (Wang et al., 26 Jan 2026).
1. Point-Line Interaction Rationale
Legacy wireframe parsers (e.g., L-CNN, HAWP, PLNet) perform independent detection of lines and junctions, subsequently merging their outputs via post-processing. This approach often produces endpoint–junction mismatches and impairs geometric integrity, which is detrimental to SLAM and 3D scene understanding tasks that leverage both primitives. The PLP-Encoder is designed to mitigate these deficiencies by allowing prompt-driven, collaborative refinement within the network. In Co-PLNet, initial coarse proposals from both domains are encoded to prompt maps by the PLP-Encoder and consumed by subsequent decoding modules, enabling mutual conditioning that enforces structural alignment.
2. PLP-Encoder Architectural Details
The PLP-Encoder produces two distinct prompt maps:
- The point prompt map, encoding refined junction locations and confidences at each spatial coordinate;
- The line prompt map, encoding line support at each spatial coordinate together with its tangent orientation.
2.1 Junction Prompt Encoding
- Features are extracted by combining frozen SuperPoint and U-Net backbones.
- A per-pixel heatmap and sub-pixel offset prediction are computed to localize refined junctions.
- Junction confidence scores are normalized.
- Top-$K$ junctions are selected via non-maximum suppression (NMS) and thresholding, then re-scattered into a dense map.
- Final prompt encoding is performed by sequential 3×3 convolutions with ReLU activations, yielding a $16$-channel spatial point prompt.
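The junction selection and re-scattering steps above can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the function names, array shapes, and the greedy square-window NMS are assumptions, and the 3×3 convolution stack that produces the final 16-channel prompt is omitted.

```python
import numpy as np

def topk_junctions(heatmap, offsets, k=3, thresh=0.5, nms_radius=1):
    """Greedily pick up to k junction candidates from a per-pixel heatmap.
    Hypothetical shapes: heatmap [H, W], offsets [2, H, W] (dy, dx)."""
    h = heatmap.copy()
    picks = []
    while len(picks) < k:
        y, x = np.unravel_index(np.argmax(h), h.shape)
        if h[y, x] < thresh:
            break
        # refine the integer location with the predicted sub-pixel offset
        picks.append((y + offsets[0, y, x], x + offsets[1, y, x], h[y, x]))
        # greedy NMS: zero out a small neighbourhood around the pick
        h[max(0, y - nms_radius):y + nms_radius + 1,
          max(0, x - nms_radius):x + nms_radius + 1] = 0.0
    return picks

def scatter_junctions(picks, shape):
    """Re-scatter the selected junctions into a dense map (the conv stack
    that turns this into the 16-channel prompt is not shown)."""
    dense = np.zeros(shape, dtype=np.float32)
    for yy, xx, score in picks:
        yi = int(np.clip(round(yy), 0, shape[0] - 1))
        xi = int(np.clip(round(xx), 0, shape[1] - 1))
        dense[yi, xi] = score
    return dense
```

Greedy selection plus suppression keeps only spatially separated, high-confidence junctions, which is what makes the re-scattered map a clean spatial prompt.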
2.2 Line Prompt Encoding
- HAFM-style per-pixel maps encode line parameters: perpendicular distance, tangent orientation, endpoint angles, and sub-pixel refinement offsets.
- Endpoint coordinates at each pixel are recovered by rotating the predicted distance and endpoint-angle offsets with the rotation matrix defined by the tangent orientation.
- The resulting proposals are aggregated into a dense line-proposal map.
- The line prompt encoding mirrors the junction formulation, with sequential 3×3 convolutions and ReLU activations producing a $16$-channel spatial line prompt.
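The endpoint recovery described above can be sketched geometrically. The parameterization below is a simplified, assumed HAFM-style form (the paper's exact equations are not reproduced in this summary): `dist` is the perpendicular distance from the pixel to the line, `theta` the tangent orientation, and `alpha1`/`alpha2` the angles from the perpendicular foot point toward each endpoint.

```python
import numpy as np

def decode_endpoints(p, dist, theta, alpha1, alpha2):
    """Decode the two endpoints of the line attracted to pixel p,
    under a simplified HAFM-style parameterization (assumption)."""
    p = np.asarray(p, dtype=float)
    # rotation matrix for the tangent orientation
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    # foot of the perpendicular from p onto the line, in the rotated frame
    foot = p + R @ np.array([0.0, dist])
    # endpoints lie along the tangent direction at signed offsets
    e1 = foot + R @ np.array([dist * np.tan(alpha1), 0.0])
    e2 = foot + R @ np.array([dist * np.tan(alpha2), 0.0])
    return e1, e2
```

Because every pixel near a line decodes to (approximately) the same endpoint pair, aggregating these per-pixel decodings yields stable line proposals.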
3. Integration with Cross-Guidance Decoding
PLP-Encoder outputs, the point and line prompt maps, serve as spatial prompts for the subsequent Cross-Guidance Line Decoder (CGL-Decoder). These prompts are concatenated, fused with local U-Net features, and then modulate cross-domain refinement via sparse multi-head attention partitioned into spatial windows whose size is chosen to balance accuracy and speed. Attentional updates are gated and residually fused, improving wireframe delineation. Endpoint-to-junction associations are performed within a $10$px neighborhood, and final wireframes are scored by a Line-of-Interest (LOI) MLP module.
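The window-partitioned sparse attention can be illustrated with a minimal single-head, self-attention sketch in NumPy. This is not the CGL-Decoder itself: the function name, window size, and the use of self- rather than cross-attention are simplifying assumptions; the point is that attention cost is confined to each window rather than the full feature map.

```python
import numpy as np

def window_attention(x, win=4):
    """Single-head scaled dot-product attention restricted to
    non-overlapping spatial windows.
    x: features of shape [H, W, C], with H and W divisible by win."""
    H, W, C = x.shape
    out = np.empty_like(x)
    for i in range(0, H, win):
        for j in range(0, W, win):
            tokens = x[i:i + win, j:j + win].reshape(-1, C)  # [win*win, C]
            attn = tokens @ tokens.T / np.sqrt(C)            # scaled dot-product
            attn = np.exp(attn - attn.max(axis=1, keepdims=True))
            attn /= attn.sum(axis=1, keepdims=True)          # row-wise softmax
            out[i:i + win, j:j + win] = (attn @ tokens).reshape(win, win, C)
    return out
```

Dense attention over an $H \times W$ map costs $O((HW)^2)$, whereas windowing reduces it to $O(HW \cdot \mathrm{win}^2)$, which is why the sparse variant preserves throughput in the ablations.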
4. Training Protocols and Loss Functions
All trainable components, except the frozen SuperPoint feature extractor, are optimized end-to-end under an aggregate loss that sums four terms:
- a line-parameter term: L1/Huber loss on the HAFM parameters,
- a junction term: cross-entropy on the junction heatmap plus L2 offset regression,
- an auxiliary term enforcing geometric conformance,
- a Line-of-Interest term: binary cross-entropy on LOI confidence scores.
Optimization is performed using Adam with a scheduled learning rate, batch size 6, prompt dimension 16, cross-attention dimension 32, and four attention heads.
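Two of the loss ingredients above, the Huber regression term and the binary cross-entropy term, can be sketched as follows. These are standard textbook definitions, not the paper's code; the term weighting in the aggregate loss is not specified in this summary, so no combination weights are shown.

```python
import numpy as np

def huber(pred, target, delta=1.0):
    """Huber loss: quadratic for small errors, linear beyond delta.
    Used here for the HAFM line-parameter regression term (sketch)."""
    err = np.abs(pred - target)
    quad = np.minimum(err, delta)
    return np.mean(0.5 * quad**2 + delta * (err - quad))

def bce(prob, label, eps=1e-7):
    """Binary cross-entropy on predicted probabilities, as used for the
    LOI confidence term (sketch)."""
    p = np.clip(prob, eps, 1.0 - eps)
    return -np.mean(label * np.log(p) + (1.0 - label) * np.log(1.0 - p))
```

The Huber transition keeps gradients bounded for outlier line parameters, while clipping inside `bce` guards against `log(0)` on saturated confidences.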
5. Empirical Performance and Ablation
Evaluations on the Wireframe and YorkUrban datasets, using structural AP (sAP) at several endpoint tolerances and the endpoint mismatch rate, demonstrate that PLP-Encoder integration yields:
- A $2.5$-point sAP improvement over HAWPv2 at $15$px tolerance on Wireframe ($73.8$ vs. $71.3$),
- A substantial reduction in endpoint mismatch (from $12.4\%$ to $7.8\%$ in the ablation below),
- Near real-time efficiency (up to $76.8$ FPS).
Ablation studies confirm the mutual benefit of point-to-line and line-to-point prompt exchange and validate the chosen window size for sparse cross-attention. The presence of prompt maps reduces mismatch and boosts sAP, while sparse attention preserves throughput relative to dense alternatives.
| Configuration | sAP15 (Wireframe) | Endpoint Mismatch (%) | FPS |
|---|---|---|---|
| Baseline | 70.9 | 12.4 | 79.4 |
| + point-to-line prompt (PL) | 72.3 | 11.2 | 78.5 |
| + both prompts + sparse attention | 73.3 | 7.8 | 76.8 |
| + both prompts + dense attention | 73.6 | 7.5 | 42.1 |
6. Downstream Applications and Significance
The PLP-Encoder, as realized in Co-PLNet, enables robust wireframe parsing suitable for SLAM systems that require geometry-consistent line and point features. Significant sAP and mismatch improvements support enhanced performance in illumination-robust visual odometry pipelines and higher-level 3D reconstruction. A plausible implication is increased reliability for real-time structure-from-motion, robotics navigation, and indoor/outdoor scene parsing. Its methodology advances prompt-guided structured geometry perception, marking a trend toward the integration of 2D wireframe inference with 3D mapping workflows (Wang et al., 26 Jan 2026).