Co-PLNet: Collaborative Wireframe Parsing
- The paper presents Co-PLNet, a framework that jointly predicts junctions and line segments via spatial prompts to enforce geometric consistency.
- It integrates point and line detection in an end-to-end system using a prompt encoder and a cross-guidance decoder, reducing endpoint mismatches and redundant proposals.
- Empirical results show improved structural AP and reduced endpoint mismatches compared to prior methods, achieving real-time performance on benchmark datasets.
Co-PLNet is a collaborative neural architecture for wireframe parsing that enforces geometric consistency between predicted junctions and line segments by exchanging spatial cues (“spatial prompts”) early and continuously throughout processing. Wireframe parsing, which involves recovering both line segments and their junctions to form a structured geometric representation of a scene, is an essential precursor for downstream tasks such as Simultaneous Localization and Mapping (SLAM). Co-PLNet integrates point and line detection into a unified, end-to-end paradigm, unlike prior frameworks that treat the two tasks as independent and merge them in post-processing, which often leads to mismatched endpoints, geometrical inconsistency, and increased latency (Wang et al., 26 Jan 2026).
1. Motivation and Point-Line Collaboration
Prior art in wireframe parsing (e.g., L-CNN, HAWP, PLNet) handles junction (point) detection and line detection as independent subtasks, generating junctions and lines separately and reconciling them in a later post-hoc merging stage. This sequential paradigm incurs two significant drawbacks: (1) geometric mismatches between line endpoints and nearby junctions, degrading the structural integrity of the parsed wireframes; (2) redundant proposals with heavy post-filtering, resulting in additional computational overhead.
Co-PLNet addresses these issues by enforcing joint reasoning across both subtasks. The approach factorizes the joint posterior over the final wireframe $W = (J, L)$ for an image $I$ as

$$P(W \mid I) = \sum_{\pi} P\big(L \mid \tilde{P}_J, I\big)\, P\big(J \mid \tilde{P}_L, I\big)\, P\big(\pi \mid L, J, I\big),$$

where $\tilde{P}_J$ and $\tilde{P}_L$ are coarse spatial prompts (see below) and $\pi$ enumerates possible line-junction pairings. Each task is thus conditioned on an early estimate from the other, enforcing point-line consistency from the initial stages and refining subsequent predictions through prompt-guided mutual influence.
2. Point-Line Prompt Encoder (PLP-Encoder)
The PLP-Encoder transforms coarse predictions of junctions and lines into dense, spatially aligned prompt maps that encode semantic geometry and are used to guide later decoder stages.
- Feature Extraction: The input image is processed with a SuperPoint backbone and a lightweight U-Net, resulting in a shared feature map $F$.
- Junction Parsing / Point Prompt:
- Predicts a junction heatmap $H$ and sub-pixel offsets $O$.
- Junction locations: $\hat{p}_i = p_i + O(p_i)$, where $p_i$ are local maxima of $H$.
- Confidence scores: $s_i = H(p_i)$.
- After NMS and thresholding, high-confidence junctions are embedded into a dense point prompt map $P_J$ using two cascaded convolutions (kernel sizes 3×3 and 1×1) that project to the prompt channel dimension.
- Line Parsing / Line Prompt:
- Using Holistically-Attracted Field Maps (HAFM), the encoder predicts 5 parameters per pixel from which line endpoints are estimated.
- Dense line proposals are generated and converted via two convolutions into the line prompt map $P_L$.
- Prompt Semantics:
- The point prompt encapsulates local junction density and endpoint distributions.
- The line prompt encodes orientation statistics and coarse connectivity.
Both prompts are spatially aligned to the input grid, enabling efficient cross-attention in subsequent stages.
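To make the point-prompt pathway concrete, the following is a minimal PyTorch sketch of how a junction heatmap, sub-pixel offsets, max-pool NMS, and two cascaded convolutions could produce a dense prompt map. The module and parameter names (`PointPromptHead`, `prompt_ch`, the confidence threshold) are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PointPromptHead(nn.Module):
    """Turns backbone features into a dense point-prompt map (a sketch).

    Predicts a junction heatmap H and sub-pixel offsets O, keeps
    high-confidence local maxima via NMS + thresholding, and embeds them
    with two cascaded convolutions (3x3 then 1x1), as described above.
    """

    def __init__(self, in_ch: int, prompt_ch: int):
        super().__init__()
        self.heat = nn.Conv2d(in_ch, 1, kernel_size=1)    # junction heatmap logits
        self.offset = nn.Conv2d(in_ch, 2, kernel_size=1)  # sub-pixel (dx, dy)
        self.embed = nn.Sequential(                       # two cascaded convolutions
            nn.Conv2d(3, prompt_ch, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(prompt_ch, prompt_ch, kernel_size=1),
        )

    def forward(self, feat: torch.Tensor, thr: float = 0.5) -> torch.Tensor:
        heat = torch.sigmoid(self.heat(feat))             # (B,1,H,W) confidences
        off = torch.tanh(self.offset(feat))               # offsets within each cell
        # 3x3 max-pool NMS: keep only local maxima above the threshold.
        keep = (heat == F.max_pool2d(heat, 3, stride=1, padding=1)) & (heat > thr)
        sparse = torch.cat([heat * keep, off * keep], dim=1)  # (B,3,H,W) sparse peaks
        return self.embed(sparse)                         # dense point prompt P_J

# Usage sketch: 128-channel backbone features -> 64-channel prompt map.
prompt = PointPromptHead(in_ch=128, prompt_ch=64)(torch.randn(1, 128, 64, 64))
print(prompt.shape)  # torch.Size([1, 64, 64, 64])
```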
3. Cross-Guidance Line Decoder (CGL-Decoder)
The CGL-Decoder implements mutual refinement of line and junction detections through local fusion and sparse cross-attention:
- Local Feature Fusion: Feature tensors for lines and junctions are constructed by concatenating raw features with their respective prompt maps ($P_L$ or $P_J$), followed by a convolution, yielding fused features $F_L$ and $F_J$.
- Sparse Multi-Head Cross-Attention: Within fixed-size, non-overlapping windows, projected features undergo standard multi-head attention:

  $$\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V, \quad Q = \phi_q(F_L),\ K = \phi_k(F_J),\ V = \phi_v(F_J)$$

  (and symmetrically for the junction stream), where each $\phi$ is a convolutional projection.
- Gated Residual Fusion: To suppress noisy prompts, attention outputs are fused via learnable gates:

  $$F' = F + g \odot \mathrm{Attn}(Q, K, V), \quad g = \sigma(\mathrm{Conv}(F)),$$

  where $\odot$ denotes channel-wise multiplication and $\sigma$ is the sigmoid.
- Final Parsing: Refined feature maps are used to re-predict lines and junctions. Line endpoints are associated with nearby junctions (within 10 pixels); duplicates are suppressed, and top-$K$ proposals are selected using a learned Line-Of-Interest (LOI) head.
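The sketch below illustrates the windowed cross-attention with gated residual fusion in PyTorch, showing only the line stream attending to junction features (the junction stream is symmetric). The window size, head count, and the exact gate parameterization are assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseCrossAttention(nn.Module):
    """Line features attend to junction features within non-overlapping windows."""

    def __init__(self, ch: int, heads: int = 4, win: int = 8):
        super().__init__()
        self.heads, self.win = heads, win
        self.q = nn.Conv2d(ch, ch, 1)     # convolutional projections (phi)
        self.k = nn.Conv2d(ch, ch, 1)
        self.v = nn.Conv2d(ch, ch, 1)
        self.gate = nn.Conv2d(ch, ch, 1)  # learnable gate to damp noisy prompts

    def _windows(self, x: torch.Tensor) -> torch.Tensor:
        # (B,C,H,W) -> (B*nWindows, win*win, C) non-overlapping windows
        B, C, H, W = x.shape
        x = x.unfold(2, self.win, self.win).unfold(3, self.win, self.win)
        return x.permute(0, 2, 3, 4, 5, 1).reshape(-1, self.win * self.win, C)

    def forward(self, f_line: torch.Tensor, f_junc: torch.Tensor) -> torch.Tensor:
        B, C, H, W = f_line.shape
        q = self._windows(self.q(f_line))
        k = self._windows(self.k(f_junc))
        v = self._windows(self.v(f_junc))
        # standard scaled dot-product attention per window, multi-head via reshape
        d = C // self.heads
        q = q.view(q.shape[0], -1, self.heads, d).transpose(1, 2)
        k = k.view(k.shape[0], -1, self.heads, d).transpose(1, 2)
        v = v.view(v.shape[0], -1, self.heads, d).transpose(1, 2)
        out = F.scaled_dot_product_attention(q, k, v)  # softmax(QK^T / sqrt(d)) V
        out = out.transpose(1, 2).reshape(-1, self.win * self.win, C)
        # fold windows back to (B,C,H,W)
        nH, nW = H // self.win, W // self.win
        out = out.view(B, nH, nW, self.win, self.win, C)
        out = out.permute(0, 5, 1, 3, 2, 4).reshape(B, C, H, W)
        # gated residual fusion: F' = F + sigmoid(g(F)) * Attn
        return f_line + torch.sigmoid(self.gate(f_line)) * out

x = torch.randn(1, 64, 32, 32)
print(SparseCrossAttention(64)(x, x).shape)  # torch.Size([1, 64, 32, 32])
```

Restricting attention to local windows keeps the cost linear in image area, which is what lets the sparse variant retain most of the accuracy of dense attention at nearly twice the FPS (see the ablation in Section 6).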
4. Training Procedure and Loss Functions
Co-PLNet’s learnable components (PLP-Encoder excluding SuperPoint, CGL-Decoder, LOI head) are trained end-to-end.
- Loss Function: The total objective is (a sketch of its combination follows this list)

  $$\mathcal{L} = \mathcal{L}_{\mathrm{HAFM}} + \mathcal{L}_{\mathrm{junc}} + \mathcal{L}_{\mathrm{geo}} + \mathcal{L}_{\mathrm{LOI}}$$

- $\mathcal{L}_{\mathrm{HAFM}}$: L1 losses on HAFM parameters.
- $\mathcal{L}_{\mathrm{junc}}$: focal or L2 loss on junction heatmaps and L1 on sub-pixel offsets.
- $\mathcal{L}_{\mathrm{geo}}$: geometric consistency between dense proposals and ground truth.
- $\mathcal{L}_{\mathrm{LOI}}$: cross-entropy for line-of-interest confidence.
- Optimization: Adam optimizer; the initial learning rate is held for the first 35 epochs and decayed for the final 5. Training uses a batch size of 6, and sparse attention uses 4 heads over fixed, non-overlapping windows.
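A minimal sketch of how the four loss terms could be combined in PyTorch. The dict keys, the choice of L2 for the heatmap term, the unit loss weights, and the decay factor in the commented schedule are illustrative assumptions; the paper's exact values are not given here.

```python
import torch
import torch.nn.functional as F

def total_loss(pred: dict, gt: dict, lam=(1.0, 1.0, 1.0, 1.0)) -> torch.Tensor:
    # L_HAFM: L1 on the 5 per-pixel HAFM line parameters
    l_hafm = F.l1_loss(pred["hafm"], gt["hafm"])
    # L_junc: heatmap term (L2 here; a focal loss is the alternative) + L1 offsets
    l_junc = F.mse_loss(pred["heat"], gt["heat"]) \
           + F.l1_loss(pred["offset"], gt["offset"])
    # L_geo: geometric consistency between dense line proposals and ground truth
    l_geo = F.l1_loss(pred["endpoints"], gt["endpoints"])
    # L_LOI: cross-entropy on line-of-interest confidences
    l_loi = F.binary_cross_entropy_with_logits(pred["loi_logits"], gt["loi_labels"])
    return lam[0] * l_hafm + lam[1] * l_junc + lam[2] * l_geo + lam[3] * l_loi

# Schedule matching the reported 35 + 5 epoch split (decay factor assumed):
# opt = torch.optim.Adam(model.parameters(), lr=base_lr)
# sched = torch.optim.lr_scheduler.MultiStepLR(opt, milestones=[35], gamma=0.1)
```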
5. Evaluation Methodology and Results
- Datasets: Wireframe (5,000 train / 462 test), YorkUrban (102 test).
- Metrics:
- Structural Average Precision (sAP) at endpoint thresholds 5/10/15 pixels.
- Endpoint mismatch rate: the percentage of predicted line endpoints without a junction within 15 pixels.
- Efficiency (FPS) on a single RTX 4080.
| Method | Wireframe sAP⁵ | Wireframe sAP¹⁰ | Wireframe sAP¹⁵ | York sAP⁵ | York sAP¹⁰ | York sAP¹⁵ | FPS |
|---|---|---|---|---|---|---|---|
| HAWPv2 [23] | 65.7 | 69.7 | 71.3 | 28.9 | 31.2 | 32.6 | 85.2 |
| Co-PLNet | 68.4 | 72.3 | 73.8 | 32.7 | 35.6 | 36.6 | 76.8 |
Co-PLNet achieves +2.7 sAP⁵ over HAWPv2 on Wireframe and +3.8 on YorkUrban, at a moderate FPS cost (76.8 vs. 85.2); inference remains real-time at the evaluated input resolution.
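As a concrete reference for the mismatch metric, the sketch below computes it with NumPy; the array layouts and function name are illustrative assumptions, not the official evaluation code.

```python
import numpy as np

def endpoint_mismatch_rate(lines: np.ndarray, junctions: np.ndarray,
                           thr: float = 15.0) -> float:
    """lines: (N, 2, 2) endpoint coordinates; junctions: (M, 2) points.

    Returns the percentage of endpoints with no junction within thr pixels.
    """
    endpoints = lines.reshape(-1, 2)                      # (2N, 2)
    # pairwise distances between every endpoint and every junction
    d = np.linalg.norm(endpoints[:, None, :] - junctions[None, :, :], axis=-1)
    matched = d.min(axis=1) <= thr                        # nearest junction in range
    return 100.0 * (1.0 - matched.mean())                 # percent mismatched

lines = np.array([[[0, 0], [10, 10]], [[50, 50], [80, 80]]], dtype=float)
juncs = np.array([[0, 1], [10, 9], [49, 51]], dtype=float)
print(endpoint_mismatch_rate(lines, juncs))  # 25.0: one of four endpoints unmatched
```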
6. Component Contribution: Ablation Analysis
Ablation studies on greyscale images quantify the contribution of each component:
| PL | LP | LF | SA | Dense Attn | Wire sAP¹⁵ | York sAP¹⁵ | Mis (%) | FPS |
|---|---|---|---|---|---|---|---|---|
| – | – | – | – | – | 70.9 | 33.5 | 12.4 | 79.4 |
| ✓ | – | ✓ | – | – | 72.3 | 34.8 | 11.2 | 78.5 |
| – | ✓ | ✓ | – | – | 71.5 | 33.9 | 10.8 | 78.3 |
| ✓ | ✓ | ✓ | – | – | 72.6 | 35.1 | 9.6 | 77.6 |
| ✓ | ✓ | ✓ | ✓ | – | 73.3 | 36.4 | 7.8 | 76.8 |
| ✓ | ✓ | ✓ | – | ✓ | 73.6 | 36.7 | 7.5 | 42.1 |
- PL: point→line prompt; LP: line→point prompt; LF: local fusion; SA: sparse attention; Dense Attn: full (non-windowed) attention in place of SA.
- Both prompts together with local fusion improve sAP¹⁵ by +1.7 (70.9 → 72.6) and reduce the mismatch rate from 12.4% to 9.6%.
- Sparse attention adds a further +0.7 sAP¹⁵ (72.6 → 73.3) at minor speed cost, lowering the mismatch rate to 7.8%.
- Dense attention yields only a marginal +0.3 sAP¹⁵ over sparse attention while nearly halving FPS (76.8 → 42.1).
7. Implications and Prospects
Co-PLNet demonstrates that prompt-guided interaction between line and point predictions improves both structural precision and computational efficiency for wireframe parsing. Notable gains include a reduction in endpoint mismatch rate (from 12.4% to 7.8%), structural AP improvements over the prior state of the art, and real-time operation (76.8 FPS). A plausible implication is enhanced SLAM robustness in low-texture and challenging scenes, as geometric priors are better respected during inference. The collaborative design and efficient, end-to-end workflow make Co-PLNet suitable for deployment in robotics, scene layout estimation, and high-level vision tasks.
Implementation resources, including code and pretrained models, are publicly available at https://github.com/GalacticHogrider/Co-PLNet (Wang et al., 26 Jan 2026).