
Co-PLNet: Collaborative Wireframe Parsing

Updated 29 January 2026
  • The paper presents Co-PLNet, a framework that jointly predicts junctions and line segments via spatial prompts to enforce geometric consistency.
  • It integrates point and line detection in an end-to-end system using a prompt encoder and cross-guidance decoder, eliminating endpoint mismatches and redundant proposals.
  • Empirical results show improved structural AP and reduced endpoint mismatches compared to prior methods, achieving real-time performance on benchmark datasets.

Co-PLNet is a collaborative neural architecture for wireframe parsing that enforces geometric consistency between predicted junctions and line segments by exchanging spatial cues (“spatial prompts”) early and continuously throughout processing. Wireframe parsing, which involves recovering both line segments and their junctions to form a structured geometric representation of a scene, is an essential precursor for downstream tasks such as Simultaneous Localization and Mapping (SLAM). Co-PLNet integrates point and line detection into a unified, end-to-end paradigm, unlike prior frameworks that treat the two tasks as independent and merge them in post-processing, which often leads to mismatched endpoints, geometrical inconsistency, and increased latency (Wang et al., 26 Jan 2026).

1. Motivation and Point-Line Collaboration

Prior art in wireframe parsing (e.g., L-CNN, HAWP, PLNet) handles junction (point) detection and line detection as independent subtasks, generating junctions and lines separately and reconciling them in a later post-hoc merging stage. This sequential paradigm incurs two significant drawbacks: (1) geometric mismatches between line endpoints and nearby junctions, which degrade the structural integrity of the parsed wireframes; (2) redundant proposals that require heavy post-filtering, adding computational overhead.

Co-PLNet addresses these issues by enforcing joint reasoning across both subtasks. The approach factorizes the joint posterior over the final wireframe $y = \{y_J, y_L\}$ for image $I$ as:

$$p(y \mid I) = \sum_{(y_J, y_L) \in \mathcal{S}(y)} p_J\bigl(y_J \mid I,\, y_L^{(0)}\bigr) \times p_L\bigl(y_L \mid I,\, y_J^{(0)}\bigr)$$

where $y_J^{(0)}$ and $y_L^{(0)}$ are coarse spatial prompts (see below) and $\mathcal{S}(y)$ enumerates possible line-junction pairings. Each task is thus conditioned on an early estimate from the other, enforcing point-line consistency from the initial stages and refining subsequent predictions through prompt-guided mutual influence.
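This conditioning can be sketched as a pair of prompt-conditioned branches. All function names below are hypothetical stand-ins for the PLP-Encoder and CGL-Decoder stages, not the paper's API:

```python
def parse_wireframe(image_feat, coarse_junc_fn, coarse_line_fn, junc_head, line_head):
    """Toy sketch of the factorized inference: each branch is conditioned on
    the OTHER branch's coarse spatial prompt, so point-line consistency is
    enforced from the first stage rather than in post-processing."""
    y_J0 = coarse_junc_fn(image_feat)   # coarse junction prompt y_J^(0)
    y_L0 = coarse_line_fn(image_feat)   # coarse line prompt y_L^(0)
    y_J = junc_head(image_feat, y_L0)   # p_J(y_J | I, y_L^(0))
    y_L = line_head(image_feat, y_J0)   # p_L(y_L | I, y_J^(0))
    return y_J, y_L
```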

2. Point-Line Prompt Encoder (PLP-Encoder)

The PLP-Encoder transforms coarse predictions of junctions and lines into dense, spatially aligned prompt maps that encode semantic geometry and are used to guide later decoder stages.

  • Feature Extraction: The input image $I$ is processed with a SuperPoint backbone and a lightweight U-Net, yielding a feature map $Z \in \mathbb{R}^{H \times W \times C}$.
  • Junction Parsing / Point Prompt:
    • Predicts a heatmap $H_J \in \mathbb{R}^{H \times W}$ for junctions and sub-pixel offsets $\Delta C_J \in \mathbb{R}^{H \times W \times 2}$.
    • Junction locations: $C_J(x, y) = [x, y] + \Delta C_J(x, y)$.
    • Confidence scores: $V_J(x, y) = \mathrm{softmax}(H_J(x, y))$.
    • After NMS and thresholding, high-confidence junctions $\{C_J, V_J\}$ are embedded into a dense point prompt map $y_J^{(0)} \in \mathbb{R}^{H \times W \times d_p}$ using two cascaded convolutions (kernel sizes $3 \times 3$ and $1 \times 1$, $d_p = 16$ channels).
  • Line Parsing / Line Prompt:
    • Using Holistically-Attracted Field Maps (HAFM), predicts 5 parameters per pixel $(d, \theta, \theta_1, \theta_2, r)$ to estimate line endpoints.
    • Dense proposals $C_L \in \mathbb{R}^{H \times W \times 4}$ are generated and converted via two convolutions into the line prompt $y_L^{(0)} \in \mathbb{R}^{H \times W \times d_l}$ ($d_l = 16$).
  • Prompt Semantics:
    • The point prompt encapsulates local junction density and endpoint distributions.
    • The line prompt encodes orientation statistics and coarse connectivity.

Both prompts are spatially aligned to the input grid, enabling efficient cross-attention in subsequent stages.
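The junction-parsing step can be sketched in NumPy. `extract_junctions` is a hypothetical name, and the NMS here is a simple 3×3 local-maximum filter rather than the paper's exact procedure:

```python
import numpy as np

def extract_junctions(heatmap, offsets, score_thresh=0.5, nms_radius=1):
    """Sketch of junction extraction: keep pixels that are local maxima of
    the heatmap above `score_thresh`, then refine each to sub-pixel
    precision via the predicted offsets (C_J = [x, y] + ΔC_J)."""
    H, W = heatmap.shape
    juncs, scores = [], []
    for y in range(H):
        for x in range(W):
            v = heatmap[y, x]
            if v < score_thresh:
                continue
            y0, y1 = max(0, y - nms_radius), min(H, y + nms_radius + 1)
            x0, x1 = max(0, x - nms_radius), min(W, x + nms_radius + 1)
            if v < heatmap[y0:y1, x0:x1].max():
                continue                    # suppressed by a stronger neighbour
            dx, dy = offsets[y, x]          # sub-pixel offset ΔC_J(x, y)
            juncs.append((x + dx, y + dy))  # refined junction location
            scores.append(v)
    return np.array(juncs), np.array(scores)
```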

3. Cross-Guidance Line Decoder (CGL-Decoder)

The CGL-Decoder implements mutual refinement of line and junction detections through local fusion and sparse cross-attention:

  • Local Feature Fusion: Feature tensors for lines and junctions are constructed by concatenating the raw features $Z$ with the respective prompt maps ($y_L^{(0)}$ or $y_J^{(0)}$), followed by convolution, yielding $\tilde{Z}_L$ and $\tilde{Z}_J$.
  • Sparse Multi-Head Cross-Attention: Within non-overlapping windows (size $w \times w$), projected features undergo standard multi-head attention:

$$\bar{Z}_L = \mathrm{MHA}\bigl(\psi(\tilde{Z}_L), \psi(Z), \psi(Z)\bigr)$$

$$\bar{Z}_J = \mathrm{MHA}\bigl(\psi(\tilde{Z}_J), \psi(Z), \psi(Z)\bigr)$$

Here, $\psi(\cdot)$ is a $1 \times 1$ convolutional projection.

  • Gated Residual Fusion: To suppress noisy prompts, outputs are fused via learnable gates:

$$Z'_L = Z + G_L \odot \bar{Z}_L, \quad Z'_J = Z + G_J \odot \bar{Z}_J$$

where $G_L, G_J \in [0, 1]^{H \times W \times C}$ and $\odot$ denotes channel-wise multiplication.

  • Final Parsing: Refined feature maps are used to re-predict lines and junctions. Line endpoints are associated with nearby junctions (within 10 pixels); duplicates are suppressed, and top-$k$ proposals are selected using a learned Line-Of-Interest (LOI) head.
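A minimal single-head NumPy sketch of the window-restricted cross-attention and gated residual fusion follows. The paper uses 4-head attention with learned $1 \times 1$ projections; a single head and identity projections are assumed here for brevity:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def windowed_cross_attention(q_feat, kv_feat, w=4):
    """Queries come from the prompt-fused features, keys/values from the raw
    features; attention is restricted to non-overlapping w×w windows.
    Assumes H and W are multiples of w."""
    H, W, C = q_feat.shape
    out = np.zeros_like(q_feat)
    for y0 in range(0, H, w):
        for x0 in range(0, W, w):
            Q = q_feat[y0:y0 + w, x0:x0 + w].reshape(-1, C)
            K = kv_feat[y0:y0 + w, x0:x0 + w].reshape(-1, C)
            A = softmax(Q @ K.T / np.sqrt(C))          # window-local attention
            out[y0:y0 + w, x0:x0 + w] = (A @ K).reshape(w, w, C)
    return out

def gated_residual(Z, Z_bar, gate):
    """Gated residual fusion: Z' = Z + G ⊙ Z̄, with G in [0,1]^{H×W×C}."""
    return Z + gate * Z_bar
```

With the gate driven to zero, the decoder falls back to the raw features, which is how noisy prompts are suppressed.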

4. Training Procedure and Loss Functions

Co-PLNet’s learnable components (PLP-Encoder excluding SuperPoint, CGL-Decoder, LOI head) are trained end-to-end.

  • Loss Function: The total objective is:

$$\mathcal{L} = \sum_{m \in \{\mathrm{PLP},\,\mathrm{CGL}\}} \Bigl(\mathcal{L}_\mathrm{line}^{(m)} + \mathcal{L}_\mathrm{junc}^{(m)} + \mathcal{L}_\mathrm{aux}^{(m)}\Bigr) + \mathcal{L}_\mathrm{LOI}$$

    • $\mathcal{L}_\mathrm{line}^{(m)}$: L1 losses on HAFM parameters.
    • $\mathcal{L}_\mathrm{junc}^{(m)}$: focal or L2 loss on junction heatmaps and L1 loss on sub-pixel offsets.
    • $\mathcal{L}_\mathrm{aux}^{(m)}$: geometric consistency between dense proposals and ground truth.
    • $\mathcal{L}_\mathrm{LOI}$: cross-entropy for line-of-interest confidence.
  • Optimization: Adam optimizer, initial learning rate $4 \times 10^{-4}$ for 35 epochs, decayed to $4 \times 10^{-5}$ for the final 5 epochs. Parameters: batch size 6, image size $512 \times 512$, prompt dimension $d_p = d_l = 16$, sparse attention with 4 heads ($d = 32$), window size $w = 8$.
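The two-stage objective can be assembled as in the following sketch. The dictionary keys are illustrative, and plain L1/BCE terms stand in everywhere (the paper uses focal/L2 losses for junction heatmaps):

```python
import numpy as np

def l1(pred, gt):
    return np.abs(pred - gt).mean()

def total_loss(preds, gts):
    """Sketch of the objective: line/junction/auxiliary terms are summed over
    the PLP (coarse) and CGL (refined) stages, plus the LOI term."""
    loss = 0.0
    for stage in ("plp", "cgl"):
        loss += l1(preds[stage]["hafm"], gts["hafm"])            # L_line: L1 on HAFM params
        loss += l1(preds[stage]["junc_heat"], gts["junc_heat"])  # L_junc (focal/L2 in the paper)
        loss += l1(preds[stage]["aux"], gts["aux"])              # L_aux: geometric consistency
    # L_LOI: binary cross-entropy on line-of-interest confidences
    p = np.clip(preds["loi"], 1e-7, 1 - 1e-7)
    loss += -(gts["loi"] * np.log(p) + (1 - gts["loi"]) * np.log(1 - p)).mean()
    return loss
```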

5. Evaluation Methodology and Results

  • Datasets: Wireframe (5,000 train / 462 test), YorkUrban (102 test).
  • Metrics:
    • Structural Average Precision (sAP) at endpoint thresholds 5/10/15 pixels.
    • Endpoint mismatch rate (% of endpoints without a nearby junction, $<$ 15 pixels).
    • Efficiency (FPS) on a single RTX 4080.
| Method | Wireframe sAP⁵ | Wireframe sAP¹⁰ | Wireframe sAP¹⁵ | York sAP⁵ | York sAP¹⁰ | York sAP¹⁵ | FPS |
|---|---|---|---|---|---|---|---|
| HAWPv2 [23] | 65.7 | 69.7 | 71.3 | 28.9 | 31.2 | 32.6 | 85.2 |
| Co-PLNet | 68.4 | 72.3 | 73.8 | 32.7 | 35.6 | 36.6 | 76.8 |

Co-PLNet achieves +2.7 sAP⁵ over HAWPv2 on Wireframe and +3.8 on YorkUrban, with moderately reduced FPS (76.8 vs. 85.2). Inference remains real-time at $512 \times 512$ resolution.
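Structural AP can be illustrated with a simplified greedy matcher: a prediction counts as a true positive when its summed squared endpoint distance to an unclaimed ground-truth segment falls below the threshold. This is a sketch of the metric's spirit, not the official evaluation code:

```python
import numpy as np

def sap(pred_lines, scores, gt_lines, thresh=10.0):
    """Simplified structural-AP sketch. `pred_lines`/`gt_lines` have shape
    (N, 2, 2): N segments, each with two (x, y) endpoints. Predictions are
    matched greedily in descending score order, trying both endpoint
    orderings; AP is the area under the resulting precision-recall curve."""
    order = np.argsort(-scores)
    matched = np.zeros(len(gt_lines), dtype=bool)
    tp = []
    for i in order:
        p = pred_lines[i]
        best, best_j = np.inf, -1
        for j, g in enumerate(gt_lines):
            if matched[j]:
                continue
            d = min(((p - g) ** 2).sum(), ((p - g[::-1]) ** 2).sum())
            if d < best:
                best, best_j = d, j
        if best_j >= 0 and best < thresh:
            matched[best_j] = True
            tp.append(1.0)
        else:
            tp.append(0.0)
    tp = np.array(tp)
    cum_tp = np.cumsum(tp)
    recall = cum_tp / max(len(gt_lines), 1)
    precision = cum_tp / (np.arange(len(tp)) + 1)
    # AP as the discrete area under the precision-recall curve
    return float(recall[0] * precision[0]) + \
        float(np.sum((recall[1:] - recall[:-1]) * precision[1:]))
```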

6. Component Contribution: Ablation Analysis

Ablation studies on greyscale images quantify the contribution of each component:

| PL | LP | LF | SA | Dense Attn | Wire sAP¹⁵ | York sAP¹⁵ | Mis (%) | FPS |
|---|---|---|---|---|---|---|---|---|
|   |   |   |   |   | 70.9 | 33.5 | 12.4 | 79.4 |
| ✓ |   |   |   |   | 72.3 | 34.8 | 11.2 | 78.5 |
|   | ✓ |   |   |   | 71.5 | 33.9 | 10.8 | 78.3 |
| ✓ | ✓ |   |   |   | 72.6 | 35.1 | 9.6 | 77.6 |
| ✓ | ✓ | ✓ | ✓ |   | 73.3 | 36.4 | 7.8 | 76.8 |
| ✓ | ✓ | ✓ |   | ✓ | 73.6 | 36.7 | 7.5 | 42.1 |
  • PL: point→line prompt; LP: line→point prompt; LF: local fusion; SA: sparse attention.
  • Both prompts jointly improve sAP¹⁵ by +1.7 and halve the mismatch rate.
  • Sparse attention adds +0.7 sAP¹⁵ at minor speed cost.
  • Dense attention marginally increases sAP but reduces FPS drastically.

7. Implications and Prospects

Co-PLNet demonstrates that prompt-guided interaction between line and point predictions improves both structural precision and computational efficiency for wireframe parsing. Notable gains include a reduction in endpoint mismatch rate (from 12.4% to 7.8%), structural AP improvements over SOTA, and real-time operation (76.8 FPS at $512 \times 512$). A plausible implication is enhanced SLAM robustness in low-texture and challenging scenes, as geometric priors are better respected during inference. The collaborative design and efficient, end-to-end workflow make Co-PLNet suitable for deployment in robotics, scene layout estimation, and high-level vision tasks.

Implementation resources, including code and pretrained models, are publicly available at https://github.com/GalacticHogrider/Co-PLNet (Wang et al., 26 Jan 2026).
