Co-PLNet: Collaborative Wireframe Parsing
- The paper presents Co-PLNet, a framework that jointly predicts junctions and line segments via spatial prompts to enforce geometric consistency.
- It integrates point and line detection in an end-to-end system using a prompt encoder and a cross-guidance decoder, reducing endpoint mismatches and redundant proposals.
- Empirical results show improved structural AP and reduced endpoint mismatches compared to prior methods, achieving real-time performance on benchmark datasets.
Co-PLNet is a collaborative neural architecture for wireframe parsing that enforces geometric consistency between predicted junctions and line segments by exchanging spatial cues (“spatial prompts”) early and continuously throughout processing. Wireframe parsing, which involves recovering both line segments and their junctions to form a structured geometric representation of a scene, is an essential precursor for downstream tasks such as Simultaneous Localization and Mapping (SLAM). Co-PLNet integrates point and line detection into a unified, end-to-end paradigm, unlike prior frameworks that treat the two tasks as independent and merge them in post-processing, which often leads to mismatched endpoints, geometrical inconsistency, and increased latency (Wang et al., 26 Jan 2026).
1. Motivation and Point-Line Collaboration
Prior art in wireframe parsing (e.g., L-CNN, HAWP, PLNet) handles junction (point) detection and line detection as independent subtasks, generating junctions and lines separately and reconciling them in a later post-hoc merging stage. This sequential paradigm incurs two significant drawbacks: (1) geometric mismatches between line endpoints and nearby junctions, degrading the structural integrity of the parsed wireframes; (2) redundant proposals with heavy post-filtering, resulting in additional computational overhead.
Co-PLNet addresses these issues by enforcing joint reasoning across both subtasks. The approach factorizes the joint posterior over the final wireframe $W = (J, L)$ for an image $I$ as

$$P(W \mid I) = \sum_{\pi} P\big(L \mid \tilde{P}_J, I\big)\, P\big(J \mid \tilde{P}_L, I\big)\, P\big(\pi \mid L, J, I\big),$$

where $\tilde{P}_J$ and $\tilde{P}_L$ are coarse spatial prompts (see below) and $\pi$ enumerates possible line-junction pairings. Each task is thus conditioned on an early estimate from the other, enforcing point-line consistency from the initial stages and refining subsequent predictions through prompt-guided mutual influence.
2. Point-Line Prompt Encoder (PLP-Encoder)
The PLP-Encoder transforms coarse predictions of junctions and lines into dense, spatially aligned prompt maps that encode semantic geometry and are used to guide later decoder stages.
- Feature Extraction: The input image is processed with a SuperPoint backbone and a lightweight U-Net, resulting in a shared feature map $F$.
- Junction Parsing / Point Prompt:
- Predicts a junction heatmap $H$ and sub-pixel offsets $O$.
- Junction locations: $\hat{p}_i = p_i + O(p_i)$, where $p_i$ are local maxima of $H$.
- Confidence scores: $s_i = H(p_i)$.
- After NMS and thresholding, high-confidence junctions are embedded into a dense point prompt map $P_J$ using two cascaded convolutions (kernel sizes 3×3 and 1×1) that project to the prompt channel dimension.
- Line Parsing / Line Prompt:
- Using Holistically-Attracted Field Maps (HAFM), the encoder predicts 5 parameters per pixel from which line endpoints are estimated.
- Dense line proposals are generated and converted via two convolutions into the line prompt map $P_L$.
- Prompt Semantics:
- The point prompt encapsulates local junction density and endpoint distributions.
- The line prompt encodes orientation statistics and coarse connectivity.
Both prompts are spatially aligned to the input grid, enabling efficient cross-attention in subsequent stages.
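To make the point-prompt pathway concrete, the following is a minimal PyTorch sketch of how a junction heatmap, sub-pixel offsets, max-pool NMS, and two cascaded convolutions could produce a dense prompt map. The module and parameter names (`PointPromptHead`, `prompt_ch`, the confidence threshold) are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PointPromptHead(nn.Module):
    """Turns backbone features into a dense point-prompt map (a sketch).

    Predicts a junction heatmap H and sub-pixel offsets O, keeps
    high-confidence local maxima via NMS + thresholding, and embeds them
    with two cascaded convolutions (3x3 then 1x1), as described above.
    """

    def __init__(self, in_ch: int, prompt_ch: int):
        super().__init__()
        self.heat = nn.Conv2d(in_ch, 1, kernel_size=1)    # junction heatmap logits
        self.offset = nn.Conv2d(in_ch, 2, kernel_size=1)  # sub-pixel (dx, dy)
        self.embed = nn.Sequential(                       # two cascaded convolutions
            nn.Conv2d(3, prompt_ch, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(prompt_ch, prompt_ch, kernel_size=1),
        )

    def forward(self, feat: torch.Tensor, thr: float = 0.5) -> torch.Tensor:
        heat = torch.sigmoid(self.heat(feat))             # (B,1,H,W) confidences
        off = torch.tanh(self.offset(feat))               # offsets within each cell
        # 3x3 max-pool NMS: keep only local maxima above the threshold.
        keep = (heat == F.max_pool2d(heat, 3, stride=1, padding=1)) & (heat > thr)
        sparse = torch.cat([heat * keep, off * keep], dim=1)  # (B,3,H,W) sparse peaks
        return self.embed(sparse)                         # dense point prompt P_J

# Usage sketch: 128-channel backbone features -> 64-channel prompt map.
prompt = PointPromptHead(in_ch=128, prompt_ch=64)(torch.randn(1, 128, 64, 64))
print(prompt.shape)  # torch.Size([1, 64, 64, 64])
```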
3. Cross-Guidance Line Decoder (CGL-Decoder)
The CGL-Decoder implements mutual refinement of line and junction detections through local fusion and sparse cross-attention:
- Local Feature Fusion: Feature tensors for lines and junctions are constructed by concatenating raw features with their respective prompt maps ($P_L$ or $P_J$), followed by a convolution, yielding fused features $F_L$ and $F_J$.
- Sparse Multi-Head Cross-Attention: Within fixed-size, non-overlapping windows, projected features undergo standard multi-head attention:

  $$\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V, \quad Q = \phi_q(F_L),\ K = \phi_k(F_J),\ V = \phi_v(F_J)$$

  (and symmetrically for the junction stream), where each $\phi$ is a convolutional projection.
- Gated Residual Fusion: To suppress noisy prompts, attention outputs are fused via learnable gates:

  $$F' = F + g \odot \mathrm{Attn}(Q, K, V), \quad g = \sigma(\mathrm{Conv}(F)),$$

  where $\odot$ denotes channel-wise multiplication and $\sigma$ is the sigmoid.
- Final Parsing: Refined feature maps are used to re-predict lines and junctions. Line endpoints are associated with nearby junctions (within 10 pixels); duplicates are suppressed, and top-$K$ proposals are selected using a learned Line-Of-Interest (LOI) head.
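The sketch below illustrates the windowed cross-attention with gated residual fusion in PyTorch, showing only the line stream attending to junction features (the junction stream is symmetric). The window size, head count, and the exact gate parameterization are assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseCrossAttention(nn.Module):
    """Line features attend to junction features within non-overlapping windows."""

    def __init__(self, ch: int, heads: int = 4, win: int = 8):
        super().__init__()
        self.heads, self.win = heads, win
        self.q = nn.Conv2d(ch, ch, 1)     # convolutional projections (phi)
        self.k = nn.Conv2d(ch, ch, 1)
        self.v = nn.Conv2d(ch, ch, 1)
        self.gate = nn.Conv2d(ch, ch, 1)  # learnable gate to damp noisy prompts

    def _windows(self, x: torch.Tensor) -> torch.Tensor:
        # (B,C,H,W) -> (B*nWindows, win*win, C) non-overlapping windows
        B, C, H, W = x.shape
        x = x.unfold(2, self.win, self.win).unfold(3, self.win, self.win)
        return x.permute(0, 2, 3, 4, 5, 1).reshape(-1, self.win * self.win, C)

    def forward(self, f_line: torch.Tensor, f_junc: torch.Tensor) -> torch.Tensor:
        B, C, H, W = f_line.shape
        q = self._windows(self.q(f_line))
        k = self._windows(self.k(f_junc))
        v = self._windows(self.v(f_junc))
        # standard scaled dot-product attention per window, multi-head via reshape
        d = C // self.heads
        q = q.view(q.shape[0], -1, self.heads, d).transpose(1, 2)
        k = k.view(k.shape[0], -1, self.heads, d).transpose(1, 2)
        v = v.view(v.shape[0], -1, self.heads, d).transpose(1, 2)
        out = F.scaled_dot_product_attention(q, k, v)  # softmax(QK^T / sqrt(d)) V
        out = out.transpose(1, 2).reshape(-1, self.win * self.win, C)
        # fold windows back to (B,C,H,W)
        nH, nW = H // self.win, W // self.win
        out = out.view(B, nH, nW, self.win, self.win, C)
        out = out.permute(0, 5, 1, 3, 2, 4).reshape(B, C, H, W)
        # gated residual fusion: F' = F + sigmoid(g(F)) * Attn
        return f_line + torch.sigmoid(self.gate(f_line)) * out

x = torch.randn(1, 64, 32, 32)
print(SparseCrossAttention(64)(x, x).shape)  # torch.Size([1, 64, 32, 32])
```

Restricting attention to local windows keeps the cost linear in image area, which is what lets the sparse variant retain most of the accuracy of dense attention at nearly twice the FPS (see the ablation in Section 6).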
4. Training Procedure and Loss Functions
Co-PLNet’s learnable components (PLP-Encoder excluding SuperPoint, CGL-Decoder, LOI head) are trained end-to-end.
- Loss Function: The total objective is (a sketch of its combination follows this list)

  $$\mathcal{L} = \mathcal{L}_{\mathrm{HAFM}} + \mathcal{L}_{\mathrm{junc}} + \mathcal{L}_{\mathrm{geo}} + \mathcal{L}_{\mathrm{LOI}}$$

- $\mathcal{L}_{\mathrm{HAFM}}$: L1 losses on HAFM parameters.
- $\mathcal{L}_{\mathrm{junc}}$: focal or L2 loss on junction heatmaps and L1 on sub-pixel offsets.
- $\mathcal{L}_{\mathrm{geo}}$: geometric consistency between dense proposals and ground truth.
- $\mathcal{L}_{\mathrm{LOI}}$: cross-entropy for line-of-interest confidence.
- Optimization: Adam optimizer; the initial learning rate is held for the first 35 epochs and decayed for the final 5. Training uses a batch size of 6, and sparse attention uses 4 heads over fixed, non-overlapping windows.
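A minimal sketch of how the four loss terms could be combined in PyTorch. The dict keys, the choice of L2 for the heatmap term, the unit loss weights, and the decay factor in the commented schedule are illustrative assumptions; the paper's exact values are not given here.

```python
import torch
import torch.nn.functional as F

def total_loss(pred: dict, gt: dict, lam=(1.0, 1.0, 1.0, 1.0)) -> torch.Tensor:
    # L_HAFM: L1 on the 5 per-pixel HAFM line parameters
    l_hafm = F.l1_loss(pred["hafm"], gt["hafm"])
    # L_junc: heatmap term (L2 here; a focal loss is the alternative) + L1 offsets
    l_junc = F.mse_loss(pred["heat"], gt["heat"]) \
           + F.l1_loss(pred["offset"], gt["offset"])
    # L_geo: geometric consistency between dense line proposals and ground truth
    l_geo = F.l1_loss(pred["endpoints"], gt["endpoints"])
    # L_LOI: cross-entropy on line-of-interest confidences
    l_loi = F.binary_cross_entropy_with_logits(pred["loi_logits"], gt["loi_labels"])
    return lam[0] * l_hafm + lam[1] * l_junc + lam[2] * l_geo + lam[3] * l_loi

# Schedule matching the reported 35 + 5 epoch split (decay factor assumed):
# opt = torch.optim.Adam(model.parameters(), lr=base_lr)
# sched = torch.optim.lr_scheduler.MultiStepLR(opt, milestones=[35], gamma=0.1)
```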
5. Evaluation Methodology and Results
- Datasets: Wireframe (5,000 train / 462 test), YorkUrban (102 test).
- Metrics:
- Structural Average Precision (sAP) at endpoint thresholds 5/10/15 pixels.
- Endpoint mismatch rate: the percentage of predicted line endpoints without a junction within 15 pixels.
- Efficiency (FPS) on a single RTX 4080.
| Method | Wireframe sAP⁵ | Wireframe sAP¹⁰ | Wireframe sAP¹⁵ | York sAP⁵ | York sAP¹⁰ | York sAP¹⁵ | FPS |
|---|---|---|---|---|---|---|---|
| HAWPv2 [23] | 65.7 | 69.7 | 71.3 | 28.9 | 31.2 | 32.6 | 85.2 |
| Co-PLNet | 68.4 | 72.3 | 73.8 | 32.7 | 35.6 | 36.6 | 76.8 |
Co-PLNet achieves +2.7 sAP⁵ over HAWPv2 on Wireframe and +3.8 on YorkUrban, at a moderate FPS cost (76.8 vs. 85.2); inference remains real-time at the evaluated input resolution.
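As a concrete reference for the mismatch metric, the sketch below computes it with NumPy; the array layouts and function name are illustrative assumptions, not the official evaluation code.

```python
import numpy as np

def endpoint_mismatch_rate(lines: np.ndarray, junctions: np.ndarray,
                           thr: float = 15.0) -> float:
    """lines: (N, 2, 2) endpoint coordinates; junctions: (M, 2) points.

    Returns the percentage of endpoints with no junction within thr pixels.
    """
    endpoints = lines.reshape(-1, 2)                      # (2N, 2)
    # pairwise distances between every endpoint and every junction
    d = np.linalg.norm(endpoints[:, None, :] - junctions[None, :, :], axis=-1)
    matched = d.min(axis=1) <= thr                        # nearest junction in range
    return 100.0 * (1.0 - matched.mean())                 # percent mismatched

lines = np.array([[[0, 0], [10, 10]], [[50, 50], [80, 80]]], dtype=float)
juncs = np.array([[0, 1], [10, 9], [49, 51]], dtype=float)
print(endpoint_mismatch_rate(lines, juncs))  # 25.0: one of four endpoints unmatched
```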
6. Component Contribution: Ablation Analysis
Ablation studies on greyscale images quantify the contribution of each component:
| PL | LP | LF | SA | Dense Attn | Wire sAP¹⁵ | York sAP¹⁵ | Mis (%) | FPS |
|---|---|---|---|---|---|---|---|---|
| – | – | – | – | – | 70.9 | 33.5 | 12.4 | 79.4 |
| ✓ | – | ✓ | – | – | 72.3 | 34.8 | 11.2 | 78.5 |
| – | ✓ | ✓ | – | – | 71.5 | 33.9 | 10.8 | 78.3 |
| ✓ | ✓ | ✓ | – | – | 72.6 | 35.1 | 9.6 | 77.6 |
| ✓ | ✓ | ✓ | ✓ | – | 73.3 | 36.4 | 7.8 | 76.8 |
| ✓ | ✓ | ✓ | – | ✓ | 73.6 | 36.7 | 7.5 | 42.1 |
- PL: point→line prompt; LP: line→point prompt; LF: local fusion; SA: sparse attention; Dense Attn: full (non-windowed) attention in place of SA.
- Both prompts together with local fusion improve sAP¹⁵ by +1.7 (70.9 → 72.6) and reduce the mismatch rate from 12.4% to 9.6%.
- Sparse attention adds a further +0.7 sAP¹⁵ (72.6 → 73.3) at minor speed cost, lowering the mismatch rate to 7.8%.
- Dense attention yields only a marginal +0.3 sAP¹⁵ over sparse attention while nearly halving FPS (76.8 → 42.1).
7. Implications and Prospects
Co-PLNet demonstrates that prompt-guided interaction between line and point predictions improves both structural precision and computational efficiency for wireframe parsing. Notable gains include a reduction in endpoint mismatch rate (from 12.4% to 7.8%), structural AP improvements over the prior state of the art, and real-time operation (76.8 FPS). A plausible implication is enhanced SLAM robustness in low-texture and challenging scenes, as geometric priors are better respected during inference. The collaborative design and efficient, end-to-end workflow make Co-PLNet suitable for deployment in robotics, scene layout estimation, and high-level vision tasks.
Implementation resources, including code and pretrained models, are publicly available at https://github.com/GalacticHogrider/Co-PLNet (Wang et al., 26 Jan 2026).