
PolyR-CNN: End-to-End Polygon Detection

Updated 30 March 2026
  • PolyR-CNN is an end-to-end polygon-capable R-CNN that directly predicts building outlines and bounding boxes from remote sensing imagery.
  • It introduces a novel vertex proposal feature that adaptively modulates RoI features based on polygon geometry to refine detection.
  • The method achieves state-of-the-art accuracy with faster inference speeds while supporting complex structures including buildings with holes.

PolyR-CNN is an end-to-end, polygon-capable Region-based Convolutional Neural Network (R-CNN) explicitly designed to predict both bounding boxes and vectorized polygonal building outlines directly from remotely sensed imagery. This method avoids the conventional multi-stage and complex specialized architectures typical of prior work, using a unified pipeline that leverages only Region of Interest (RoI) features and a novel vertex proposal feature scheme to directly regress ordered building polygons alongside standard detection outputs. PolyR-CNN demonstrates state-of-the-art efficiency and accuracy trade-offs in large-scale benchmarks, supports extraction of buildings with holes, and is agnostic to the use of semantic segmentation priors (Jiao et al., 2024).

1. End-to-End Architecture and Workflow

PolyR-CNN processes an input image through a standard backbone (e.g., ResNet-50, Swin-Base) with a Feature Pyramid Network (FPN), producing multi-scale features $P_2, \ldots, P_5$. The method initializes $N$ proposal boxes (centered at $(0.5, 0.5)$, size $(1, 1)$) and corresponding $N$ proposal polygons, each with $M$ uniformly distributed vertices.

For $L = 6$ sequential layers, the following steps are performed:

  • RoIAlign: extracts an RoI feature $F_{\mathrm{roi}} \in \mathbb{R}^{C \times h \times w}$ for each proposal box.
  • Vertex proposal feature: computes a compact representation $f_{\mathrm{vtx}} \in \mathbb{R}^C$ from the current proposal polygon via a feedforward network (FFN) and a self-attention block.
  • Feature guidance: dynamically modulates two $1 \times 1$ convolutional layers on $F_{\mathrm{roi}}$ using parameters generated from $f_{\mathrm{vtx}}$, followed by a self-attention block for interaction among all $N$ RoI features.
  • Prediction heads (MLPs): four parallel heads output the class score $s \in [0, 1]$, box refinement $\Delta b = (\Delta x, \Delta y, \Delta w, \Delta h)$, polygon coordinates $\{(x_i, y_i)\}_{i=1}^M$, and vertex-validity scores $\{c_i\}_{i=1}^M$.

After the final layer, proposals with $s < 0.05$ are discarded, and vertices with $c_i < 0.5$ are pruned to produce the output polygons.

Process flow diagram (per layer):

Proposal polygon {(x_i, y_i)}
   → FFN + SA → f_vtx
      → DynamicConv(F_roi; f_vtx) → SA
          → {classification, box, polygon, vtx_cls}
This enables direct, recurrently refined regression of complete polygons from features localized to the proposal region.
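The shape of this per-layer refinement loop can be sketched as a minimal NumPy mock-up. Everything below is an illustrative stand-in, not the paper's code: `roi_align`, `vertex_feature`, and `heads` are random stubs for the real RoIAlign, FFN + self-attention, and MLP modules, and an elementwise product replaces the dynamic convolution.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, C, L = 4, 8, 16, 6  # proposals, vertices per polygon, channels, layers (toy sizes)

def roi_align(boxes):
    # Stub: one C-dim feature vector per proposal box (the real op yields C x h x w maps).
    return rng.standard_normal((len(boxes), C))

def vertex_feature(polys):
    # Stub for the FFN + self-attention block: flatten each polygon, project to C dims.
    return np.tanh(polys.reshape(len(polys), -1) @ rng.standard_normal((2 * M, C)))

def heads(feat):
    # Stub prediction heads: class score, polygon offsets, vertex-validity scores.
    score = 1.0 / (1.0 + np.exp(-feat[:, 0]))
    d_poly = 0.01 * np.tanh(feat).reshape(len(feat), M, 2)  # works because C == 2*M here
    vtx = 1.0 / (1.0 + np.exp(-feat[:, :M]))
    return score, d_poly, vtx

# Initialization: boxes centered at (0.5, 0.5) with size (1, 1); polygons with
# M uniformly distributed vertices around the box center.
boxes = np.tile([0.5, 0.5, 1.0, 1.0], (N, 1))
theta = np.linspace(0.0, 2.0 * np.pi, M, endpoint=False)
polys = np.tile(0.5 + 0.25 * np.c_[np.cos(theta), np.sin(theta)], (N, 1, 1))

for _ in range(L):  # L sequential refinement layers
    f_roi = roi_align(boxes)
    f_vtx = vertex_feature(polys)
    guided = f_roi * np.tanh(f_vtx)   # stand-in for dynamic-conv guidance + self-attention
    scores, d_poly, vtx_scores = heads(guided)
    polys = polys + d_poly            # recurrent polygon refinement

keep = scores >= 0.05                 # discard low-confidence proposals
valid_vertices = vtx_scores >= 0.5    # prune invalid vertices per polygon
```

The point of the sketch is the control flow: each layer re-reads the RoI feature conditioned on the current polygon, so polygon estimates are refined recurrently rather than predicted in one shot.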

2. Vertex Proposal Feature Construction

At each refinement layer $t$, the current polygon $V^{(t)} = \{v_i^{(t)} = (x_i, y_i)\}_{i=1}^M$ is flattened and mapped to a high-dimensional vector:

$$z = \mathrm{GELU}(W_1 \, \mathrm{vec}(V^{(t)}) + b_1) \in \mathbb{R}^{d_1}$$

$$f_{\mathrm{vtx}}^{(t)} = \mathrm{SA}(W_2 z + b_2) \in \mathbb{R}^{C}$$

with $d_1 = C = 256$. This vertex feature $f_{\mathrm{vtx}}$ generates dynamic kernel weights for $1 \times 1$ convolutions, adaptively conditioning the RoI feature on the polygon geometry. The modulated feature is:

$$\{K^{(t)}, b_K^{(t)}\} = \mathrm{Linear}(f_{\mathrm{vtx}}^{(t)}), \qquad F_{\mathrm{guided}}^{(t)} = \mathrm{SA}(\mathrm{Conv}_{1 \times 1}(F_{\mathrm{roi}}; K^{(t)}, b_K^{(t)}))$$

This guidance sharpens attention around the predicted polygon structure, empirically focusing features at corners and relevant vertices.
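Since a $1 \times 1$ convolution is just a per-pixel linear map over channels, the dynamic guidance step can be sketched in NumPy. The random matrices below stand in for the learned FFN and `Linear` projections (with a `tanh` in place of GELU, and self-attention omitted); only the dataflow mirrors the equations above.

```python
import numpy as np

rng = np.random.default_rng(1)
C, h, w, M = 8, 7, 7, 6  # toy channel count, RoI size, vertex count

# Current proposal polygon, flattened and lifted to d1 = C dims.
poly = rng.uniform(size=(M, 2))
W1 = rng.standard_normal((2 * M, C))
f_vtx = np.tanh(poly.reshape(-1) @ W1)   # (C,) vertex proposal feature

# Linear(f_vtx) -> dynamic 1x1 kernel K (C x C) and bias b_K (C,).
W_k = rng.standard_normal((C, C * C)) / C
W_b = rng.standard_normal((C, C)) / C
K = (f_vtx @ W_k).reshape(C, C)
b_K = f_vtx @ W_b

# A 1x1 conv is a matrix multiply over the channel axis at every spatial location.
F_roi = rng.standard_normal((C, h, w))
F_guided = np.einsum('oc,chw->ohw', K, F_roi) + b_K[:, None, None]
```

Because `K` and `b_K` are functions of the polygon, every proposal filters its own RoI feature differently, which is what lets the guidance concentrate on the currently predicted corners.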

3. Training Losses and Hungarian Assignment

Training relies on a set-based Hungarian matching between the $N$ predictions and the $G$ ground-truth polygons, minimizing the following cost for each pair:

$$\mathcal{C}(\hat{s}, \hat{b}; s, b) = \lambda_{\mathrm{cls}} L_{\mathrm{cls}}(\hat{s}, s) + \lambda_{\mathrm{box}} L_1(\hat{b}, b) + \lambda_{\mathrm{giou}} L_{\mathrm{GIoU}}(\hat{b}, b)$$

where $L_{\mathrm{cls}}$ is a focal loss for classification, $L_1$ is a smooth-$L_1$ or $L_1$ box regression loss, and $L_{\mathrm{GIoU}}$ is the generalized IoU loss.

The full objective over all $N$ predictions is:

$$\mathcal{L} = \lambda_{\mathrm{cls}} L_{\mathrm{cls}} + \lambda_{\mathrm{box}} L_1 + \lambda_{\mathrm{giou}} L_{\mathrm{GIoU}} + \lambda_{\mathrm{poly}} \sum_{i=1}^{M} \lVert \hat{v}_i - v_i \rVert_1 + \lambda_{\mathrm{vtx}} \sum_{i=1}^{M} L_{\mathrm{cls}}(\hat{c}_i, c_i)$$

with typical loss weights $\lambda_{\mathrm{cls}} = 2$, $\lambda_{\mathrm{box}} = 5$, $\lambda_{\mathrm{giou}} = 2$, $\lambda_{\mathrm{poly}} = 5$, $\lambda_{\mathrm{vtx}} = 1$.

Set-based matching is conducted as in DETR and Sparse R-CNN, consistent with recent object detection literature.
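A toy version of this set-based assignment is sketched below. For brevity it uses only the score and $L_1$ box terms of the cost (focal and GIoU terms omitted), and replaces the Hungarian algorithm with brute-force search over permutations, which is equivalent for the tiny sizes shown; the function names are illustrative.

```python
import numpy as np
from itertools import permutations

def match_cost(pred_scores, pred_boxes, gt_boxes, lam_cls=2.0, lam_box=5.0):
    """Pairwise cost matrix: -lam_cls * score + lam_box * L1(box, gt_box)."""
    N, G = len(pred_boxes), len(gt_boxes)
    C = np.zeros((N, G))
    for i in range(N):
        for j in range(G):
            C[i, j] = (-lam_cls * pred_scores[i]
                       + lam_box * np.abs(pred_boxes[i] - gt_boxes[j]).sum())
    return C

def assign(C):
    """Minimum-cost one-to-one assignment by exhaustive search (toy N only)."""
    N, G = C.shape
    best, best_perm = np.inf, None
    for perm in permutations(range(N), G):      # one distinct prediction per GT
        cost = sum(C[p, j] for j, p in enumerate(perm))
        if cost < best:
            best, best_perm = cost, perm
    return best_perm  # best_perm[j] = index of the prediction matched to GT j

scores = np.array([0.9, 0.2, 0.8])
preds = np.array([[0.1, 0.1, 0.3, 0.3], [0.5, 0.5, 0.2, 0.2], [0.8, 0.8, 0.1, 0.1]])
gts = np.array([[0.8, 0.8, 0.1, 0.1], [0.1, 0.1, 0.3, 0.3]])
print(assign(match_cost(scores, preds, gts)))  # → (2, 0)
```

Only the matched pairs contribute the polygon and vertex terms of the full loss; unmatched predictions are pushed toward the background class.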

4. Handling Structures With Holes

PolyR-CNN treats each valid polygon (outer ring or hole) as a separate instance. At inference, hole polygons are merged into their parents via a spatial join: for each predicted hole $H$, if a larger polygon $O$ exists such that the centroid of $H$ lies within $O$, then $H$ is assigned as a hole of $O$. The final outline is given by the set difference $O \setminus H$. Point-in-polygon tests use winding-number or ray-crossing methods.
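The ray-crossing test and the centroid-based hole assignment described above can be written in a few lines of plain Python (a minimal sketch of the standard even-odd rule, not the authors' implementation):

```python
def point_in_polygon(pt, poly):
    """Ray-crossing (even-odd) test: cast a ray toward +x and count edge crossings."""
    x, y = pt
    inside = False
    n = len(poly)
    for i in range(n):
        (x1, y1), (x2, y2) = poly[i], poly[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge straddles the horizontal ray through pt
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x_cross > x:
                inside = not inside
    return inside

def centroid(poly):
    """Vertex average; adequate for deciding hole membership of convex-ish holes."""
    xs, ys = zip(*poly)
    return (sum(xs) / len(poly), sum(ys) / len(poly))

outer = [(0, 0), (10, 0), (10, 10), (0, 10)]
hole = [(4, 4), (6, 4), (6, 6), (4, 6)]
print(point_in_polygon(centroid(hole), outer))  # → True: hole is assigned to outer
```

In production code a geometry library would compute the final $O \setminus H$ difference; the test above only decides the parent-child pairing.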

5. Empirical Evaluation and Comparison

PolyR-CNN is benchmarked on the CrowdAI and Inria datasets:

CrowdAI (300×300 patches, MS-COCO format)

| Method | Backbone | AP | AP₅₀ | AP₇₅ | AP_boundary |
|---|---|---|---|---|---|
| PolyBuilding '22 | ResNet-50 | 78.7 | 96.3 | 89.2 | – |
| PolyWorld '22 | R2U-Net | 63.3 | 88.6 | 70.5 | 50.0 |
| PolyR-CNN | Swin-Base | 79.2 | 97.4 | 90.0 | 63.3 |
| PolyR-CNN | ResNet-50 | 71.1 | 93.8 | 82.9 | 50.0 |

| Method | Backbone | Epochs | FLOPs (G) | FPS |
|---|---|---|---|---|
| PolyWorld '22 | – | – | 181.23 | 8.4 |
| PolyBuilding '22 | ResNet-50 | 200 | 21.45 | 14.3 |
| PolyR-CNN | ResNet-50 | 100 | 21.91 | 32.7 |
| PolyR-CNN | Swin-Base | 100 | 46.55 | 20.5 |

With ResNet-50, PolyR-CNN achieves 71.1 AP at 32.7 FPS (about 4× faster than PolyWorld) in 100 epochs. Using Swin-Base, it reaches 79.2 AP, matching the most accurate methods while operating at more than twice their speed.

Inria (512×512 patches, mask-IoU evaluation)

| Method | Needs sem.-seg.? | IoU | Acc. | FPS |
|---|---|---|---|---|
| Zorzi et al. '19 | ✓ | 59.81 | 93.92 | – |
| HiSup '22 | ✓ | 75.53 | 96.27 | 16.4 |
| PolyR-CNN | – | 68.35 | 95.09 | 35.7 |

PolyR-CNN, without any semantic segmentation prior, surpasses classical polygonization in accuracy and more than doubles the speed of the best segmentation-based pipeline.

6. Implementation Specifics

Key components and parameters:

  • Backbones: ResNet-50 (ImageNet-1K pretrained), ResNeXt-101, Swin-Base (ImageNet-21K pretrained).
  • FPN: $C = 256$ channels, pyramid levels $P_2$–$P_5$.
  • Proposals: $N = 100$, $M = 96$ (CrowdAI); $N = 300$, $M = 50$ (Inria).
  • Optimization: AdamW, initial learning rate $2.5 \times 10^{-5}$, batch size 16 (2× A40 GPUs).
  • Training schedules: CrowdAI, 100 epochs (ResNet-50/Swin-Base) with LR drops at epochs 54 and 83, or 140 epochs (ResNeXt-101); Inria, 100 epochs with an LR drop at epoch 89.
  • Data augmentation: horizontal and vertical flips.
  • Hungarian matching: as in DETR and Sparse R-CNN.
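The hyperparameters listed above can be collected into a single configuration dict for reference. The field names below are ours, chosen for readability; only the values come from the paper's CrowdAI setup.

```python
# Illustrative configuration summary; field names are hypothetical, values from the paper.
POLYR_CNN_CROWDAI = {
    "backbone": "resnet50",            # alternatives: "swin_base", "resnext101"
    "fpn_channels": 256,
    "fpn_levels": ["P2", "P3", "P4", "P5"],
    "num_proposals": 100,              # N (300 for Inria)
    "num_vertices": 96,                # M (50 for Inria)
    "refinement_layers": 6,            # L
    "optimizer": {"name": "adamw", "lr": 2.5e-5, "batch_size": 16},
    "epochs": 100,
    "lr_drop_epochs": [54, 83],
    "augmentation": ["hflip", "vflip"],
    "loss_weights": {"cls": 2.0, "box": 5.0, "giou": 2.0, "poly": 5.0, "vtx": 1.0},
    "score_threshold": 0.05,           # proposal filter at inference
    "vertex_threshold": 0.5,           # vertex-validity filter at inference
}
```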

PolyR-CNN thus integrates efficient architectural design, a principled geometric feature coupling, and streamlined training and inference procedures, establishing new speed-accuracy Pareto frontiers for end-to-end polygonal building outline extraction from high-resolution imagery (Jiao et al., 2024).
