
PolyR-CNN: End-to-End Polygon Detection

Updated 30 March 2026
  • PolyR-CNN is an end-to-end polygon-capable R-CNN that directly predicts building outlines and bounding boxes from remote sensing imagery.
  • It introduces a novel vertex proposal feature that adaptively modulates RoI features based on polygon geometry to refine detection.
  • The method achieves state-of-the-art accuracy with faster inference speeds while supporting complex structures including buildings with holes.

PolyR-CNN is an end-to-end, polygon-capable Region-based Convolutional Neural Network (R-CNN) explicitly designed to predict both bounding boxes and vectorized polygonal building outlines directly from remotely sensed imagery. This method avoids the conventional multi-stage and complex specialized architectures typical of prior work, using a unified pipeline that leverages only Region of Interest (RoI) features and a novel vertex proposal feature scheme to directly regress ordered building polygons alongside standard detection outputs. PolyR-CNN demonstrates state-of-the-art efficiency and accuracy trade-offs in large-scale benchmarks, supports extraction of buildings with holes, and is agnostic to the use of semantic segmentation priors (Jiao et al., 2024).

1. End-to-End Architecture and Workflow

PolyR-CNN processes an input image through a standard backbone (e.g., ResNet-50, Swin-Base) with a Feature Pyramid Network (FPN), producing multi-scale features $P_2, \ldots, P_5$. The method initializes $N$ proposal boxes (centered at $(0.5, 0.5)$, size $(1, 1)$) and corresponding $N$ proposal polygons, each with $M$ uniformly distributed vertices.

For $L = 6$ sequential layers, the following steps are performed:

  • RoIAlign: extracts an RoI feature $F_{\mathrm{roi}} \in \mathbb{R}^{C \times h \times w}$ for each proposal box.
  • Vertex proposal feature: computes a compact representation $f_{\mathrm{vtx}} \in \mathbb{R}^C$ from the current proposal polygon via a feedforward network (FFN) and a self-attention block.
  • Feature guidance: dynamically modulates two $1 \times 1$ convolutional layers on $F_{\mathrm{roi}}$ using parameters generated from $f_{\mathrm{vtx}}$, followed by a self-attention block for interaction among all $N$ RoI features.
  • Prediction heads (MLPs): four parallel heads output the class score $s \in [0, 1]$, box refinement $\Delta b = (\Delta x, \Delta y, \Delta w, \Delta h)$, polygon coordinates $\{(x_i, y_i)\}_{i=1}^M$, and vertex-validity scores $\{c_i\}_{i=1}^M$.

After the final layer, proposals with $s < 0.05$ are discarded, and vertices with $c_i < 0.5$ are pruned to produce the output polygons.

Process flow diagram (per layer):

Proposal polygon {(x_i, y_i)}
   → FFN + SA → f_vtx
      → DynamicConv(F_roi; f_vtx) → SA
          → {classification, box, polygon, vtx_cls}
This enables direct, recurrently refined regression of complete polygons from features localized to the proposal region.
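The shape of this per-layer refinement loop can be sketched as a minimal NumPy mock-up. Everything below is an illustrative stand-in, not the paper's code: `roi_align`, `vertex_feature`, and `heads` are random stubs for the real RoIAlign, FFN + self-attention, and MLP modules, and an elementwise product replaces the dynamic convolution.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, C, L = 4, 8, 16, 6  # proposals, vertices per polygon, channels, layers (toy sizes)

def roi_align(boxes):
    # Stub: one C-dim feature vector per proposal box (the real op yields C x h x w maps).
    return rng.standard_normal((len(boxes), C))

def vertex_feature(polys):
    # Stub for the FFN + self-attention block: flatten each polygon, project to C dims.
    return np.tanh(polys.reshape(len(polys), -1) @ rng.standard_normal((2 * M, C)))

def heads(feat):
    # Stub prediction heads: class score, polygon offsets, vertex-validity scores.
    score = 1.0 / (1.0 + np.exp(-feat[:, 0]))
    d_poly = 0.01 * np.tanh(feat).reshape(len(feat), M, 2)  # works because C == 2*M here
    vtx = 1.0 / (1.0 + np.exp(-feat[:, :M]))
    return score, d_poly, vtx

# Initialization: boxes centered at (0.5, 0.5) with size (1, 1); polygons with
# M uniformly distributed vertices around the box center.
boxes = np.tile([0.5, 0.5, 1.0, 1.0], (N, 1))
theta = np.linspace(0.0, 2.0 * np.pi, M, endpoint=False)
polys = np.tile(0.5 + 0.25 * np.c_[np.cos(theta), np.sin(theta)], (N, 1, 1))

for _ in range(L):  # L sequential refinement layers
    f_roi = roi_align(boxes)
    f_vtx = vertex_feature(polys)
    guided = f_roi * np.tanh(f_vtx)   # stand-in for dynamic-conv guidance + self-attention
    scores, d_poly, vtx_scores = heads(guided)
    polys = polys + d_poly            # recurrent polygon refinement

keep = scores >= 0.05                 # discard low-confidence proposals
valid_vertices = vtx_scores >= 0.5    # prune invalid vertices per polygon
```

The point of the sketch is the control flow: each layer re-reads the RoI feature conditioned on the current polygon, so polygon estimates are refined recurrently rather than predicted in one shot.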

2. Vertex Proposal Feature Construction

At each refinement layer $t$, the current polygon $V^{(t)} = \{v_i^{(t)} = (x_i, y_i)\}_{i=1}^M$ is flattened and mapped to a high-dimensional vector:

$$z = \mathrm{GELU}(W_1 \, \mathrm{vec}(V^{(t)}) + b_1) \in \mathbb{R}^{d_1}$$

$$f_{\mathrm{vtx}}^{(t)} = \mathrm{SA}(W_2 z + b_2) \in \mathbb{R}^{C}$$

with $d_1 = C = 256$. This vertex feature $f_{\mathrm{vtx}}$ generates dynamic kernel weights for $1 \times 1$ convolutions, adaptively conditioning the RoI feature on the polygon geometry. The modulated feature is:

$$\{K^{(t)}, b_K^{(t)}\} = \mathrm{Linear}(f_{\mathrm{vtx}}^{(t)}), \qquad F_{\mathrm{guided}}^{(t)} = \mathrm{SA}(\mathrm{Conv}_{1 \times 1}(F_{\mathrm{roi}}; K^{(t)}, b_K^{(t)}))$$

This guidance sharpens attention around the predicted polygon structure, empirically focusing features at corners and relevant vertices.
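Since a $1 \times 1$ convolution is just a per-pixel linear map over channels, the dynamic guidance step can be sketched in NumPy. The random matrices below stand in for the learned FFN and `Linear` projections (with a `tanh` in place of GELU, and self-attention omitted); only the dataflow mirrors the equations above.

```python
import numpy as np

rng = np.random.default_rng(1)
C, h, w, M = 8, 7, 7, 6  # toy channel count, RoI size, vertex count

# Current proposal polygon, flattened and lifted to d1 = C dims.
poly = rng.uniform(size=(M, 2))
W1 = rng.standard_normal((2 * M, C))
f_vtx = np.tanh(poly.reshape(-1) @ W1)   # (C,) vertex proposal feature

# Linear(f_vtx) -> dynamic 1x1 kernel K (C x C) and bias b_K (C,).
W_k = rng.standard_normal((C, C * C)) / C
W_b = rng.standard_normal((C, C)) / C
K = (f_vtx @ W_k).reshape(C, C)
b_K = f_vtx @ W_b

# A 1x1 conv is a matrix multiply over the channel axis at every spatial location.
F_roi = rng.standard_normal((C, h, w))
F_guided = np.einsum('oc,chw->ohw', K, F_roi) + b_K[:, None, None]
```

Because `K` and `b_K` are functions of the polygon, every proposal filters its own RoI feature differently, which is what lets the guidance concentrate on the currently predicted corners.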

3. Training Losses and Hungarian Assignment

Training relies on a set-based Hungarian matching between the $N$ predictions and the $G$ ground-truth polygons, minimizing the following cost for each pair:

$$\mathcal{C}(\hat{s}, \hat{b}; s, b) = \lambda_{\mathrm{cls}} L_{\mathrm{cls}}(\hat{s}, s) + \lambda_{\mathrm{box}} L_1(\hat{b}, b) + \lambda_{\mathrm{giou}} L_{\mathrm{GIoU}}(\hat{b}, b)$$

where $L_{\mathrm{cls}}$ is a focal loss for classification, $L_1$ is a smooth-$L_1$ or $L_1$ box regression loss, and $L_{\mathrm{GIoU}}$ is the generalized IoU loss.

The full objective over all $N$ predictions is:

$$\mathcal{L} = \lambda_{\mathrm{cls}} L_{\mathrm{cls}} + \lambda_{\mathrm{box}} L_1 + \lambda_{\mathrm{giou}} L_{\mathrm{GIoU}} + \lambda_{\mathrm{poly}} \sum_{i=1}^{M} \lVert \hat{v}_i - v_i \rVert_1 + \lambda_{\mathrm{vtx}} \sum_{i=1}^{M} L_{\mathrm{cls}}(\hat{c}_i, c_i)$$

with typical loss weights $\lambda_{\mathrm{cls}} = 2$, $\lambda_{\mathrm{box}} = 5$, $\lambda_{\mathrm{giou}} = 2$, $\lambda_{\mathrm{poly}} = 5$, $\lambda_{\mathrm{vtx}} = 1$.

Set-based matching is conducted as in DETR and Sparse R-CNN, consistent with recent object detection literature.
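A toy version of this set-based assignment is sketched below. For brevity it uses only the score and $L_1$ box terms of the cost (focal and GIoU terms omitted), and replaces the Hungarian algorithm with brute-force search over permutations, which is equivalent for the tiny sizes shown; the function names are illustrative.

```python
import numpy as np
from itertools import permutations

def match_cost(pred_scores, pred_boxes, gt_boxes, lam_cls=2.0, lam_box=5.0):
    """Pairwise cost matrix: -lam_cls * score + lam_box * L1(box, gt_box)."""
    N, G = len(pred_boxes), len(gt_boxes)
    C = np.zeros((N, G))
    for i in range(N):
        for j in range(G):
            C[i, j] = (-lam_cls * pred_scores[i]
                       + lam_box * np.abs(pred_boxes[i] - gt_boxes[j]).sum())
    return C

def assign(C):
    """Minimum-cost one-to-one assignment by exhaustive search (toy N only)."""
    N, G = C.shape
    best, best_perm = np.inf, None
    for perm in permutations(range(N), G):      # one distinct prediction per GT
        cost = sum(C[p, j] for j, p in enumerate(perm))
        if cost < best:
            best, best_perm = cost, perm
    return best_perm  # best_perm[j] = index of the prediction matched to GT j

scores = np.array([0.9, 0.2, 0.8])
preds = np.array([[0.1, 0.1, 0.3, 0.3], [0.5, 0.5, 0.2, 0.2], [0.8, 0.8, 0.1, 0.1]])
gts = np.array([[0.8, 0.8, 0.1, 0.1], [0.1, 0.1, 0.3, 0.3]])
print(assign(match_cost(scores, preds, gts)))  # → (2, 0)
```

Only the matched pairs contribute the polygon and vertex terms of the full loss; unmatched predictions are pushed toward the background class.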

4. Handling Structures With Holes

PolyR-CNN treats each valid polygon (outer ring or hole) as a separate instance. At inference, hole polygons are merged into their parents via a spatial join: for each predicted hole $H$, if a larger polygon $O$ exists such that the centroid of $H$ lies within $O$, then $H$ is assigned as a hole of $O$. The final outline is given by the set difference $O \setminus H$. Point-in-polygon tests use winding-number or ray-crossing methods.
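The ray-crossing test and the centroid-based hole assignment described above can be written in a few lines of plain Python (a minimal sketch of the standard even-odd rule, not the authors' implementation):

```python
def point_in_polygon(pt, poly):
    """Ray-crossing (even-odd) test: cast a ray toward +x and count edge crossings."""
    x, y = pt
    inside = False
    n = len(poly)
    for i in range(n):
        (x1, y1), (x2, y2) = poly[i], poly[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge straddles the horizontal ray through pt
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x_cross > x:
                inside = not inside
    return inside

def centroid(poly):
    """Vertex average; adequate for deciding hole membership of convex-ish holes."""
    xs, ys = zip(*poly)
    return (sum(xs) / len(poly), sum(ys) / len(poly))

outer = [(0, 0), (10, 0), (10, 10), (0, 10)]
hole = [(4, 4), (6, 4), (6, 6), (4, 6)]
print(point_in_polygon(centroid(hole), outer))  # → True: hole is assigned to outer
```

In production code a geometry library would compute the final $O \setminus H$ difference; the test above only decides the parent-child pairing.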

5. Empirical Evaluation and Comparison

PolyR-CNN is benchmarked on the CrowdAI and Inria datasets:

CrowdAI (300×300 patches, MS-COCO format)

| Method | Backbone | AP | AP₅₀ | AP₇₅ | AP_boundary |
|---|---|---|---|---|---|
| PolyBuilding '22 | ResNet-50 | 78.7 | 96.3 | 89.2 | – |
| PolyWorld '22 | R2U-Net | 63.3 | 88.6 | 70.5 | 50.0 |
| PolyR-CNN | Swin-Base | 79.2 | 97.4 | 90.0 | 63.3 |
| PolyR-CNN | ResNet-50 | 71.1 | 93.8 | 82.9 | 50.0 |

| Method | Backbone | Epochs | FLOPs (G) | FPS |
|---|---|---|---|---|
| PolyWorld '22 | – | – | 181.23 | 8.4 |
| PolyBuilding '22 | ResNet-50 | 200 | 21.45 | 14.3 |
| PolyR-CNN | ResNet-50 | 100 | 21.91 | 32.7 |
| PolyR-CNN | Swin-Base | 100 | 46.55 | 20.5 |

With ResNet-50, PolyR-CNN achieves 71.1 AP at 32.7 FPS (about 4× faster than PolyWorld) in 100 epochs. Using Swin-Base, it reaches 79.2 AP, matching the most accurate methods while operating at more than twice their speed.

Inria (512×512 patches, mask-IoU evaluation)

| Method | Needs sem.-seg.? | IoU | Acc. | FPS |
|---|---|---|---|---|
| Zorzi et al. '19 | ✓ | 59.81 | 93.92 | – |
| HiSup '22 | ✓ | 75.53 | 96.27 | 16.4 |
| PolyR-CNN | – | 68.35 | 95.09 | 35.7 |

PolyR-CNN, without any semantic segmentation prior, surpasses classical polygonization in accuracy and more than doubles the speed of the best segmentation-based pipeline.

6. Implementation Specifics

Key components and parameters:

  • Backbones: ResNet-50 (ImageNet-1K pretrained), ResNeXt-101, Swin-Base (ImageNet-21K pretrained).
  • FPN: $C = 256$ channels, pyramid levels $P_2$–$P_5$.
  • Proposals: $N = 100$, $M = 96$ (CrowdAI); $N = 300$, $M = 50$ (Inria).
  • Optimization: AdamW, initial learning rate $2.5 \times 10^{-5}$, batch size 16 (2× A40 GPUs).
  • Training schedules: CrowdAI, 100 epochs (ResNet-50/Swin-Base) with LR drops at epochs 54 and 83, or 140 epochs (ResNeXt-101); Inria, 100 epochs with an LR drop at epoch 89.
  • Data augmentation: horizontal and vertical flips.
  • Hungarian matching: as in DETR and Sparse R-CNN.
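The hyperparameters listed above can be collected into a single configuration dict for reference. The field names below are ours, chosen for readability; only the values come from the paper's CrowdAI setup.

```python
# Illustrative configuration summary; field names are hypothetical, values from the paper.
POLYR_CNN_CROWDAI = {
    "backbone": "resnet50",            # alternatives: "swin_base", "resnext101"
    "fpn_channels": 256,
    "fpn_levels": ["P2", "P3", "P4", "P5"],
    "num_proposals": 100,              # N (300 for Inria)
    "num_vertices": 96,                # M (50 for Inria)
    "refinement_layers": 6,            # L
    "optimizer": {"name": "adamw", "lr": 2.5e-5, "batch_size": 16},
    "epochs": 100,
    "lr_drop_epochs": [54, 83],
    "augmentation": ["hflip", "vflip"],
    "loss_weights": {"cls": 2.0, "box": 5.0, "giou": 2.0, "poly": 5.0, "vtx": 1.0},
    "score_threshold": 0.05,           # proposal filter at inference
    "vertex_threshold": 0.5,           # vertex-validity filter at inference
}
```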

PolyR-CNN thus integrates efficient architectural design, a principled geometric feature coupling, and streamlined training and inference procedures, establishing new speed-accuracy Pareto frontiers for end-to-end polygonal building outline extraction from high-resolution imagery (Jiao et al., 2024).
