PolyR-CNN: End-to-End Polygon Detection
- PolyR-CNN is an end-to-end polygon-capable R-CNN that directly predicts building outlines and bounding boxes from remote sensing imagery.
- It introduces a novel vertex proposal feature that adaptively modulates RoI features based on polygon geometry to refine detection.
- The method achieves state-of-the-art accuracy with faster inference speeds while supporting complex structures including buildings with holes.
PolyR-CNN is an end-to-end, polygon-capable Region-based Convolutional Neural Network (R-CNN) explicitly designed to predict both bounding boxes and vectorized polygonal building outlines directly from remotely sensed imagery. This method avoids the conventional multi-stage and complex specialized architectures typical of prior work, using a unified pipeline that leverages only Region of Interest (RoI) features and a novel vertex proposal feature scheme to directly regress ordered building polygons alongside standard detection outputs. PolyR-CNN demonstrates state-of-the-art efficiency and accuracy trade-offs in large-scale benchmarks, supports extraction of buildings with holes, and is agnostic to the use of semantic segmentation priors (Jiao et al., 2024).
1. End-to-End Architecture and Workflow
PolyR-CNN processes an input image through a standard backbone (e.g., ResNet-50, Swin-Base) with a Feature Pyramid Network (FPN), producing multi-scale features. The method initializes a fixed set of learnable proposal boxes and corresponding proposal polygons, each with a fixed number of uniformly distributed vertices.
Across a stack of sequential refinement layers, the following steps are performed:
- RoIAlign: Extracts an RoI feature `F_roi` for each proposal box.
- Vertex Proposal Feature: Computes a compact representation `f_vtx` from the current proposal polygon via a feedforward network (FFN) and a self-attention block.
- Feature Guidance: Dynamically modulates two convolutional layers applied to `F_roi` using parameters generated from `f_vtx`, followed by a self-attention block for interaction among all RoI features.
- Prediction Heads (MLPs): Four parallel heads output the class score, box refinement, polygon vertex coordinates, and per-vertex validity scores.
After the final layer, proposals whose class score falls below a confidence threshold are discarded, and vertices whose validity score falls below a threshold are pruned to produce the output polygons.
Process flow (per refinement layer):

```
Proposal polygon {(x_i, y_i)}
  → FFN + SA → f_vtx
  → DynamicConv(F_roi; f_vtx) → SA
  → {classification, box, polygon, vtx_cls}
```
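The final pruning step can be sketched directly: filter proposals by class score, then drop invalid vertices from each surviving polygon. This is a minimal NumPy sketch; the thresholds `tau_cls` and `tau_vtx` are illustrative, not the paper's values.

```python
import numpy as np

def prune_outputs(class_scores, vertex_scores, polygons,
                  tau_cls=0.5, tau_vtx=0.5):
    """Inference-time pruning: discard low-confidence proposals, then
    remove low-validity vertices from each surviving polygon.

    class_scores : (P,)       per-proposal confidence
    vertex_scores: (P, Nv)    per-vertex validity scores
    polygons     : (P, Nv, 2) predicted vertex coordinates
    tau_cls / tau_vtx are illustrative thresholds.
    """
    keep = class_scores >= tau_cls
    results = []
    for poly, vscores in zip(polygons[keep], vertex_scores[keep]):
        results.append(poly[vscores >= tau_vtx])  # variable-length polygon
    return results

scores = np.array([0.9, 0.2, 0.7])
vscores = np.array([[0.9, 0.1, 0.8, 0.9],
                    [0.9, 0.9, 0.9, 0.9],
                    [0.6, 0.6, 0.3, 0.2]])
polys = np.arange(3 * 4 * 2, dtype=float).reshape(3, 4, 2)
out = prune_outputs(scores, vscores, polys)
print(len(out), [p.shape[0] for p in out])  # 2 surviving proposals, 3 and 2 vertices
```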
2. Vertex Proposal Feature Construction
At each refinement layer, the current proposal polygon $\{(x_i, y_i)\}_{i=1}^{N_v}$ is flattened and mapped to a high-dimensional vector:

$$f_{\text{vtx}} = \mathrm{SA}\big(\mathrm{FFN}([x_1, y_1, \ldots, x_{N_v}, y_{N_v}])\big), \qquad f_{\text{vtx}} \in \mathbb{R}^{d}.$$

This vertex feature generates dynamic kernel weights for two $1 \times 1$ convolutions, adaptively conditioning the RoI feature on polygon geometry. The modulated feature is

$$\tilde{F}_{\text{roi}} = \mathrm{Conv}_{1 \times 1}\big(\mathrm{Conv}_{1 \times 1}(F_{\text{roi}};\, W_1(f_{\text{vtx}}));\, W_2(f_{\text{vtx}})\big).$$

This guidance sharpens attention around the predicted polygon structure, empirically focusing features at corners and relevant vertices.
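The dynamic guidance can be sketched in NumPy: the vertex proposal feature generates the weights of two 1×1 convolutions, which amount to per-pixel channel mixing of the RoI feature. All dimensions, the embedding, and the weight-generating matrices below are illustrative stand-ins, and the paper's FFN/self-attention blocks are collapsed into single linear maps.

```python
import numpy as np

rng = np.random.default_rng(0)
C, S, Nv, d_mid = 256, 7, 96, 64   # channels, RoI size, vertices, bottleneck (illustrative)

# Flatten the proposal polygon and embed it (stand-in for FFN + self-attention).
polygon = rng.random((Nv, 2))
W_embed = rng.standard_normal((2 * Nv, C)) * 0.01
f_vtx = np.tanh(polygon.reshape(-1) @ W_embed)          # (C,) vertex proposal feature

# f_vtx generates the parameters of two 1x1 convolutions (dynamic kernels).
W_gen1 = rng.standard_normal((C, C * d_mid)) * 0.01
W_gen2 = rng.standard_normal((C, d_mid * C)) * 0.01
k1 = (f_vtx @ W_gen1).reshape(C, d_mid)                 # first dynamic kernel
k2 = (f_vtx @ W_gen2).reshape(d_mid, C)                 # second dynamic kernel

# A 1x1 conv on a (S*S, C) flattened feature map is a matrix product:
F_roi = rng.random((S * S, C))                          # flattened RoI feature
F_mod = np.maximum(F_roi @ k1, 0.0) @ k2                # ReLU between the two convs
print(F_mod.shape)   # same spatial layout, channels now conditioned on geometry
```

Because the kernels are functions of the polygon, each proposal's RoI feature is filtered differently, which is what lets the features concentrate around predicted vertices.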
3. Training Losses and Hungarian Assignment
Training relies on a set-based Hungarian matching between predictions and ground-truth polygons, minimizing the following cost for each pair:

$$\mathcal{C} = \lambda_{\text{cls}}\,\mathcal{L}_{\text{cls}} + \lambda_{L1}\,\mathcal{L}_{L1} + \lambda_{\text{giou}}\,\mathcal{L}_{\text{giou}},$$

where $\mathcal{L}_{\text{cls}}$ is a focal loss for classification, $\mathcal{L}_{L1}$ is a smooth-$\ell_1$ box regression loss, and $\mathcal{L}_{\text{giou}}$ is the generalized IoU loss.
The full objective over all predictions adds polygon coordinate regression and vertex classification terms:

$$\mathcal{L} = \lambda_{\text{cls}}\,\mathcal{L}_{\text{cls}} + \lambda_{L1}\,\mathcal{L}_{L1} + \lambda_{\text{giou}}\,\mathcal{L}_{\text{giou}} + \lambda_{\text{poly}}\,\mathcal{L}_{\text{poly}} + \lambda_{\text{vtx}}\,\mathcal{L}_{\text{vtx}},$$

with scalar weights $\lambda$ balancing the five terms.
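The generalized IoU term can be computed for axis-aligned boxes as follows; this is the standard GIoU formulation, not code from the paper.

```python
def giou(box_a, box_b):
    """Generalized IoU for axis-aligned boxes (x1, y1, x2, y2):
    IoU minus the fraction of the smallest enclosing box not covered
    by the union. Ranges over (-1, 1]; the loss used is 1 - GIoU."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    # smallest axis-aligned box enclosing both inputs
    c_area = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))
    return inter / union - (c_area - union) / c_area

print(giou((0, 0, 2, 2), (0, 0, 2, 2)))   # 1.0 for identical boxes
```

Unlike plain IoU, GIoU stays informative (negative, but finite and differentiable) for non-overlapping boxes, which is why it is preferred for box regression in set-based detectors.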
Set-based matching is conducted as in DETR and SparseR-CNN, consistent with recent object detection literature.
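A toy version of the set-based matching is sketched below, using brute-force enumeration as a stand-in for the Hungarian algorithm (in practice one would use `scipy.optimize.linear_sum_assignment`). The cost here combines only an L1 box term and a classification term with illustrative weights; the gIoU and polygon terms are omitted for brevity.

```python
import numpy as np
from itertools import permutations

def match(cost):
    """Optimal one-to-one assignment by brute force over permutations;
    a small-scale stand-in for the Hungarian algorithm."""
    n_pred, n_gt = cost.shape
    best, best_perm = np.inf, None
    for perm in permutations(range(n_pred), n_gt):   # perm[g] = pred index for GT g
        total = sum(cost[p, g] for g, p in enumerate(perm))
        if total < best:
            best, best_perm = total, perm
    return best_perm

# Toy cost: lam_l1 * L1 box distance + lam_cls * (1 - class score).
lam_l1, lam_cls = 5.0, 2.0   # illustrative weights, not the paper's values
pred_boxes = np.array([[0.1, 0.1, 0.3, 0.3], [0.6, 0.6, 0.9, 0.9]])
gt_boxes   = np.array([[0.62, 0.58, 0.9, 0.92], [0.1, 0.12, 0.28, 0.3]])
scores     = np.array([0.8, 0.9])   # predicted confidence for the GT class
cost = (lam_l1 * np.abs(pred_boxes[:, None] - gt_boxes[None]).sum(-1)
        + lam_cls * (1.0 - scores)[:, None])
print(match(cost))   # GT 0 matched to prediction 1, GT 1 to prediction 0
```

The matched pairs then receive the full supervised loss; unmatched predictions are trained only toward the background class.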
4. Handling Structures With Holes
PolyR-CNN treats each valid polygon (outer boundary or hole) as a separate instance. At inference, hole polygons are merged into their parents via a spatial join: a predicted polygon $H$ is assigned as a hole of a larger polygon $P$ if the centroid of $H$ lies within $P$. The final outline is then the set difference $P \setminus H$. Point-in-polygon tests use winding-number or ray-crossing methods.
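A minimal sketch of this spatial join, using the ray-crossing point-in-polygon test. The helper names and the tie-breaking rule (assign a hole to the tightest enclosing polygon) are assumptions of this sketch, not details from the paper.

```python
import numpy as np

def point_in_polygon(pt, poly):
    """Even-odd (ray-crossing) test: cast a ray to the right of `pt` and
    count edge crossings; an odd count means the point is inside."""
    x, y = pt
    inside = False
    n = len(poly)
    for i in range(n):
        (x1, y1), (x2, y2) = poly[i], poly[(i + 1) % n]
        if (y1 > y) != (y2 > y):                       # edge straddles the ray
            if x1 + (y - y1) * (x2 - x1) / (y2 - y1) > x:
                inside = not inside
    return inside

def shoelace_area(poly):
    a = np.asarray(poly, dtype=float)
    x, y = a[:, 0], a[:, 1]
    return 0.5 * abs(np.dot(x, np.roll(y, -1)) - np.dot(y, np.roll(x, -1)))

def assign_holes(polygons):
    """Spatial join: each polygon whose centroid falls inside a larger
    polygon is recorded as a hole of the tightest such parent."""
    areas = [shoelace_area(p) for p in polygons]
    by_area = sorted(range(len(polygons)), key=lambda i: areas[i])
    holes = {i: [] for i in range(len(polygons))}
    for i in range(len(polygons)):
        centroid = np.asarray(polygons[i], dtype=float).mean(axis=0)
        for j in by_area:                              # smallest candidate parent first
            if areas[j] > areas[i] and point_in_polygon(centroid, polygons[j]):
                holes[j].append(i)
                break
    return holes

outer = [(0, 0), (10, 0), (10, 10), (0, 10)]
hole  = [(4, 4), (6, 4), (6, 6), (4, 6)]
far   = [(20, 20), (22, 20), (22, 22), (20, 22)]
merged = assign_holes([outer, hole, far])
print(merged)   # the small inner square becomes a hole of the large one
```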
5. Empirical Evaluation and Comparison
PolyR-CNN is benchmarked on the CrowdAI and Inria datasets:
CrowdAI (300×300 patches, MS-COCO format)
| Method | Backbone | AP | AP₅₀ | AP₇₅ | AP_boundary |
|---|---|---|---|---|---|
| PolyBuilding '22 | ResNet-50 | 78.7 | 96.3 | 89.2 | – |
| PolyWorld '22 | R2U-Net | 63.3 | 88.6 | 70.5 | 50.0 |
| PolyR-CNN | Swin-Base | 79.2 | 97.4 | 90.0 | 63.3 |
| PolyR-CNN | ResNet-50 | 71.1 | 93.8 | 82.9 | 50.0 |
| Method | Backbone | Epochs | FLOPs (G) | FPS |
|---|---|---|---|---|
| PolyWorld '22 | – | – | 181.23 | 8.4 |
| PolyBuilding '22 | ResNet-50 | 200 | 21.45 | 14.3 |
| PolyR-CNN | ResNet-50 | 100 | 21.91 | 32.7 |
| PolyR-CNN | Swin-Base | 100 | 46.55 | 20.5 |
With ResNet-50, PolyR-CNN achieves 71.1 AP at 32.7 FPS (4Ă— faster than PolyWorld) in 100 epochs. Using Swin-Base, it reaches 79.2 AP, matching the most accurate methods while operating at more than twice their speed.
Inria (512×512 patches, mask-IoU evaluation)
| Method | Needs sem-seg? | IoU | Acc. | FPS |
|---|---|---|---|---|
| Zorzi et al. '19 | ✓ | 59.81 | 93.92 | – |
| HiSup '22 | ✓ | 75.53 | 96.27 | 16.4 |
| PolyR-CNN | – | 68.35 | 95.09 | 35.7 |
PolyR-CNN, without any semantic segmentation prior, surpasses classical polygonization in accuracy and more than doubles the speed of the best segmentation-based pipeline.
6. Implementation Specifics
Key components and parameters:
- Backbones: ResNet-50 (ImageNet-1K), ResNeXt-101, Swin-Base (ImageNet-21K).
- FPN: C=256 channels, pyramid levels P2–P5.
- Proposals: a fixed number of learnable proposals per image, set separately for CrowdAI and Inria.
- Optimization: AdamW with a step-decayed learning rate, batch size 16 (2×A40 GPUs).
- Training Schedules: CrowdAI—100 epochs (ResNet-50/Swin-Base) with LR drops at epochs 54 and 83; ResNeXt-101—140 epochs. Inria—100 epochs with an LR drop at epoch 89.
- Data Augmentation: Horizontal and vertical flips.
- Hungarian matching: As in DETR and SparseR-CNN.
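The step schedule above can be expressed as a simple lookup. Only the milestone epochs come from the reported CrowdAI schedule; the base learning rate and the 10× decay factor in this sketch are illustrative assumptions.

```python
def lr_at_epoch(epoch, base_lr=2.5e-5, milestones=(54, 83), gamma=0.1):
    """Step decay: multiply the learning rate by `gamma` at each milestone
    epoch. base_lr and gamma are illustrative, not the paper's values."""
    drops = sum(epoch >= m for m in milestones)
    return base_lr * gamma ** drops

print(lr_at_epoch(0), lr_at_epoch(60), lr_at_epoch(90))
```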
PolyR-CNN thus integrates efficient architectural design, a principled geometric feature coupling, and streamlined training and inference procedures, establishing new speed-accuracy Pareto frontiers for end-to-end polygonal building outline extraction from high-resolution imagery (Jiao et al., 2024).