
Global Benchmark for Pavement Defect Detection

Updated 30 December 2025
  • The paper introduces a globally representative benchmark (PaveSync) that aggregates 52,747 images from 8 countries and 13 defect types to assess model generalizability.
  • It details the harmonized taxonomy and unified annotation formats across standards like Pascal VOC, COCO, and YOLO, ensuring consistent cross-model evaluations.
  • Rigorous benchmarking protocols using stratified sampling and zero-shot transfer reveal a ~25% mAP@50 drop on unseen data, highlighting practical domain shift challenges.

Automated pavement defect detection requires datasets and evaluation protocols that capture global variation in distress types, imaging contexts, materials, and environmental conditions. A globally representative benchmark is defined by comprehensive geographic coverage, harmonized taxonomies, standardized annotation formats, and rigorous assessment of model generalizability, including domain transfer.

1. Composition and Geographic Coverage of Representative Benchmarks

The "PaveSync" dataset constitutes the most globally representative benchmark for pavement defect detection published to date (Kyem et al., 23 Dec 2025). It amalgamates 52,747 images originating from seven countries—namely Iran, China, United States, Japan, India, Czech Republic, Norway, and Ghana—with 135,277 bounding box annotations spanning 13 distinct distress types. Table 1 summarizes country distribution and class coverage.

| Country        | Images | % of Total | Major Distress Types Present               |
|----------------|--------|------------|--------------------------------------------|
| Iran           | 14,983 | 28.4%      | Alligator, Longitudinal, Rutting, Pothole  |
| China          | 12,660 | 24.0%      | Longitudinal, Transverse, Alligator        |
| United States  | 7,174  | 13.6%      | Block, Alligator, Manhole, Rutting         |
| Japan          | 9,071  | 17.2%      | Longitudinal, Pothole, Transverse          |
| India          | 3,692  | 7.0%       | Pothole, Longitudinal, Repair              |
| Czech Republic | 1,213  | 2.3%       | Transverse, Longitudinal, Patching         |
| Norway         | 3,376  | 6.4%       | Edge Cracking, Rutting, Bleeding           |
| Ghana          | 578    | 1.1%       | Longitudinal, Pothole, Block               |

Four continents are represented, with imagery acquired via ground-level, aerial (drone), top-down, and pavement-level platforms. Illumination scenarios include clear, rainy, and snowy conditions, with seasonal shadow variability.

Earlier datasets such as IEEE GRDC (Heitzmann, 2022) included 21,041 images from Japan, India, and the Czech Republic, captured with a dashboard-mounted smartphone imaging protocol; however, coverage was restricted to four defect classes and three countries. DSPC (Behzadian et al., 2022) sampled only three urban locations in Missouri, USA, and seven defect types. Both lacked representation from Africa, Latin America, and Oceania, and captured little material, climate, or urban/rural diversity, limiting their global representativeness.

2. Taxonomy, Annotation, and Standardization

The taxonomy within PaveSync comprises 13 distress types, each mapped to a unique class identifier (0–12) and harmonized across source datasets. Examples include Bleeding (ID 0: 1,885 boxes), Rutting (ID 5: 17,399), Alligator Cracking (ID 7: 20,677), Longitudinal Cracking (ID 8: 33,353), Pothole (ID 4: 28,638), and Edge Cracking (ID 12: 1,714).

Annotation sources—Pascal VOC (XML), COCO (JSON), and YOLO (TXT)—were converted to a unified format. Quality control was enforced via stratified sampling validation, iterative box correction, and name harmonization to resolve synonyms, drop ambiguous and low-frequency classes (<300 instances), and ensure cross-dataset class-ID consistency. Axis-aligned bounding boxes (xmin, ymin, width, height) were used throughout (Kyem et al., 23 Dec 2025).
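
A minimal sketch of the format side of this harmonization, assuming per-image YOLO TXT lines and Pascal VOC XML files (helper and parameter names here are illustrative, not PaveSync's actual tooling): YOLO stores normalized centers and sizes, VOC stores absolute corner coordinates, and both map onto the unified (xmin, ymin, width, height) convention.

```python
# Sketch: harmonizing YOLO and Pascal VOC boxes into a unified
# (xmin, ymin, width, height) pixel format.
import xml.etree.ElementTree as ET

def yolo_to_unified(line: str, img_w: int, img_h: int):
    """YOLO TXT line: 'class xc yc w h', all normalized to [0, 1]."""
    cls, xc, yc, w, h = line.split()
    w_px, h_px = float(w) * img_w, float(h) * img_h
    xmin = float(xc) * img_w - w_px / 2
    ymin = float(yc) * img_h - h_px / 2
    return int(cls), (xmin, ymin, w_px, h_px)

def voc_to_unified(xml_path: str, class_to_id: dict):
    """Pascal VOC XML: <bndbox> holds absolute xmin/ymin/xmax/ymax."""
    boxes = []
    for obj in ET.parse(xml_path).getroot().iter("object"):
        name = obj.findtext("name")  # harmonized to a shared class-ID scheme
        bb = obj.find("bndbox")
        xmin, ymin = float(bb.findtext("xmin")), float(bb.findtext("ymin"))
        xmax, ymax = float(bb.findtext("xmax")), float(bb.findtext("ymax"))
        boxes.append((class_to_id[name], (xmin, ymin, xmax - xmin, ymax - ymin)))
    return boxes
```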

The GRDC taxonomy followed Japanese Maintenance Guidebooks, restricted to four high-frequency classes (longitudinal, lateral, alligator cracks, potholes), each with COCO-style bounding boxes (Heitzmann, 2022). DSPC included seven distress types, annotated via the CVAT tool with senior engineer adjudication (Behzadian et al., 2022).

The Shannon entropy of PaveSync's class label distribution is $H \approx 2.55$ nats (close to the maximum possible for 13 classes, $\log 13 \approx 2.56$), indicating near-uniform class diversity. Geographic entropy is $H_\mathrm{geo} \approx 0.92$ (normalized by $\log 8$), reflecting balanced multi-country sampling.
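
The calculation follows the standard definition; below is a minimal sketch (natural-log entropy, optionally normalized), using the Table 1 country counts as example input. Exact values depend on the paper's counting and normalization conventions.

```python
import math

def shannon_entropy(counts, normalize_k=None):
    """Shannon entropy H = -sum(p log p) in nats; divide by log K to normalize."""
    total = sum(counts)
    h = -sum((c / total) * math.log(c / total) for c in counts if c > 0)
    return h / math.log(normalize_k) if normalize_k else h

# Country-level image counts from Table 1 (Iran ... Ghana):
country_counts = [14983, 12660, 7174, 9071, 3692, 1213, 3376, 578]
h_geo = shannon_entropy(country_counts, normalize_k=8)  # normalized by log 8
```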

3. Benchmarking Protocols and Model Evaluation

PaveSync delineates its training and validation splits via stratified sampling across both class and country: 90% training (47,473 images) and 10% validation (5,274 images), with no overlap in geographic clusters or image sequences between splits. Special "zero-shot" transfer splits hold out one country (e.g., Ghana) during training and evaluate models exclusively on the held-out subset to assess cross-domain generalizability.
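
A minimal sketch of the holdout mechanics, assuming each image record carries a 'country' field (field and function names are illustrative, not PaveSync's actual tooling). Full stratification as described in the paper would additionally balance class and country proportions within each split; this sketch shows only the country holdout.

```python
# Sketch: country-held-out ("zero-shot") split over image metadata records.
import random

def zero_shot_split(records, holdout_country="Ghana", val_frac=0.1, seed=0):
    held_out = [r for r in records if r["country"] == holdout_country]
    in_domain = [r for r in records if r["country"] != holdout_country]
    rng = random.Random(seed)
    rng.shuffle(in_domain)
    n_val = int(len(in_domain) * val_frac)
    return {
        "train": in_domain[n_val:],  # fit on the remaining countries
        "val": in_domain[:n_val],    # in-domain model selection
        "test": held_out,            # evaluate transfer to the unseen country
    }
```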

Benchmarked models include YOLOv8 through YOLOv12 (CSPDarkNet backbone, Adam optimizer, cosine annealing learning rate from 0.001 to $1 \times 10^{-6}$, 1,000 epochs), Faster R-CNN (ResNet-50, SGD at learning rate 0.02, step decay), and DETR (ResNet-50, AdamW, linear warmup/decay), all trained with batch size 16 (Kyem et al., 23 Dec 2025).
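
In PyTorch terms, the reported YOLO-family schedule corresponds roughly to the following (an illustrative sketch; the actual detector and training loop are not reproduced here).

```python
# Sketch: Adam with cosine annealing from 1e-3 down to 1e-6 over 1,000 epochs.
import torch

model = torch.nn.Conv2d(3, 16, 3)  # stand-in for a detector backbone
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=1000, eta_min=1e-6
)

for epoch in range(1000):
    # ... one training pass over batches of 16 images would go here ...
    scheduler.step()  # anneal the learning rate once per epoch
```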

GRDC protocols used YOLOv5-x (142M params), YOLOv5-l (77M), and Faster R-CNN, with hyperparameters including batch sizes {8, 16, 32}, IoU threshold variation for NMS (0.80–0.999), and SGD optimizers (with or without Nesterov momentum) over 150 epochs (Heitzmann, 2022).

DSPC standardized a single YOLOv5-style one-stage detector for all teams, enforcing a fixed backbone architecture while allowing tuning of inference thresholds. Top teams employed annotation corrections, mosaic mixup, pixel inversion, and DCGAN-based augmentation (Behzadian et al., 2022).

4. Metrics and Quantitative Results

Detection performance is measured using Precision ($P = \frac{TP}{TP+FP}$), Recall ($R = \frac{TP}{TP+FN}$), F1-score ($F_1 = \frac{2PR}{P+R}$), per-class Average Precision ($AP_c$), and mean Average Precision ($mAP$) computed at various IoU thresholds (e.g., $mAP@50$).
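
These point metrics follow directly from matched detections at a fixed IoU threshold; a minimal sketch:

```python
def detection_scores(tp: int, fp: int, fn: int):
    """Precision, recall, and F1 from detection counts at a fixed IoU threshold."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# mAP@50 averages per-class AP at IoU 0.5; mAP@[50-95] additionally averages
# over IoU thresholds 0.50 to 0.95 in steps of 0.05 (COCO convention).
```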

PaveSync's class-averaged benchmarking results (see Table VII in Kyem et al., 23 Dec 2025):

| Model        | mAP@50 | mAP@[50–95] | Avg P | Avg R | Avg F1 |
|--------------|--------|-------------|-------|-------|--------|
| YOLOv8       | 0.672  | 0.462       | 0.657 | 0.679 | 0.668  |
| YOLOv10      | 0.667  | 0.458       | 0.658 | 0.672 | 0.665  |
| YOLOv12      | 0.655  | 0.445       | 0.643 | 0.658 | 0.650  |
| Faster R-CNN | 0.661  | 0.459       | 0.649 | 0.660 | 0.654  |
| DETR         | 0.635  | 0.432       | 0.627 | 0.642 | 0.634  |

Zero-shot transfer to Ghana (excluded from training):

| Model        | mAP@50 | Δ (vs. in-domain) | mAP@[50–95] | Δ (vs. in-domain) |
|--------------|--------|-------------------|-------------|-------------------|
| YOLOv8       | 0.498  | –0.174            | 0.312       | –0.150            |
| YOLOv10      | 0.512  | –0.155            | 0.328       | –0.130            |
| Faster R-CNN | 0.505  | –0.156            | 0.315       | –0.144            |
| DETR         | 0.472  | –0.163            | 0.298       | –0.134            |

This relative reduction of roughly 25% in mAP@50 quantifies real-world domain shift; YOLOv10 and Faster R-CNN displayed the best zero-shot generalization.

GRDC's best ensemble F1 was 0.68 (test1) and 0.677 (test2); single-model F1 scores ranged from 0.49 to 0.52 (Heitzmann, 2022). DSPC's leaderboard winner achieved F1 = 0.8953 through aggressive augmentation and annotation refinement; the runner-up scored at least 0.76, with performance varying by distress type and capture modality (Behzadian et al., 2022).

5. Data-centric Innovations and Generalization Strategies

Data-centric techniques underpin advances in benchmark utility and generalization. In PaveSync, harmonization of annotation formats and distress definitions eliminated ambiguities and allowed for a unified class-ID scheme, facilitating cross-model and cross-domain evaluation.

Semi-supervised labeling—where model predictions are used to propose labels on unlabeled batches, followed by targeted manual correction—was shown in DSPC to efficiently rectify annotation errors and optimize labor (Behzadian et al., 2022).
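
A minimal sketch of that loop, with 'model.predict' and the box '.score' attribute as hypothetical inference hooks:

```python
def propose_labels(model, unlabeled_images, conf_thresh=0.5):
    """Pseudo-labeling pass: keep confident predictions as draft annotations.

    Annotators then correct the drafts instead of labeling from scratch,
    concentrating manual effort where the model is wrong.
    """
    proposals = []
    for img in unlabeled_images:
        boxes = [b for b in model.predict(img) if b.score >= conf_thresh]
        proposals.append((img, boxes))
    return proposals
```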

Augmentation strategies included mosaic mixing (4-image mixup), geometric transformations, hue/saturation jitter, pixel inversion, and, in DSPC, GAN-based synthetic data generation for rare class enrichment (+0.09 F1 isolated gain). Inference robustness was further enhanced by test-time augmentation and confidence/NMS threshold grid search (optimal confidence: 0.4–0.6, optimal NMS: 0.4–0.5 for crowded scenes) (Behzadian et al., 2022).
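
The threshold search reduces to a small grid evaluated on held-out data; a sketch, where 'evaluate_f1' is a hypothetical callback that runs inference at the given thresholds and scores predictions against ground truth:

```python
# Sketch: grid search over confidence and NMS thresholds, keeping the pair
# that maximizes validation F1 (ranges follow the optima quoted above).
import itertools

def tune_thresholds(evaluate_f1, confs=(0.4, 0.5, 0.6), nms_ious=(0.4, 0.45, 0.5)):
    best = max(
        itertools.product(confs, nms_ious),
        key=lambda cn: evaluate_f1(conf=cn[0], nms_iou=cn[1]),
    )
    return {"conf": best[0], "nms_iou": best[1]}
```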

Model ensembling across training hyperparameters and network variants, as in GRDC top-5 solutions, amplified F1 and stabilized results under varied lighting and road textures (Heitzmann, 2022).
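
One simple way to realize such an ensemble is to pool boxes from all member models and de-duplicate them with class-aware greedy NMS; the sketch below shows that common fusion choice, though the GRDC teams' exact schemes may have differed.

```python
# Sketch: fuse detections from several models via class-aware greedy NMS.
def iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union else 0.0

def ensemble_nms(detections, iou_thresh=0.5):
    """detections: (box, score, cls) tuples pooled from all member models."""
    keep = []
    for det in sorted(detections, key=lambda d: -d[1]):  # highest score first
        if all(d[2] != det[2] or iou(d[0], det[0]) < iou_thresh for d in keep):
            keep.append(det)
    return keep
```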

6. Limitations, Discussion, and Future Benchmark Directions

Despite these advances, several limitations persist. PaveSync exhibits class imbalance (e.g., underrepresented Block Cracking and Bumps and Sags), incomplete weather context (no night, hail, or fog imagery), annotations limited to bounding boxes, and an absence of temporal crack-evolution data (Kyem et al., 23 Dec 2025).

Coverage gaps remain in polar, tropical, and high-altitude environments, as well as in entire regions (e.g., Latin America, Oceania), across all benchmarks. A plausible implication is that current generalization estimates may overstate transfer capacity in truly unseen road and climate domains.

Future recommendations across all sources include expanding geographic and environmental coverage, introducing pixel-level segmentation and severity quantification, enabling evaluation across multiple sensor modalities (e.g., thermal, LiDAR), and developing public, standardized test sets for leaderboard-driven progress. Transfer benchmarking—systematic training/testing across sensor and viewpoint domains—remains essential for quantifying domain shift resilience and robustness (Kyem et al., 23 Dec 2025, Heitzmann, 2022, Behzadian et al., 2022).

7. Historical Evolution and Benchmark Impact

Benchmark development for pavement defect detection has evolved from regional, small-scale datasets (IEEE GRDC: 21,041 images, 3 countries, 4 classes (Heitzmann, 2022); DSPC: ~800 images, 1 country, 7 classes (Behzadian et al., 2022)) to globally harmonized corpora such as PaveSync (52,747 images, 8 countries, 13 classes (Kyem et al., 23 Dec 2025)).

Standardization of taxonomy and annotation protocols underpins fair cross-model comparison, transferability studies, and meaningful evaluation on unseen road environments. The emergence of multi-modal, stratified, and rigorously annotated benchmarks marks a pivotal advance, enabling objective progress measurement and fostering repeatable science in automated pavement infrastructure monitoring.
