Rail-5k Dataset Benchmark
- Rail-5k dataset is a comprehensive benchmark for visual rail defect detection using both annotated and uncurated images from diverse railway environments.
- It supports fully-supervised and semi-supervised learning paradigms with rigorous annotation protocols for object detection and semantic segmentation tasks.
- Key challenges include long-tailed defect distributions, fine-grained class distinctions, and robustness to real-world image corruptions.
The Rail-5k dataset is a large-scale benchmark designed for visual rail surface defect detection under real-world conditions. It encompasses a comprehensive collection of annotated and uncurated imagery sampled from diverse railway scenarios throughout China, addressing key practical challenges such as fine-grained classification, long-tailed defect distribution, and robustness to real-world image corruptions. The dataset’s design supports both fully-supervised and semi-supervised learning paradigms and establishes rigorous annotation and evaluation protocols for both object detection and semantic segmentation tasks (Zhang et al., 2021).
1. Collection and Imaging Protocol
Rail-5k consists of approximately 5,000 high-resolution RGB images captured in various operational railway environments, including tunnels, bridges, and both straight and curved tracks. Of these, 1,100 images bear expert-provided defect annotations; the remaining 4,000 images are unlabeled and intentionally uncurated, containing real-world corruptions such as motion blur, uneven lighting, and foreign objects.
Imaging hardware was mounted on inspection cars, with cameras positioned about 200 mm above the rail surface and oriented vertically downward. To ensure high annotation quality, frames exhibiting strong shadows or over-exposure were excluded. Labeled images are standardized to a resolution of 3648 × 2736 pixels, while the unlabeled subset was subjected only to minimal preprocessing, specifically removal of over-exposed and shadowed frames, with no resizing or color normalization imposed.
2. Expert Annotation and Defect Taxonomy
Annotation was performed by a cohort of 10 railway experts employing a multi-stage review, in which each image received evaluation by at least three experts to resolve ambiguities and enforce label consensus (no quantitative inter-annotator agreement coefficient is reported). The dataset defines a taxonomy of 13 distinct rail-related defect classes, with defect definitions drawn from railway standards:
- Running Surface – large, clear region on the rail head
- Contact Band – polished subregion beneath wheel contact
- Dark Contact Band – similar to contact band but darker
- Spalling – small, chip-like missing material (“stripped dent”)
- Crack – thin, diffuse fissures (annotated via mask)
- Corrugation – wavy, periodic wear, labeled along valleys
- Grinding – post-maintenance stripe patterns
- Fastener – broad clip connecting rail to sleeper
- Spike Screw – large fastening screw
- Set Screw – small adjustment screw
- Indentation – small, distinct dents
- Burning – localized thermal discoloration
- Welded Joint – flash from weld seam
The annotation protocol tailors the annotation type to the size and boundary clarity of each class, as summarized in the table:
| Size | Boundary | Example Classes | Annotation Type |
|---|---|---|---|
| Large | clear | Rail Surface, Fastener | Rectangle box |
| Large | obscure | Corrugation | Valley boundary |
| Small | clear | Spalling, Indentation | Tight box |
| Diffuse | sharp (Crack only) | Crack | Mask/union boxes |
Cracks, too thin for effective bounding-box annotation, are labeled by pixel-wise segmentation masks. For other fine or ambiguous cases, dense box unions are applied. Tools are not explicitly specified; common tools include LabelImg and class-specific custom mask editors.
3. Data Partitioning and Benchmarking Settings
The labeled portion of Rail-5k is divided for two principal research settings:
- Fully-supervised: The 1,100 labeled images are randomly split into training (≈880 images; 80%) and testing (≈220 images; 20%). No validation split is prescribed, but users may hold out 10% of the training images for hyper-parameter tuning.
- Semi-supervised: The same 220-image test set is retained. All remaining labeled data and the full set of 4,000 unlabeled images are used for joint semi-supervised training. The unlabeled subset introduces domain shifts and unknown corruptions and is left without any manual annotation, providing a challenging scenario for robustness evaluation.
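The two settings above can be sketched as follows. This is a minimal illustration, not the authors' split: the image IDs are placeholders, and the function name `split_labeled`, the seed, and the `(image_id, is_labeled)` pool layout are assumptions made for this example.

```python
import random

def split_labeled(image_ids, test_fraction=0.2, seed=0):
    """Randomly split labeled image IDs into (train, test) lists."""
    ids = sorted(image_ids)            # sort first so the shuffle is reproducible
    rng = random.Random(seed)          # hypothetical seed; the official split is not specified here
    rng.shuffle(ids)
    n_test = round(len(ids) * test_fraction)
    return ids[n_test:], ids[:n_test]

# Placeholder IDs standing in for the 1,100 labeled and 4,000 unlabeled images.
labeled = [f"img_{i:04d}" for i in range(1100)]
unlabeled = [f"raw_{i:04d}" for i in range(4000)]

train, test = split_labeled(labeled)

# Semi-supervised pool: labeled training images plus every unlabeled image.
# The test set is the same in both settings.
semi_pool = [(i, True) for i in train] + [(i, False) for i in unlabeled]
```

Keeping the test list fixed across both settings is what makes the fully-supervised and semi-supervised numbers comparable.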
4. Statistical Distribution and Major Challenges
The dataset exhibits an extreme long-tailed distribution of classes. Let $n_c$ denote the bounding-box count for class $c$. The imbalance ratio is
$$\rho = \frac{\max_c n_c}{\min_c n_c} = \frac{12{,}582}{14} \approx 899,$$
with spalling being the most populous class and welded joint the least. Table: class statistics.
| Class | #Boxes ($n_c$) | #Images |
|---|---|---|
| Spalling | 12,582 | 1,005 |
| Crack (mask) | 3,785 | 375 |
| Corrugation | 3,349 | 445 |
| Contact Band | 1,093 | 1,087 |
| Running Surface | 1,082 | 1,080 |
| Dark Contact Band | 773 | 769 |
| Fastener | 757 | 582 |
| Spike Screw | 502 | 424 |
| Set Screw | 414 | 360 |
| Indentation | 307 | 216 |
| Grinding | 337 | 179 |
| Burning | 41 | 10 |
| Welded Joint | 14 | 8 |
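As a worked check on the class statistics above, the following sketch computes the max-to-min imbalance ratio from the box counts and derives inverse-frequency sampling weights. The weighting scheme is an illustrative assumption, one common way to counter a long tail, and is not described in the source as part of the benchmark.

```python
# Box counts copied from the class statistics table.
box_counts = {
    "Spalling": 12582, "Crack": 3785, "Corrugation": 3349,
    "Contact Band": 1093, "Running Surface": 1082,
    "Dark Contact Band": 773, "Fastener": 757, "Spike Screw": 502,
    "Set Screw": 414, "Indentation": 307, "Grinding": 337,
    "Burning": 41, "Welded Joint": 14,
}

# Imbalance ratio: most frequent class count over least frequent class count.
imbalance_ratio = max(box_counts.values()) / min(box_counts.values())

# Inverse-frequency weights, normalized to sum to 1 (assumed scheme).
inverse = {name: 1.0 / count for name, count in box_counts.items()}
total = sum(inverse.values())
sampling_weights = {name: w / total for name, w in inverse.items()}

print(f"imbalance ratio: {imbalance_ratio:.1f}")
for name, w in sorted(sampling_weights.items(), key=lambda kv: -kv[1])[:3]:
    print(f"{name}: {w:.3f}")
```

With these weights the rarest classes (welded joint, burning) are drawn far more often than their raw frequency would allow, at the cost of repeating the same few boxes many times.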
The fine-grained nature of the classes (e.g., distinguishing contact band from dark contact band), the variable spatial scale, and the inherent difficulty of region annotation for diffuse defects such as cracks all contribute to the complexity. The uncurated, unlabeled images further introduce background domain shifts and object appearance corruptions.
5. Evaluation Metrics and Protocols
Performance on Rail-5k is evaluated by class-agnostic and class-specific object detection and segmentation metrics:
- Object Detection:
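- [email protected]: average precision at an intersection-over-union (IoU) threshold of 0.5,
- COCO-style AP@[0.5:0.95]: mean AP over IoU thresholds from 0.5 to 0.95 in steps of 0.05,
- A detection is a true positive if it has IoU $\geq 0.5$ with a ground-truth box of the same class.
AP per class is computed as the area under the precision-recall curve $p_c(r)$:
$$AP_c = \int_0^1 p_c(r)\,dr,$$
and mAP as the mean over the $C$ evaluated classes:
$$mAP = \frac{1}{C}\sum_{c=1}^{C} AP_c.$$
- Segmentation (Crack):
- Evaluated using intersection-over-union (IoU) for each class:
$$IoU_c = \frac{|P_c \cap G_c|}{|P_c \cup G_c|},$$
where $P_c$ and $G_c$ are the predicted and ground-truth pixel sets for class $c$.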
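The detection metrics above can be illustrated with a small pure-Python sketch. It assumes boxes in `(x1, y1, x2, y2)` format, greedy matching of detections in descending score order, and all-point interpolation of the precision-recall curve; these are common conventions and not necessarily the exact evaluation code used for the benchmark. The function names are chosen for this example.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def average_precision(detections, gt_boxes, iou_thr=0.5):
    """AP for a single class.

    detections: list of (image_id, score, box)
    gt_boxes:   dict mapping image_id -> list of ground-truth boxes
    """
    n_gt = sum(len(boxes) for boxes in gt_boxes.values())
    if n_gt == 0:
        return 0.0
    matched = {img: [False] * len(boxes) for img, boxes in gt_boxes.items()}
    tp_flags = []
    # Process detections from most to least confident.
    for img, _score, box in sorted(detections, key=lambda d: -d[1]):
        best_iou, best_j = 0.0, -1
        for j, gt in enumerate(gt_boxes.get(img, [])):
            iou = box_iou(box, gt)
            if iou > best_iou:
                best_iou, best_j = iou, j
        # True positive only if the best-overlapping box is still unmatched.
        if best_iou >= iou_thr and not matched[img][best_j]:
            matched[img][best_j] = True
            tp_flags.append(1)
        else:
            tp_flags.append(0)
    # Cumulative precision and recall after each detection.
    precisions, recalls, tp = [], [], 0
    for k, flag in enumerate(tp_flags, start=1):
        tp += flag
        precisions.append(tp / k)
        recalls.append(tp / n_gt)
    # Make precision monotonically non-increasing (interpolated envelope).
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    # Area under the interpolated precision-recall curve.
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap

def mean_ap(ap_per_class):
    """mAP: unweighted mean of per-class AP values."""
    return sum(ap_per_class.values()) / len(ap_per_class)

# Toy example: two ground-truth boxes, three detections (one false positive).
gt = {"a": [(0, 0, 10, 10)], "b": [(0, 0, 10, 10)]}
dets = [("a", 0.9, (0, 0, 10, 10)),
        ("a", 0.8, (20, 20, 30, 30)),
        ("b", 0.7, (1, 1, 10, 10))]
ap = average_precision(dets, gt)
```

Because mAP is an unweighted mean, a class with 14 boxes counts as much as one with 12,582, so rare classes can dominate the variance of the headline number.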
6. Baseline Results and Algorithmic Benchmarks
Representative baselines were established using YOLOv5-s for detection and DeepLabv3+ with a ResNet-50 backbone for segmentation:
- Detection (YOLOv5-s):
- Pre-trained on MS-COCO, trained for 300 epochs with standard hyperparameters, employing mosaic augmentation and GIoU loss.
- Detection performance varied widely by class; e.g., [email protected] is 98.9% for rail surface, 94.5% for contact band, 60.0% for spalling, and 24.0% for grinding. Indentation yielded 0 AP, and crack is evaluated as a segmentation task rather than a detection task.
- Segmentation (DeepLabv3+R-50):
- Trained for 9,000 iterations (batch size 16, SGD, momentum 0.01, weight decay 1e-4).
- Achieved IoU of 98.9% (background) and 67.8% (crack).
- Semi-supervised Detection (Pseudo-labeling):
- Unlabeled data are labeled by model inference and filtered by a confidence threshold $\tau$, followed by joint fine-tuning for 1 epoch at learning rate $\eta$.
- [email protected] for various $\tau$:
- : 63.29
- : 63.27
- : 62.43
- : 61.55
- Notably, performance drops under the semi-supervised regime, which suggests a significant domain gap and label noise in the unlabeled set (Zhang et al., 2021).
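The confidence-filtering step of the pseudo-labeling baseline can be sketched as below. The data layout, the function name `filter_pseudo_labels`, the example predictions, and the threshold value 0.5 are all hypothetical; the source does not specify how predictions are stored.

```python
def filter_pseudo_labels(predictions, tau):
    """Keep only detections whose confidence is at least tau.

    predictions: dict mapping image_id -> list of (class_name, score, box)
    Images with no surviving detection are dropped from the result.
    """
    kept = {}
    for image_id, detections in predictions.items():
        confident = [d for d in detections if d[1] >= tau]
        if confident:
            kept[image_id] = confident
    return kept

# Hypothetical model outputs on three unlabeled images.
raw_predictions = {
    "raw_0001": [("Spalling", 0.92, (10, 12, 40, 44)),
                 ("Grinding", 0.31, (0, 0, 200, 80))],
    "raw_0002": [("Indentation", 0.18, (5, 5, 15, 15))],
    "raw_0003": [("Fastener", 0.77, (100, 20, 260, 140))],
}

pseudo_labels = filter_pseudo_labels(raw_predictions, tau=0.5)
# pseudo_labels would then be merged with the labeled training set for fine-tuning.
```

A higher threshold keeps fewer but cleaner pseudo-labels; in a long-tailed setting it also tends to discard rare classes first, since the detector is least confident on them.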
7. Open Problems and Prospective Extensions
Challenges highlighted by Rail-5k include reliably handling rare and fine-grained defect classes, achieving robust performance under domain shift, and overcoming the limitations of bounding-box detectors for ill-defined or diffuse regions such as cracks. Indentation defects in particular yield negligible AP using standard detection models.
Suggested directions for advancing performance on Rail-5k encompass class-balanced sampling or focal loss variants for long-tail distributions, domain adaptation strategies or corruption augmentations to bridge labeled-unlabeled domain gaps, and robust semi-supervised methods such as consistency regularization and noise-aware pseudo-label filtering. The dataset creators plan future extensions incorporating multi-modal data sources (e.g., 3D scan, eddy-current data), which may help further boost reliability in operational rail inspection settings (Zhang et al., 2021).
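As one concrete instance of the long-tail remedies mentioned above, a binary focal loss can be sketched in a few lines. This follows the standard formulation $FL(p_t) = -\alpha_t (1 - p_t)^{\gamma} \log p_t$ with the commonly used defaults $\gamma = 2$ and $\alpha = 0.25$; it is an illustration of the general technique, not a configuration evaluated on Rail-5k.

```python
import math

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss for one prediction.

    p: predicted probability of the positive class, in [0, 1]
    y: ground-truth label, 1 (positive) or 0 (negative)
    """
    p_t = p if y == 1 else 1.0 - p              # probability assigned to the true label
    alpha_t = alpha if y == 1 else 1.0 - alpha  # class-balancing factor
    p_t = min(max(p_t, 1e-12), 1.0)             # clamp to avoid log(0)
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

easy = focal_loss(0.9, 1)   # well-classified positive: strongly down-weighted
hard = focal_loss(0.1, 1)   # misclassified positive: keeps most of its loss
```

The modulating factor $(1 - p_t)^{\gamma}$ shrinks the contribution of easy examples, so the abundant, easily detected classes contribute less to the gradient than the rare, hard ones.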
Rail-5k provides a valuable resource for benchmarking visual algorithms under stringent, practically relevant conditions, supporting research in object detection, fine-grained defect classification, segmentation, and semi-supervised learning robustness.