Rail-5k Dataset Benchmark

Updated 21 November 2025

The Rail-5k dataset is a comprehensive benchmark for visual rail defect detection, built from both annotated and uncurated images of diverse railway environments.
It supports fully-supervised and semi-supervised learning paradigms with rigorous annotation protocols for object detection and semantic segmentation tasks.
Key challenges include long-tailed defect distributions, fine-grained class distinctions, and robustness to real-world image corruptions.

The Rail-5k dataset is a large-scale benchmark designed for visual rail surface defect detection under real-world conditions. It encompasses a comprehensive collection of annotated and uncurated imagery sampled from diverse railway scenarios throughout China, addressing key practical challenges such as fine-grained classification, long-tailed defect distribution, and robustness to real-world image corruptions. The dataset’s design supports both fully-supervised and semi-supervised learning paradigms and establishes rigorous annotation and evaluation protocols for both object detection and semantic segmentation tasks (Zhang et al., 2021).

1. Collection and Imaging Protocol

Rail-5k consists of approximately 5,000 high-resolution RGB images captured in various operational railway environments, including tunnels, bridges, and both straight and curved tracks. Of these, 1,100 images bear expert-provided defect annotations; the remaining 4,000 images are unlabeled and intentionally uncurated, containing real-world corruptions such as motion blur, uneven lighting, and foreign objects.

Imaging hardware was mounted on inspection cars, with cameras positioned about 200 mm above the rail surface and oriented vertically downward. To ensure high annotation quality, frames exhibiting strong shadows or over-exposure were excluded. Labeled images are standardized to a resolution of 3648 × 2736 pixels, while the unlabeled subset was subjected only to minimal preprocessing, specifically removal of over-exposed and shadowed frames, with no resizing or color normalization imposed.

2. Expert Annotation and Defect Taxonomy

Annotation was performed by a cohort of 10 railway experts employing a multi-stage review, in which each image received evaluation by at least three experts to resolve ambiguities and enforce label consensus (no quantitative inter-annotator agreement coefficient is reported). The dataset defines a taxonomy of 13 distinct rail-related defect classes, with defect definitions drawn from railway standards:

Running Surface – large, clear region on the rail head
Contact Band – polished subregion beneath wheel contact
Dark Contact Band – similar to contact band but darker
Spalling – small, chip-like missing material (“stripped dent”)
Crack – thin, diffuse fissures (annotated via mask)
Corrugation – wavy, periodic wear, labeled along valleys
Grinding – post-maintenance stripe patterns
Fastener – broad clip connecting rail to sleeper
Spike Screw – large fastening screw
Set Screw – small adjustment screw
Indentation – small, distinct dents
Burning – localized thermal discoloration
Welded Joint – flash from weld seam

The annotation protocol tailors the bounding paradigm to the semantic and morphological properties of each class, as summarized in the table:

Size	Boundary	Example Classes	Annotation Type
Large	clear	Running Surface, Fastener	Rectangle box
Large	obscure	Corrugation	Valley boundary
Small	clear	Spalling, Indentation	Tight box
Diffuse	sharp (Crack only)	Crack	Mask/union boxes

Cracks are too thin for effective bounding-box annotation and are therefore labeled with pixel-wise segmentation masks. For other fine or ambiguous cases, unions of dense boxes are applied. The annotation tools are not explicitly specified; common choices for this kind of work include LabelImg and class-specific custom mask editors.

3. Data Partitioning and Benchmarking Settings

The labeled portion of Rail-5k is divided for two principal research settings:

Fully-supervised: The 1,100 labeled images are randomly split into training (≈880 images; 80%) and testing (≈220 images; 20%). No validation split is prescribed, but users may set aside 10% of the training images for hyper-parameter tuning.
Semi-supervised: The same 220-image test set is retained. All remaining labeled data and the full set of 4,000 unlabeled images are used for joint semi-supervised training. The unlabeled subset introduces domain shifts and unknown corruptions and is left without any manual annotation, providing a challenging scenario for robustness evaluation.
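The partitioning described above can be sketched in a few lines. This is only an illustration under stated assumptions: the image IDs are hypothetical placeholders, the seed and the function name `split_labeled` are invented here, and the actual random split used by the dataset authors is not reproduced by this code.

```python
import random

def split_labeled(image_ids, test_frac=0.2, val_frac=0.1, seed=0):
    """Randomly split labeled image IDs into train / val / test lists."""
    ids = sorted(image_ids)              # sort first so the shuffle is reproducible
    random.Random(seed).shuffle(ids)
    n_test = round(len(ids) * test_frac)
    test, rest = ids[:n_test], ids[n_test:]
    n_val = round(len(rest) * val_frac)  # optional hold-out taken from the training part
    val, train = rest[:n_val], rest[n_val:]
    return train, val, test

# Hypothetical IDs standing in for the 1,100 labeled and 4,000 unlabeled images.
labeled_ids = [f"labeled_{i:04d}" for i in range(1100)]
unlabeled_ids = [f"unlabeled_{i:04d}" for i in range(4000)]

train, val, test = split_labeled(labeled_ids)
print(len(train), len(val), len(test))   # 792 88 220

# Semi-supervised setting: same test set, labeled training data plus all unlabeled images.
semi_supervised_pool = train + val + unlabeled_ids
```

Setting `val_frac=0` recovers the plain 880/220 split; the test list is identical in both settings as long as the seed is fixed.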

4. Statistical Distribution and Major Challenges

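The dataset exhibits an extreme long-tailed distribution of classes. Let $n_c$ denote the bounding-box count for class $c$. The imbalance ratio is reported as

$\frac{\max_c n_c}{\min_{c:\,n_c>0} n_c} \approx 40.98$

a value that matches the spalling-to-indentation ratio (12,582/307). Spalling is the most populous class and welded joint the least; measured against welded joint (14 boxes), the same ratio is about 898.7. Per-class statistics are listed in the table below.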

Class	#Boxes ($n_c$)	#Images
Spalling	12,582	1,005
Crack (mask)	3,785	375
Corrugation	3,349	445
Contact Band	1,093	1,087
Running Surface	1,082	1,080
Dark Contact Band	773	769
Fastener	757	582
Spike Screw	502	424
Set Screw	414	360
Grinding	337	179
Indentation	307	216
Burning	41	10
Welded Joint	14	8
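As a quick check on the long-tail statistics, the imbalance ratio can be recomputed from the box counts in the table. The sketch below only uses those counts; the function name `imbalance_ratio` is introduced here for illustration. It evaluates the ratio over all thirteen classes and also the spalling-to-indentation ratio, which is the pairing that reproduces the figure of 40.98 quoted above.

```python
# Box counts copied from the class statistics table.
box_counts = {
    "Spalling": 12582, "Crack": 3785, "Corrugation": 3349,
    "Contact Band": 1093, "Running Surface": 1082, "Dark Contact Band": 773,
    "Fastener": 757, "Spike Screw": 502, "Set Screw": 414,
    "Grinding": 337, "Indentation": 307, "Burning": 41, "Welded Joint": 14,
}

def imbalance_ratio(counts):
    """max_c n_c / min_{c: n_c > 0} n_c over the given class counts."""
    nonzero = [n for n in counts.values() if n > 0]
    return max(nonzero) / min(nonzero)

overall = imbalance_ratio(box_counts)
spalling_vs_indentation = box_counts["Spalling"] / box_counts["Indentation"]

print(f"all classes:             {overall:.2f}")                  # 898.71
print(f"spalling vs indentation: {spalling_vs_indentation:.2f}")  # 40.98
```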

Several factors contribute to the difficulty: the fine-grained class distinctions (e.g., contact band versus dark contact band), the variable spatial scale of defects, and the inherent difficulty of annotating regions for diffuse defects such as cracks. The uncurated, unlabeled images further introduce background domain shifts and corruptions of object appearance.

5. Evaluation Metrics and Protocols

Performance on Rail-5k is evaluated by class-agnostic and class-specific object detection and segmentation metrics:

Object Detection:
- AP@[0.5]: average precision at intersection-over-union (IoU) threshold 0.5,
- COCO-style $\mathrm{mAP}@[0.5:0.95]$: mean AP over multiple IoU thresholds,
- A detection is a true positive if IoU $\geq$ 0.5 with a ground-truth box of the same class.
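The true-positive rule above can be illustrated with a small matching routine. This is a minimal sketch, not the benchmark's evaluation code: it assumes boxes in `(x1, y1, x2, y2)` format, a single class and a single image, and a greedy one-to-one matching in descending score order (a common convention that the source does not spell out). The names `box_iou` and `match_detections` are hypothetical.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def match_detections(dets, gts, iou_thr=0.5):
    """Mark each (score, box) detection as TP or FP for one class in one image.

    Assumption: each ground-truth box can be matched at most once, and
    detections are processed from highest to lowest score.
    """
    used = set()
    flags = []
    for score, box in sorted(dets, key=lambda d: d[0], reverse=True):
        best_iou, best_j = 0.0, None
        for j, gt in enumerate(gts):
            if j in used:
                continue
            iou = box_iou(box, gt)
            if iou > best_iou:
                best_iou, best_j = iou, j
        is_tp = best_j is not None and best_iou >= iou_thr
        if is_tp:
            used.add(best_j)
        flags.append((score, is_tp))
    return flags

gts = [(0, 0, 10, 10)]
dets = [(0.9, (0, 0, 10, 8)), (0.8, (1, 1, 10, 10)), (0.3, (20, 20, 30, 30))]
print(match_detections(dets, gts))
# [(0.9, True), (0.8, False), (0.3, False)]
```

The second detection overlaps the ground truth well but counts as a false positive because the box was already claimed by a higher-scoring detection.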

AP per class $c$ is computed as:

$\mathrm{AP}_c = \int_0^1 p_c(r)\,dr$

and mAP as:

$\mathrm{mAP} = \frac{1}{|C|} \sum_{c=1}^{|C|} \mathrm{AP}_c$
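In practice the integral in $\mathrm{AP}_c$ is approximated from a finite list of ranked detections. The sketch below uses all-point interpolation of the precision-recall curve, which is one common choice; the source does not state which interpolation the benchmark uses, so treat that as an assumption. The function names and the example AP values are illustrative only.

```python
def average_precision(flags, n_gt):
    """Area under the precision-recall curve (all-point interpolation).

    flags: TP/FP booleans for one class, sorted by descending confidence.
    n_gt:  number of ground-truth boxes of that class.
    """
    if n_gt == 0:
        return 0.0
    tp = fp = 0
    recalls, precisions = [0.0], [1.0]
    for is_tp in flags:
        if is_tp:
            tp += 1
        else:
            fp += 1
        recalls.append(tp / n_gt)
        precisions.append(tp / (tp + fp))
    # Replace precision by its running maximum from the right (precision envelope).
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    # Sum rectangle areas over the steps where recall increases.
    return sum((recalls[i] - recalls[i - 1]) * precisions[i]
               for i in range(1, len(recalls)))

def mean_ap(ap_per_class):
    """mAP = (1/|C|) * sum of per-class AP."""
    return sum(ap_per_class.values()) / len(ap_per_class)

ap = average_precision([True, False, True], n_gt=2)
print(round(ap, 4))                                               # 0.8333
print(round(mean_ap({"class_a": 0.8, "class_b": 0.4}), 4))        # 0.6
```

For the COCO-style metric, the same computation would be repeated with the true-positive flags recomputed at each IoU threshold from 0.5 to 0.95 and the results averaged.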

Segmentation (Crack):
- Evaluated using intersection-over-union (IoU) for each class:

$\mathrm{IoU} = \frac{|\mathrm{pred}\cap\mathrm{gt}|}{|\mathrm{pred}\cup\mathrm{gt}|}$
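A per-class mask IoU following this formula can be sketched in plain Python. The example assumes label maps stored as nested lists of integer class IDs (0 for background, 1 for crack, an encoding chosen here for illustration) and returns NaN when a class appears in neither mask, which is a convention of this sketch rather than something the source specifies.

```python
def class_iou(pred, gt, cls):
    """IoU of class `cls` between two label maps of identical shape."""
    inter = union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            in_pred, in_gt = (p == cls), (g == cls)
            if in_pred and in_gt:
                inter += 1
            if in_pred or in_gt:
                union += 1
    return inter / union if union else float("nan")

# Tiny hypothetical 2x3 label maps: 0 = background, 1 = crack.
pred = [[0, 1, 1],
        [0, 1, 0]]
gt = [[0, 1, 0],
      [0, 1, 1]]

print(class_iou(pred, gt, cls=1))  # 0.5 (2 shared crack pixels out of 4 in the union)
print(class_iou(pred, gt, cls=0))  # 0.5
```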

6. Baseline Results and Algorithmic Benchmarks

Representative baselines were established using YOLOv5-s for detection and DeepLabv3+ with a ResNet-50 backbone for segmentation:

Detection (YOLOv5-s):
- Pre-trained on MS-COCO, trained for 300 epochs with standard hyperparameters, employing mosaic augmentation and GIoU loss.
- Detection performance varied widely by class; e.g., AP@0.5 is 98.9% for running surface, 94.5% for contact band, 60.0% for spalling, and 24.0% for grinding. Indentation yielded an AP of 0, and crack is reported under segmentation rather than detection.
Segmentation (DeepLabv3+R-50):
- Trained for 9,000 iterations (batch size 16, SGD, momentum 0.01, weight decay 1e-4).
- Achieved IoU of 98.9% (background) and 67.8% (crack).
Semi-supervised Detection (Pseudo-labeling):
- Unlabeled images are labeled by model inference, the predictions are filtered by a confidence threshold $s_\text{thr}$, and the model is then jointly fine-tuned for 1 epoch at learning rate $4 \times 10^{-4}$.
- mAP@0.5 for various $s_\text{thr}$:
  - $s_\text{thr}=0.6$: 63.29
  - $s_\text{thr}=0.7$: 63.27
  - $s_\text{thr}=0.8$: 62.43
  - $s_\text{thr}=0.9$: 61.55
- Notably, performance drops under the semi-supervised regime, and the score declines further as $s_\text{thr}$ rises from 0.6 to 0.9, which points to a significant domain gap and label noise in the unlabeled set (Zhang et al., 2021).
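The confidence-filtering step of the pseudo-labeling baseline can be sketched as follows. This is an illustration under assumptions: the prediction format (a dict of image ID to a list of detections with `cls`, `score`, and `box` fields), the image IDs, and the choice to drop images with no surviving detection are all hypothetical and not taken from the source.

```python
def filter_pseudo_labels(predictions, s_thr):
    """Keep detections with confidence >= s_thr; drop images left with none."""
    kept = {}
    for image_id, dets in predictions.items():
        confident = [d for d in dets if d["score"] >= s_thr]
        if confident:
            kept[image_id] = confident
    return kept

# Hypothetical model outputs on three unlabeled images.
predictions = {
    "unlabeled_0001": [
        {"cls": "Spalling", "score": 0.95, "box": (120, 40, 160, 90)},
        {"cls": "Spalling", "score": 0.65, "box": (300, 55, 330, 80)},
    ],
    "unlabeled_0002": [
        {"cls": "Grinding", "score": 0.72, "box": (10, 10, 400, 200)},
    ],
    "unlabeled_0003": [
        {"cls": "Corrugation", "score": 0.41, "box": (50, 60, 500, 120)},
    ],
}

for s_thr in (0.6, 0.7, 0.8, 0.9):
    kept = filter_pseudo_labels(predictions, s_thr)
    n_boxes = sum(len(d) for d in kept.values())
    print(f"s_thr={s_thr}: {len(kept)} images, {n_boxes} pseudo-labeled boxes")
```

A higher threshold trades pseudo-label coverage for precision: fewer boxes survive, but those that do are less likely to be wrong. The surviving pseudo-labels would then be merged with the labeled training set for the fine-tuning pass.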

7. Open Problems and Prospective Extensions

Challenges highlighted by Rail-5k include reliably handling rare and fine-grained defect classes, achieving robust performance under domain shift, and overcoming the limitations of bounding-box detectors for ill-defined or diffuse regions such as cracks. Indentation defects in particular yield negligible AP using standard detection models.

Suggested directions for advancing performance on Rail-5k encompass class-balanced sampling or focal loss variants for long-tail distributions, domain adaptation strategies or corruption augmentations to bridge labeled-unlabeled domain gaps, and robust semi-supervised methods such as consistency regularization and noise-aware pseudo-label filtering. The dataset creators plan future extensions incorporating multi-modal data sources (e.g., 3D scan, eddy-current data), which may help further boost reliability in operational rail inspection settings (Zhang et al., 2021).
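Two of the suggested long-tail remedies are easy to illustrate. The sketch below shows the standard focal loss for a single sample, $-\alpha\,(1-p_t)^{\gamma}\log p_t$, and inverse-frequency sampling weights as one simple form of class-balanced sampling. The function names, the default $\gamma = 2$, and the three-class subset of counts are choices made for this example; none of this was evaluated on Rail-5k in the source.

```python
import math

def focal_loss(p_true, gamma=2.0, alpha=1.0):
    """Focal loss for one sample, given the probability assigned to its true class.

    With gamma=0 and alpha=1 this reduces to ordinary cross-entropy.
    """
    return -alpha * (1.0 - p_true) ** gamma * math.log(p_true)

def inverse_frequency_weights(counts):
    """Sampling weights proportional to 1/n_c, normalized to sum to 1."""
    inverse = {c: 1.0 / n for c, n in counts.items() if n > 0}
    total = sum(inverse.values())
    return {c: w / total for c, w in inverse.items()}

# Easy examples (p_true near 1) are down-weighted far more than hard ones.
print(f"easy sample: {focal_loss(0.9):.5f}")
print(f"hard sample: {focal_loss(0.1):.5f}")

# Box counts for three classes taken from the class statistics table.
weights = inverse_frequency_weights({"Spalling": 12582, "Burning": 41, "Welded Joint": 14})
for name, w in weights.items():
    print(f"{name}: {w:.4f}")
```

Pure inverse-frequency weighting can over-sample the rarest classes aggressively when the imbalance is this large, so smoothed variants are often preferred in practice.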

Rail-5k provides a valuable resource for benchmarking visual algorithms under stringent, practically relevant conditions, supporting research in object detection, fine-grained defect classification, segmentation, and semi-supervised learning robustness.

References

Zhang et al. (2021). Rail-5k: a Real-World Dataset for Rail Surface Defects Detection.
