PEDESTRIAN Dataset for Urban Obstacles
- The PEDESTRIAN dataset is a large-scale egocentric video collection designed for benchmarking urban obstacle detection from a pedestrian's perspective.
- It contains 82,120 frames labeled with 29 obstacle types collected under varied conditions using multiple smartphone models.
- Benchmark experiments using 16 deep CNN architectures achieved over 99.3% top-1 accuracy, highlighting its potential for real-time safety and infrastructure monitoring.
The PEDESTRIAN dataset, as introduced by Christou, Avraam, Drousiotou, and Artemiou, is a large-scale collection of egocentric video data targeting the automatic detection and recognition of obstacles on urban sidewalks from a pedestrian's point of view. Designed explicitly for the benchmarking and development of obstacle-detection algorithms, it addresses the paucity of balanced, labeled, and privacy-compliant resources for this domain. The dataset provides a comprehensive and well-structured benchmark supporting both fine-tuning and evaluation of deep learning models for egocentric pedestrian safety applications (Thoma et al., 22 Dec 2025).
1. Dataset Construction and Acquisition Protocol
The PEDESTRIAN dataset was acquired by recording 340 Full HD (1080×1920 px, portrait) videos while walking around Nicosia, Cyprus, with smartphones held at chest height to closely emulate a wearable-camera perspective. Two distinct smartphone models (Xiaomi Mi Mix 3 at 60 FPS and iPhone 7 at 30 FPS) were used to diversify sensor characteristics. Acquisition sessions were conducted in varied lighting conditions—daylight, overcast, and night—ensuring balanced representation across scenarios. Privacy-sensitive imagery (faces, license plates, identifying signs) was blurred in compliance with GDPR.
All obstacle categories emphasize typical urban impediments to pedestrian navigation, spanning static, infrastructural, and temporary hazards.
2. Data Composition, Class Distribution, and Organization
The dataset comprises 82,120 frames extracted from 340 annotated videos, with each frame labeled with one of 29 obstacle types nested within three high-level categories:
- Infrastructure: e.g., benches, bus stops, mail boxes, fences, parking meters.
- Physical Condition: e.g., cracks, holes/potholes, broken pavers, narrow/no pavement, litter.
- Temporary: e.g., parked vehicles (2-wheel, 4-wheel), construction barriers, traffic cones, scaffolding.
The minimum per-type frame count is 694 (“Crowded Pavement”); the maximum is 5,119 (“4-Wheel Vehicle”). For balanced recognition benchmarking, a subset was sampled with 500 frames per obstacle, yielding 14,500 images.
Dataset splits are:
- Training: 10,150 images (70%)
- Validation: 1,450 images (10%)
- Test: 2,900 images (20%)
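The subset and split counts above are internally consistent, as a quick sketch confirms (class count, per-class quota, and split ratios taken from the text):

```python
# Verify the balanced-subset and split arithmetic reported for the dataset.
NUM_CLASSES = 29          # obstacle types
FRAMES_PER_CLASS = 500    # balanced sampling quota per type

total = NUM_CLASSES * FRAMES_PER_CLASS   # 14,500 images
train = round(total * 0.70)              # 10,150
val = round(total * 0.10)                # 1,450
test = total - train - val               # 2,900

print(total, train, val, test)  # 14500 10150 1450 2900
```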
All annotation is at the frame level, encoded implicitly in file paths and names; there are no bounding-box or segmentation-mask labels. File naming follows the template: img_000001_ClassID_FrameIndex.jpg.
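Since labels are carried in filenames, a small parser suffices to recover them. The sketch below assumes the ClassID and FrameIndex placeholders are zero-padded integers; the concrete field widths are an assumption, as only the template is given:

```python
import re

# Hypothetical parser for the stated template img_000001_ClassID_FrameIndex.jpg;
# exact digit widths are assumptions, not confirmed by the dataset description.
PATTERN = re.compile(r"img_(\d+)_(\d+)_(\d+)\.jpg$")

def parse_frame_name(name: str):
    """Return (image_id, class_id, frame_index) parsed from a filename."""
    m = PATTERN.search(name)
    if m is None:
        raise ValueError(f"unexpected filename: {name}")
    return tuple(int(g) for g in m.groups())

print(parse_frame_name("img_000001_07_001234.jpg"))  # (1, 7, 1234)
```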
3. Annotation Protocol and Taxonomy
Only per-frame categorical annotation is provided; there is no spatial localization (bounding boxes) and no mask annotation. Each frame is assigned a single obstacle-class label, which serves both high-level (category, subcategory) and fine-grained (29-class) recognition. The three-level taxonomy is as follows:
| Category | Subcategory | Example Types |
|---|---|---|
| Infrastructure | Street furniture, signage | Bench, Bin, Light Fixture |
| Physical Cond. | Pavement state | Crack, Hole, Broken Paver |
| Temporary | Obstructing objects | Vehicle, Cone, Scaffolding |
Privacy compliance is maintained throughout. A PyTorch loader mapping filenames to integer labels is included.
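Because a single frame label implies both coarser levels, the taxonomy can be represented as a lookup from type to (category, subcategory). The mapping below is illustrative only, covering types named in the text; the official 29-class mapping ships with the dataset:

```python
# Illustrative type -> (category, subcategory) mapping; keys and values are
# placeholders drawn from the examples in the text, not the official IDs.
TAXONOMY = {
    "Bench":        ("Infrastructure", "Street furniture"),
    "Bin":          ("Infrastructure", "Street furniture"),
    "Crack":        ("Physical Condition", "Pavement state"),
    "Hole":         ("Physical Condition", "Pavement state"),
    "Traffic Cone": ("Temporary", "Obstructing objects"),
    "Scaffolding":  ("Temporary", "Obstructing objects"),
}

def coarse_labels(obstacle_type: str):
    """Derive the (category, subcategory) pair from a fine-grained type."""
    return TAXONOMY[obstacle_type]

print(coarse_labels("Crack"))  # ('Physical Condition', 'Pavement state')
```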
4. Benchmarking Protocols and Experimental Results
A key feature is the public release of all benchmarking splits and code. Sixteen deep CNN architectures pretrained on ImageNet (MobileNetV2/V3, EfficientNet-B0/V2, DenseNet, GoogLeNet, ResNet-18/34/50/101/152, ConvNeXt Small/Base/Large) were fine-tuned for both coarse (category, subcategory) and fine-grained (29-class) recognition targets. All models were trained using:
- Optimizer: Adam, learning rate 0.001
- Batch size: 32
- Epochs: 30
- Cross-entropy loss
- Model-specific standard preprocessing
- Training with both frozen and unfrozen backbone layers, each configuration repeated 5× over random splits.
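The repeated-random-split protocol can be sketched as a seeded, class-stratified partition (a minimal sketch, not the released benchmarking code; the `stratified_split` helper is hypothetical):

```python
import random
from collections import defaultdict

def stratified_split(labels, ratios=(0.70, 0.10, 0.20), seed=0):
    """Split sample indices into train/val/test, preserving per-class ratios."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, lab in enumerate(labels):
        by_class[lab].append(idx)
    train, val, test = [], [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        n_train = round(len(idxs) * ratios[0])
        n_val = round(len(idxs) * ratios[1])
        train += idxs[:n_train]
        val += idxs[n_train:n_train + n_val]
        test += idxs[n_train + n_val:]
    return train, val, test

# Five repetitions with distinct seeds, mirroring the 5x protocol.
labels = [c for c in range(29) for _ in range(500)]  # balanced subset
for seed in range(5):
    tr, va, te = stratified_split(labels, seed=seed)
    print(len(tr), len(va), len(te))  # 10150 1450 2900
```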
At the obstacle-type (29-way) level, all models exceed 99.3% top-1 accuracy on the test set, with ConvNeXt-Small (frozen) achieving 99.96%. Coarser category/subcategory recognition also exceeds 97%.
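The reported metric is standard top-1 accuracy, i.e. the fraction of test frames whose highest-scoring predicted class matches the label. A minimal illustration (toy labels, not dataset results):

```python
def top1_accuracy(predictions, targets):
    """Fraction of samples whose top-scoring class matches the label."""
    correct = sum(int(p == t) for p, t in zip(predictions, targets))
    return correct / len(targets)

# Toy check on hypothetical predicted/true labels.
print(top1_accuracy([3, 1, 2, 2], [3, 1, 2, 0]))  # 0.75
```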
5. Potential Applications and Use Cases
The dataset provides a foundational benchmark for:
- Smartphone/wearable obstacle detection: Real-time systems for urban navigation safety, especially useful for visually impaired or mobility-aid users.
- Municipal infrastructure monitoring: Automated reporting of hazards (cracks, obstacles), city maintenance.
- On-device warning systems: Near-real-time detection for haptic or audio alerts to prevent injuries.
Its design (GDPR compliance, public Zenodo DOI) supports both commercial and academic research.
6. Limitations and Future Directions
Known limitations are:
- Geographic and environmental bias: Data acquired only in Nicosia, Cyprus, under three lighting regimes; generalization to other urban contexts is not guaranteed.
- Lack of localization: Only frame-level class labels, no bounding boxes or pixelwise masks.
- Device diversity: Only two camera models at a fixed pedestrian height.
- Class imbalance: Although the balanced subset eliminates type imbalance, category/subcategory imbalance persists in the full dataset.
Planned extensions include broadening geographic coverage, adding detailed spatial annotation (box/mask), additional sensor modalities (e.g., depth, inertial), and developing federated-learning-ready protocols.
7. Data Access and Usage Guidelines
The complete dataset (≈5.9 GB) including full-resolution videos, the 14,500-frame balanced subset, and code (PyTorch loader, usage instructions) is publicly available via Zenodo at DOI: 10.5281/zenodo.10907945. No usage restrictions beyond academic citation are imposed.
All data are privacy-filtered and organized to facilitate rapid model development and reproducible benchmarking on pedestrian-centric urban obstacle detection tasks (Thoma et al., 22 Dec 2025).