AV-GPS-Dataset for Travel Mode Detection
- AV-GPS-Dataset is a curated GPS trajectory set with labeled travel modes (walking, bicycle, bus, railway) collected across varied urban environments.
- It employs a rigorous month-long data collection with high spatial precision and systematic preprocessing steps like outlier removal and subsampling.
- Benchmark evaluations using Random Forest classifiers demonstrate its effectiveness in deciphering urban mobility patterns and ensuring reproducibility.
The AV-GPS-Dataset is a publicly available, multi-modal GPS trajectory dataset specifically curated and benchmarked for travel mode detection in the context of human mobility research. It constitutes the first openly licensed resource to provide fine-grained ground-truth modal labels (walking, bicycle, bus, railway) for raw GPS trajectories over a substantial period and across varied urban environments. Released under a Creative Commons Attribution-NonCommercial (CC BY-NC 4.0) license, the AV-GPS-Dataset supports reproducible research in travel mode classification, robust feature extraction, and transferability analysis for mobility modeling (Chen et al., 2021).
1. Data Collection Protocol
The AV-GPS-Dataset was compiled using a month-long campaign (October 2020) with seven independent volunteers traversing various regions of the Tokyo/Kashiwa area in Japan, encompassing dense urban cores, suburban districts, rural segments, and metropolitan parks. Ground-truth annotation covers 212 walking trips, 138 bicycle segments, 56 bus rides, and 69 railway journeys, each with a minimum duration of 10 minutes. Data were captured using Android smartphones running a custom logging application. The system sampled GPS at 1 Hz, yielding raw spatial precision of approximately 2–3 meters, with error values recorded to meter resolution. This high-frequency raw logging permits detailed shape-based analysis but was later systematically subsampled to intervals of 1–5 minutes to emulate the temporal sparsity found in production-scale mobility records.
2. Data Organization, Schema, and Preprocessing
Each datum comprises a timestamp (UTC, ISO 8601), latitude/longitude (WGS84), estimated horizontal error, volunteer identifier, trip identifier, and mode label. Data are distributed as per-point CSV and GeoJSON records, hierarchically organized by volunteer and mode label:

    AV-GPS/
    ├─ volunteer_01/
    │  ├─ walking/
    │  │  ├─ trip_01.csv
    │  │  └─ …
    │  └─ bicycle/
    │     └─ …
    ├─ volunteer_02/
    └─ …

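Because the volunteer and mode are encoded in the directory layout, the labels for a trip file can be recovered from its path alone. The following is a minimal sketch under that assumption; `TripRef` and `parse_trip_path` are hypothetical helper names and are not part of the dataset's tooling.

```python
from pathlib import PurePosixPath
from typing import NamedTuple


class TripRef(NamedTuple):
    volunteer: str  # e.g. "volunteer_01"
    mode: str       # e.g. "walking"
    trip: str       # e.g. "trip_01"


def parse_trip_path(path: str) -> TripRef:
    """Split '<root>/<volunteer>/<mode>/<trip>.csv' into its three labels."""
    parts = PurePosixPath(path).parts
    if len(parts) < 3:
        raise ValueError(f"expected <volunteer>/<mode>/<trip file>, got {path!r}")
    # The last three path components carry the labels; anything before is the root.
    volunteer, mode, filename = parts[-3:]
    return TripRef(volunteer, mode, PurePosixPath(filename).stem)


ref = parse_trip_path("AV-GPS/volunteer_01/walking/trip_01.csv")
print(ref.volunteer, ref.mode, ref.trip)
```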
Three primary preprocessing steps were performed:
- Outlier removal: Points whose estimated horizontal error exceeded the dataset's error thresholds (specified in meters) were excluded.
- Subsampling: Trajectories for downstream benchmarks were resampled to intervals of 1–5 minutes using a closest-timestamp policy.
- Feature generation: Distance, velocity, and kinematic features were calculated on each subsampled segment post-cleaning.
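The first two preprocessing steps can be sketched as follows. This is an illustration rather than the authors' code: the `Point` tuple layout, the function names, and the `max_error_m` parameter are assumptions, and the error threshold is left to the caller because no value is fixed here. The closest-timestamp policy is interpreted as keeping, for each tick of a regular grid, the recorded point nearest in time.

```python
from typing import List, Tuple

# (timestamp in seconds, latitude, longitude, estimated horizontal error in m)
Point = Tuple[float, float, float, float]


def remove_outliers(points: List[Point], max_error_m: float) -> List[Point]:
    """Drop points whose estimated error exceeds max_error_m (caller-chosen)."""
    return [p for p in points if p[3] <= max_error_m]


def subsample_closest(points: List[Point], interval_s: float) -> List[Point]:
    """Resample onto a regular grid, keeping the point closest to each tick.

    Assumes points are sorted by timestamp; the grid starts at the first point.
    A point is never emitted twice in a row, even if it is closest to several ticks.
    """
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    if not points:
        return []
    out: List[Point] = []
    end = points[-1][0]
    tick, idx = points[0][0], 0
    while tick <= end:
        # Advance while the next point is at least as close to the tick.
        while (idx + 1 < len(points)
               and abs(points[idx + 1][0] - tick) <= abs(points[idx][0] - tick)):
            idx += 1
        if not out or out[-1] is not points[idx]:
            out.append(points[idx])
        tick += interval_s
    return out


# Ten minutes of 1 Hz logging reduced to one point per minute.
raw = [(t, 35.0, 139.0, 2.0) for t in range(601)]
sparse = subsample_closest(remove_outliers(raw, max_error_m=10.0), interval_s=60)
print(len(sparse))
```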
3. Annotation, Quality Control, and Labeling
Travel modes were divided into four mutually exclusive classes: walking (mean speed $1$–$7$ km/h), bicycle (non-motorized, $5$–$25$ km/h), bus (surface transit with frequent stops), and railway (heavy/commercial rail lines, typically higher average speeds). Volunteers manually selected their current mode at the start of each trip and were required to stop and restart the trip log whenever the mode changed. Modal transitions (e.g., walk-to-bus) are thus precise at the minute scale. Routes were repeated at different times of day and under different traffic conditions to ensure sample diversity and to test robustness to environmental variability. Additional annotation validation included enforcement of the minimum trip duration ($\geq 10$ min), random cross-checks against Google Maps’ published transit schedules, and secondary error-based outlier filtering.
4. Benchmark Tasks, Feature Set, and Evaluation
The canonical task is modal classification: mapping a subsampled trajectory segment (at intervals of $1$–$5$ min) to one of the four mode labels. The benchmark uses a Random Forest classifier (100 trees, unlimited depth), trained on 80% of the trips and validated via 10-fold cross-validation, with shuffling done at the trip level so that segments from the same trip never appear in both training and validation data.
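Trip-level shuffling can be illustrated with a small, dependency-free splitter. This is a sketch of one way to keep all segments of a trip in the same fold, not the benchmark's actual code; `trip_level_folds` and its parameters are hypothetical names.

```python
import random
from typing import Dict, Iterator, List, Sequence, Tuple


def trip_level_folds(
    trip_ids: Sequence[str], k: int = 10, seed: int = 0
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_idx, test_idx) pairs; all segments of a trip share a fold."""
    unique_trips = sorted(set(trip_ids))
    if k < 2 or k > len(unique_trips):
        raise ValueError("k must be between 2 and the number of distinct trips")
    random.Random(seed).shuffle(unique_trips)
    # Deal the shuffled trips round-robin into k folds.
    fold_of: Dict[str, int] = {t: i % k for i, t in enumerate(unique_trips)}
    for fold in range(k):
        test = [i for i, t in enumerate(trip_ids) if fold_of[t] == fold]
        train = [i for i, t in enumerate(trip_ids) if fold_of[t] != fold]
        yield train, test


# 20 trips with 3 segments each; trip_ids[i] is the trip of segment i.
trip_ids = [f"trip_{i // 3:02d}" for i in range(60)]
for train_idx, test_idx in trip_level_folds(trip_ids, k=10):
    assert not {trip_ids[i] for i in train_idx} & {trip_ids[i] for i in test_idx}
```

The resulting index lists can be passed to any classifier. With scikit-learn, for instance, `RandomForestClassifier(n_estimators=100, max_depth=None)` matches the stated configuration of 100 trees with unlimited depth, and `GroupKFold` offers a comparable group-aware split.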
The standardized feature set for inference comprises:
| # | Name | Formula / Definition |
|---|---|---|
| 1 | Distance | |
| 2 | Time | |
| 3 | Points | |
| 4 | VCR (mean) | |
| 5 | MVCR (max) | |
| 6 | MaxAccel | |
| 7 | AvgSpeed₁ | |
| 8 | MinSpeed | |
| 9 | MaxSpeed | |
| 10 | AvgSpeed₂ | |
where distances between consecutive points are computed as haversine distances.
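A few of the tabulated features follow directly from their names and can be sketched as below. The `Fix` tuple layout, the function names, and the output keys are assumptions made for illustration. The two averages shown (total distance over total time, and the mean of per-interval speeds) are two plausible readings of an average speed; whether they correspond to AvgSpeed₁ and AvgSpeed₂ is not confirmed here. VCR, MVCR, and MaxAccel are omitted because their exact definitions are not given.

```python
import math
from typing import Dict, List, Tuple

# (timestamp in seconds, latitude in degrees, longitude in degrees)
Fix = Tuple[float, float, float]

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius


def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in meters between two WGS84 coordinates."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def basic_features(fixes: List[Fix]) -> Dict[str, float]:
    """Distance, time, point count, and speed statistics for one segment."""
    if len(fixes) < 2:
        raise ValueError("need at least two fixes")
    pairs = list(zip(fixes, fixes[1:]))
    dists = [haversine_m(a[1], a[2], b[1], b[2]) for a, b in pairs]
    dts = [b[0] - a[0] for a, b in pairs]
    total_t = fixes[-1][0] - fixes[0][0]
    if total_t <= 0:
        raise ValueError("fixes must be time-ordered and span a positive duration")
    # Skip zero-length intervals to avoid division by zero.
    speeds = [d / dt for d, dt in zip(dists, dts) if dt > 0]
    total_d = sum(dists)
    return {
        "distance_m": total_d,
        "time_s": total_t,
        "points": len(fixes),
        "min_speed_mps": min(speeds),
        "max_speed_mps": max(speeds),
        "avg_speed_overall_mps": total_d / total_t,
        "avg_speed_pointwise_mps": sum(speeds) / len(speeds),
    }


# Three fixes one minute apart along the equator.
features = basic_features([(0, 0.0, 0.0), (60, 0.0, 0.001), (120, 0.0, 0.002)])
print(round(features["distance_m"], 1), features["points"])
```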
Reported performance for the walking vs. cycling binary task is 89.3% accuracy under coarse resampling, rising to 100% accuracy under fine resampling, demonstrating the critical influence of temporal granularity on modal discrimination.
5. Mathematical Formulations
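All key feature computations and metric definitions are explicitly detailed in LaTeX in the dataset's documentation. For example, the $i$-th inter-point distance is the haversine distance between the $i$-th pair of consecutive points, the associated velocity is that distance divided by the time elapsed between the two points, and the mean velocity change rate (VCR) summarizes how much the velocity changes between consecutive intervals.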
These mathematically precise formulations enable reproducible implementation and allow users to compare against alternative feature computation pipelines or extend the benchmark to additional kinematic descriptors.
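As a worked form of these quantities, one possible formulation is given below. The notation is assumed for illustration: $p_i = (\phi_i, \lambda_i)$ is the $i$-th point with latitude and longitude in radians, $t_i$ its timestamp, $R$ the Earth's radius, and $N$ the number of points in the segment. The distance uses the standard haversine formula. The VCR expression is an assumption based on the relative velocity change commonly used in GPS mode detection, and the dataset's own definition may differ.

$$
d_i = 2R \arcsin\sqrt{\sin^2\!\left(\frac{\phi_{i+1}-\phi_i}{2}\right) + \cos\phi_i \cos\phi_{i+1} \sin^2\!\left(\frac{\lambda_{i+1}-\lambda_i}{2}\right)},
\qquad
v_i = \frac{d_i}{t_{i+1}-t_i}
$$

$$
\mathrm{VCR}_i = \frac{\lvert v_{i+1} - v_i \rvert}{v_i} \quad (v_i > 0),
\qquad
\overline{\mathrm{VCR}} = \frac{1}{N-2} \sum_{i=1}^{N-2} \mathrm{VCR}_i
$$

A segment of $N$ points yields $N-1$ distances and velocities and therefore $N-2$ velocity change rates, which is why the mean runs to $N-2$.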
6. Real-world Case Study and Practical Guidance
To highlight practical impact and domain transfer, the authors trained their RF classifier on AV-GPS and evaluated it on NTT DOCOMO’s large-scale national trajectory repository, after resegmenting that data at a fixed time interval and mapping all segments to the AV-GPS feature space. The pipeline (1) extracted the features listed above, (2) predicted the mode, and (3) aggregated city-level statistics, reinforcing walking's prevalence in urban cores and identifying elevated biking risk in suburban districts. Reported accuracy on the held-out portion of AV-GPS reaches 100% for walking/biking discrimination.
For practitioners, recommended best practices include discarding all samples whose GPS-reported error exceeds the chosen threshold (in meters), using finer sampling intervals when feasible, and exploring feature augmentation (e.g., stop durations, turn angles, map matching). Cross-validation by volunteer is endorsed as a more accurate estimate of how well a model generalizes to unseen users, and gradient boosting methods (e.g., XGBoost, LightGBM) are advised for incremental accuracy gains over baseline random forests. Further extensions include multimodal sensor fusion and domain adaptation for new geographies.
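Cross-validation by volunteer can be read as leave-one-volunteer-out evaluation: each volunteer's segments are held out in turn while the model is trained on everyone else. The sketch below illustrates that reading; the function name and the identifier format are hypothetical.

```python
from typing import Iterator, List, Sequence, Tuple


def leave_one_volunteer_out(
    volunteer_ids: Sequence[str],
) -> Iterator[Tuple[str, List[int], List[int]]]:
    """Yield (held_out_volunteer, train_idx, test_idx), one split per volunteer."""
    for held_out in sorted(set(volunteer_ids)):
        test = [i for i, v in enumerate(volunteer_ids) if v == held_out]
        train = [i for i, v in enumerate(volunteer_ids) if v != held_out]
        yield held_out, train, test


# volunteer_ids[i] is the volunteer who recorded segment i.
volunteer_ids = ["v1", "v1", "v2", "v3", "v3", "v3"]
for held_out, train_idx, test_idx in leave_one_volunteer_out(volunteer_ids):
    print(held_out, len(train_idx), len(test_idx))
```

With the seven volunteers in AV-GPS this produces seven splits, each measuring performance on a person the model has never seen.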
7. Access, Licensing, and Citation
The AV-GPS-Dataset is released at https://github.com/AV-GPS/AV-GPS-Dataset under CC BY-NC 4.0. Users must appropriately cite: Jinyu Chen, Haoran Zhang, Xuan Song & Ryosuke Shibasaki (2021), “An open GPS trajectory dataset and benchmark for travel mode detection,” Journal of Location Based Services (Chen et al., 2021).
This dataset addresses a previously unmet need for rigorously labeled, open GPS mobility trajectories, thus providing a foundation for methodological benchmarking, generalization studies, and practical deployment of travel-mode detection models.