AV-GPS-Dataset for Travel Mode Detection

Updated 23 February 2026
  • AV-GPS-Dataset is a curated GPS trajectory set with labeled travel modes (walking, bicycle, bus, railway) collected across varied urban environments.
  • It employs a rigorous month-long data collection with high spatial precision and systematic preprocessing steps like outlier removal and subsampling.
  • Benchmark evaluations with a Random Forest classifier report walking vs. cycling accuracy rising from 89.3% at 5-minute sampling to 100% at 1-minute sampling, giving a reproducible baseline for travel mode detection.

The AV-GPS-Dataset is a publicly available, multi-modal GPS trajectory dataset specifically curated and benchmarked for travel mode detection in the context of human mobility research. It constitutes the first openly licensed resource to provide fine-grained ground-truth modal labels (walking, bicycle, bus, railway) for raw GPS trajectories over a substantial period and across varied urban environments. Released under a Creative Commons Attribution-NonCommercial (CC BY-NC 4.0) license, the AV-GPS-Dataset supports reproducible research in travel mode classification, robust feature extraction, and transferability analysis for mobility modeling (Chen et al., 2021).

1. Data Collection Protocol

The AV-GPS-Dataset was compiled using a month-long campaign (October 2020) with seven independent volunteers traversing various regions of the Tokyo/Kashiwa area in Japan, encompassing dense urban cores, suburban districts, rural segments, and metropolitan parks. Ground-truth annotation covers 212 walking trips, 138 bicycle segments, 56 bus rides, and 69 railway journeys, each with a minimum duration of 10 minutes. Data were captured using Android smartphones running a custom logging application. The system sampled GPS at 1 Hz, yielding raw spatial precision of approximately 2–3 meters, with error values recorded to $10^{-7}$ meter resolution. This high-frequency raw logging permits detailed shape-based analysis but was later systematically subsampled to intervals of 1–5 minutes to emulate the temporal sparsity found in production-scale mobility records.

2. Data Organization, Schema, and Preprocessing

Each datum comprises a timestamp (UTC, ISO 8601), latitude/longitude (WGS84), estimated horizontal error, volunteer identifier, trip identifier, and mode label. Data are distributed as per-point CSV and GeoJSON records, hierarchically organized by volunteer and mode label:

AV-GPS/
├─ volunteer_01/
│   ├─ walking/
│   │   ├─ trip_01.csv
│   │   └─ …
│   └─ bicycle/
│       └─ …
├─ volunteer_02/
└─ …
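
As a minimal sketch of how this layout could be traversed, the snippet below indexes trip files by volunteer and mode using only the directory names. It assumes exactly the nesting shown above (`<root>/<volunteer>/<mode>/<trip>.csv`); the function name `index_trips` is hypothetical and not part of the dataset's tooling.

```python
from pathlib import Path


def index_trips(root):
    """Return (volunteer, mode, path) tuples for every trip CSV under root.

    Assumes the nesting <root>/<volunteer>/<mode>/<trip>.csv shown in the
    directory tree; the mode label is read from the parent directory name.
    """
    trips = []
    for csv_path in sorted(Path(root).glob("*/*/*.csv")):
        mode = csv_path.parent.name
        volunteer = csv_path.parent.parent.name
        trips.append((volunteer, mode, csv_path))
    return trips
```

Calling `index_trips("AV-GPS")` on a local copy would then yield one tuple per trip, which can be grouped by mode or by volunteer before any file is parsed.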

Three primary preprocessing steps were performed:

  1. Outlier removal: All points with estimated error $> Q3 + 1.5 \times \mathrm{IQR}$ or $> 93.1$ m were excluded.
  2. Subsampling: Trajectories for downstream benchmarks were resampled to intervals $\Delta t \in \{1,2,3,4,5\}$ minutes using a closest-timestamp policy.
  3. Feature generation: Distance, velocity, and kinematic features were calculated on each subsampled segment post-cleaning.
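
The first two steps can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: points are represented as dicts with hypothetical keys `t` (seconds) and `error` (metres), quartiles come from Python's `statistics.quantiles` with its default method (the source does not say which quartile convention was used), and "closest-timestamp" is interpreted as picking, for each target time on a regular grid, the recorded point nearest in time.

```python
import statistics
from bisect import bisect_left

MAX_ERROR_M = 93.1  # absolute error cap quoted in the text


def remove_outliers(points):
    """Drop points whose error exceeds Q3 + 1.5*IQR or the absolute cap."""
    errors = [p["error"] for p in points]
    limit = MAX_ERROR_M
    if len(errors) >= 2:
        # Quartile convention is an assumption (statistics default method).
        q1, _, q3 = statistics.quantiles(errors, n=4)
        limit = min(limit, q3 + 1.5 * (q3 - q1))
    return [p for p in points if p["error"] <= limit]


def subsample(points, interval_s):
    """Keep, for each grid time t0, t0+dt, ..., the point closest in time.

    Assumes points are sorted by ascending 't'.
    """
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    if not points:
        return []
    times = [p["t"] for p in points]
    picked, last_index = [], -1
    target = times[0]
    while target <= times[-1]:
        j = bisect_left(times, target)  # first point at or after the target
        # Prefer the earlier neighbour when it is at least as close.
        if j > 0 and target - times[j - 1] <= times[j] - target:
            j -= 1
        if j != last_index:  # never emit the same point twice
            picked.append(points[j])
            last_index = j
        target += interval_s
    return picked
```

For a 1 Hz trace, `subsample(remove_outliers(points), 300)` would emulate the 5-minute setting; passing 60 gives the 1-minute setting.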

3. Annotation, Quality Control, and Labeling

Travel modes were divided into four mutually exclusive classes: walking (mean speed $1$–$7$ km/h), bicycle (non-motorized, $5$–$25$ km/h), bus (surface transit with frequent stops), and railway (heavy/commercial rail lines, typically higher average speeds). Volunteers manually selected their current mode at trip commencement and were required to terminate and restart the trip log upon each mode change. Modal transitions (e.g., walk-to-bus) are thus precise at the minute scale. Routes were repeated across different diurnal cycles and traffic conditions to ensure sample diversity and to test robustness to environmental variability. Additional annotation validation included minimum trip duration enforcement ($\geq 10$ min), random cross-checks against Google Maps’ published transit schedules, and secondary error-based outlier filtering.

4. Benchmark Tasks, Feature Set, and Evaluation

The canonical task is modal classification: mapping a subsampled trajectory segment (at $\Delta t = 1$–$5$ min) to a mode $m \in \{\mathrm{walk}, \mathrm{bike}, \mathrm{bus}, \mathrm{rail}\}$. The benchmark uses a Random Forest classifier (100 trees, unlimited depth), trained on 80% of the trips and validated via 10-fold cross-validation (shuffled at the trip level for generalization integrity).
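
Shuffling at the trip level means that all segments cut from one trip land in the same fold, so a model is never validated on a trip it has partly seen. The sketch below illustrates one way to do this with the standard library only; the helper names and the `trip` key are hypothetical, and the source does not state which library or splitting routine the authors used.

```python
import random


def assign_trip_folds(trip_ids, k=10, seed=0):
    """Shuffle the unique trip identifiers and deal them into k folds."""
    unique = sorted(set(trip_ids))
    random.Random(seed).shuffle(unique)
    return {trip: i % k for i, trip in enumerate(unique)}


def split_by_fold(segments, fold_of, held_out):
    """Return (train, validation) segment lists for one held-out fold."""
    train = [s for s in segments if fold_of[s["trip"]] != held_out]
    valid = [s for s in segments if fold_of[s["trip"]] == held_out]
    return train, valid
```

Each of the `k` splits can then be passed to any classifier; in scikit-learn, for instance, `RandomForestClassifier(n_estimators=100, max_depth=None)` matches the stated hyperparameters, and `GroupKFold` provides a comparable grouped split.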

The standardized feature set $F$ for inference comprises:

| # | Name | Formula / Definition |
|---|------|----------------------|
| 1 | Distance | $\sum_{i=1}^{n-1} \mathrm{Dis}(P_i, P_{i+1})$ |
| 2 | Time | $t_n - t_1$ |
| 3 | Points | $n$ |
| 4 | VCR (mean) | $\frac{1}{n-2} \sum_{i=1}^{n-2} \frac{v_{i+1}-v_i}{\Delta t}$ |
| 5 | MVCR (max) | $\max_{i} \frac{v_{i+1}-v_i}{\Delta t}$ |
| 6 | MaxAccel | $\max_{i} \frac{v_{i+1}-v_i}{\Delta t}$ |
| 7 | AvgSpeed₁ | $\mathrm{Distance} / \mathrm{Time}$ |
| 8 | MinSpeed | $\min_{i} v_i$ |
| 9 | MaxSpeed | $\max_{i} v_i$ |
| 10 | AvgSpeed₂ | $\frac{1}{n-2} \sum_{i=1}^{n-2} v_i$ |

where $v_i = \frac{\mathrm{Dis}(P_i,P_{i+1})}{t_{i+1} - t_i}$ and $\mathrm{Dis}(\cdot,\cdot)$ denotes haversine distance.
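
The feature definitions above translate fairly directly into code. The following is a hedged sketch rather than the reference implementation: the Earth radius constant, the `(lat, lon, t_seconds)` tuple layout, and the output key names are assumptions, and the velocity change rate uses the per-pair interval $t_{i+1} - t_i$ as its denominator. Rows 5 and 6 share one formula in the table, so the sketch computes it once.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius; assumed constant


def haversine_m(p, q):
    """Great-circle distance in metres between two (lat, lon, ...) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (p[0], p[1], q[0], q[1]))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def segment_features(points):
    """Compute the ten tabulated features for one segment.

    points: list of (lat_deg, lon_deg, t_seconds) with strictly increasing t.
    """
    n = len(points)
    if n < 3:
        raise ValueError("need at least 3 points to form a velocity change")
    dists = [haversine_m(points[i], points[i + 1]) for i in range(n - 1)]
    dts = [points[i + 1][2] - points[i][2] for i in range(n - 1)]
    speeds = [d / dt for d, dt in zip(dists, dts)]            # v_i
    rates = [(speeds[i + 1] - speeds[i]) / dts[i] for i in range(n - 2)]
    distance = sum(dists)
    duration = points[-1][2] - points[0][2]
    return {
        "distance": distance,
        "time": duration,
        "points": n,
        "vcr": sum(rates) / (n - 2),
        "mvcr": max(rates),
        "max_accel": max(rates),  # same formula as MVCR in the table
        "avg_speed_1": distance / duration,
        "min_speed": min(speeds),
        "max_speed": max(speeds),
        # Follows the table literally: mean of the first n-2 speeds.
        "avg_speed_2": sum(speeds[: n - 2]) / (n - 2),
    }
```

The resulting dict can be flattened into a fixed-order vector before being handed to a classifier.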

Reported performance for the walking vs. cycling binary task is 89.3% accuracy and $\mathrm{F1}=0.88$ at $\Delta t = 5$ min (coarse resampling), rising to 100% accuracy and $\mathrm{F1}=0.99$ at $\Delta t = 1$ min (fine resampling), demonstrating the critical influence of temporal granularity on modal discrimination.

5. Mathematical Formulations

All key feature computations and metric definitions are explicitly detailed in LaTeX in the dataset's documentation. For example, the $i$-th inter-point distance and associated velocity are given as:

$$D_i = \mathrm{Dis}(P_i, P_{i+1}), \quad v_i = \frac{D_i}{t_{i+1} - t_i}$$

and the mean velocity change rate (VCR):

$$\mathrm{VCR} = \frac{1}{n-2}\sum_{i=1}^{n-2}\frac{v_{i+1} - v_i}{t_{i+1} - t_i}$$

These mathematically precise formulations enable reproducible implementation and allow users to compare against alternative feature computation pipelines or extend the benchmark to additional kinematic descriptors.
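
As a worked illustration of the VCR definition, consider a hypothetical segment of $n = 4$ points spaced 60 s apart with velocities $v_1 = 1.0$, $v_2 = 1.6$, and $v_3 = 1.3$ m/s. These numbers are invented for the example and are not taken from the dataset.

$$\mathrm{VCR} = \frac{1}{4-2}\left(\frac{1.6 - 1.0}{60} + \frac{1.3 - 1.6}{60}\right) = \frac{1}{2}\left(0.010 - 0.005\right) = 0.0025\ \mathrm{m/s^2}$$

Because the differences are signed, an acceleration followed by a deceleration partially cancels in the mean, whereas the maximum-based features (MVCR, MaxAccel) would report $0.010\ \mathrm{m/s^2}$ for the same segment.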

6. Real-world Case Study and Practical Guidance

To highlight practical impact and domain transfer, the authors trained their RF classifier on AV-GPS and evaluated it on NTT DOCOMO’s large-scale national trajectory repository, after mapping all segments to the AV-GPS feature space and resegmenting at $\Delta t = 5$ min. The pipeline (1) extracted the features listed in Section 4, (2) predicted the mode, and (3) aggregated city-level statistics, reinforcing walking's prevalence in urban cores and identifying elevated biking risk in suburban districts. Reported accuracy on the held-out portion of AV-GPS reaches 100% for walking/biking discrimination.

For practitioners, recommended best practices include discarding all samples with GPS-reported error $> 50$ m, using finer sampling intervals when feasible, and exploring feature augmentation (e.g., stop durations, turn angles, map matching). Cross-validation by volunteer is endorsed to accurately estimate model generalization capacity, and gradient boosting methods (e.g., XGBoost, LightGBM) are advised for incremental accuracy gains over baseline random forests. Additional extensions encompass multimodal sensor fusion and domain adaptation for applicability to novel geographies.
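
Two of these recommendations, the 50 m error filter and cross-validation by volunteer, can be sketched in a few lines. The dict keys `error` and `volunteer` and both function names are hypothetical, chosen only for illustration.

```python
def drop_high_error(points, max_error_m=50.0):
    """Keep only points whose reported GPS error is at most max_error_m."""
    return [p for p in points if p["error"] <= max_error_m]


def leave_one_volunteer_out(segments):
    """Yield (held_out, train, test) with one volunteer held out per split."""
    volunteers = sorted({s["volunteer"] for s in segments})
    for held_out in volunteers:
        train = [s for s in segments if s["volunteer"] != held_out]
        test = [s for s in segments if s["volunteer"] == held_out]
        yield held_out, train, test
```

With seven volunteers this produces seven splits, each testing on a person whose movement habits and device the model has never seen, which is a stricter estimate of generalization than trip-level shuffling.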

7. Access, Licensing, and Citation

The AV-GPS-Dataset is released at https://github.com/AV-GPS/AV-GPS-Dataset under CC BY-NC 4.0. Users must appropriately cite: Jinyu Chen, Haoran Zhang, Xuan Song & Ryosuke Shibasaki (2021), “An open GPS trajectory dataset and benchmark for travel mode detection,” Journal of Location Based Services (Chen et al., 2021).

This dataset addresses a previously unmet need for rigorously labeled, open GPS mobility trajectories, thus providing a foundation for methodological benchmarking, generalization studies, and practical deployment of travel-mode detection models.
