AV-GPS-Dataset for Travel Mode Detection

Updated 23 February 2026

AV-GPS-Dataset is a curated GPS trajectory set with labeled travel modes (walking, bicycle, bus, railway) collected across varied urban environments.
It was gathered in a month-long collection campaign with high spatial precision and preprocessed systematically through outlier removal and subsampling.
Benchmark evaluations using Random Forest classifiers show that it supports accurate travel mode detection and provide a reproducible baseline.

The AV-GPS-Dataset is a publicly available, multi-modal GPS trajectory dataset specifically curated and benchmarked for travel mode detection in the context of human mobility research. It constitutes the first openly licensed resource to provide fine-grained ground-truth modal labels (walking, bicycle, bus, railway) for raw GPS trajectories over a substantial period and across varied urban environments. Released under a Creative Commons Attribution-NonCommercial (CC BY-NC 4.0) license, the AV-GPS-Dataset supports reproducible research in travel mode classification, robust feature extraction, and transferability analysis for mobility modeling (Chen et al., 2021).

1. Data Collection Protocol

The AV-GPS-Dataset was compiled using a month-long campaign (October 2020) with seven independent volunteers traversing various regions of the Tokyo/Kashiwa area in Japan, encompassing dense urban cores, suburban districts, rural segments, and metropolitan parks. Ground-truth annotation covers 212 walking trips, 138 bicycle segments, 56 bus rides, and 69 railway journeys, each with a minimum duration of 10 minutes. Data were captured using Android smartphones running a custom logging application. The system sampled GPS at 1 Hz, yielding raw spatial precision of approximately 2–3 meters, with error values recorded to $10^{-7}$ meter resolution. This high-frequency raw logging permits detailed shape-based analysis but was later systematically subsampled to intervals of 1–5 minutes to emulate the temporal sparsity found in production-scale mobility records.

2. Data Organization, Schema, and Preprocessing

Each datum comprises a timestamp (UTC, ISO 8601), latitude/longitude (WGS84), estimated horizontal error, volunteer identifier, trip identifier, and mode label. Data are distributed as per-point CSV and GeoJSON records, hierarchically organized by volunteer and mode label.

Three primary preprocessing steps were performed:

Outlier removal: All points with estimated error $> Q3 + 1.5 \times \mathrm{IQR}$ or $> 93.1$ m were excluded.
Subsampling: Trajectories for downstream benchmarks were resampled to intervals $\Delta t \in \{1,2,3,4,5\}$ minutes using a closest-timestamp policy.
Feature generation: Distance, velocity, and kinematic features were calculated on each subsampled segment post-cleaning.
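
The first two steps can be illustrated with a minimal sketch. It is not the authors' implementation: the field names `t` (seconds) and `error_m` (meters) are hypothetical, and the quartile convention (the default of Python's `statistics.quantiles`) is an assumption, since the text does not state which one was used.

```python
from statistics import quantiles

def remove_outliers(points, hard_cap_m=93.1):
    """Drop points whose error exceeds Q3 + 1.5*IQR or the 93.1 m cap."""
    errors = [p["error_m"] for p in points]
    q1, _, q3 = quantiles(errors, n=4)  # assumed quartile convention
    fence = q3 + 1.5 * (q3 - q1)
    return [p for p in points
            if p["error_m"] <= fence and p["error_m"] <= hard_cap_m]

def subsample(points, interval_s):
    """Closest-timestamp policy: for each target time t0 + k*interval_s,
    keep the recorded point nearest in time to that target."""
    if not points:
        return []
    pts = sorted(points, key=lambda p: p["t"])
    out, idx = [], 0
    target, t_end = pts[0]["t"], pts[-1]["t"]
    while target <= t_end:
        # Advance while the next point is at least as close to the target.
        while (idx + 1 < len(pts)
               and abs(pts[idx + 1]["t"] - target) <= abs(pts[idx]["t"] - target)):
            idx += 1
        if not out or out[-1] is not pts[idx]:  # avoid emitting a point twice
            out.append(pts[idx])
        target += interval_s
    return out
```

For a 1 Hz log, `subsample(points, 60)` corresponds to the $\Delta t = 1$ min setting and `subsample(points, 300)` to $\Delta t = 5$ min.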

3. Annotation, Quality Control, and Labeling

Travel modes were divided into four mutually exclusive classes: walking (mean speed $1$–$7$ km/h), bicycle (non-motorized, $5$–$25$ km/h), bus (surface transit with frequent stops), and railway (heavy/commercial rail lines, typically higher average speeds). Volunteers manually selected their current mode at the start of each trip and were required to terminate and restart the trip log whenever the mode changed. Modal transitions (e.g., walk-to-bus) are thus precise at the minute scale. Routes were repeated at different times of day and under different traffic conditions to ensure sample diversity and to test robustness to environmental variability. Additional annotation validation included minimum trip duration enforcement ($\geq 10$ min), random cross-checks against Google Maps’ published transit schedules, and secondary error-based outlier filtering.

4. Benchmark Tasks, Feature Set, and Evaluation

The canonical task is modal classification: mapping a subsampled trajectory segment (at $\Delta t = 1$–$5$ min) to one of the four mode labels. The benchmark uses a Random Forest classifier (100 trees, unlimited depth), trained on 80% of the trips and validated via 10-fold cross-validation (shuffled at the trip level for generalization integrity).
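
Shuffling at the trip level means that all segments from one trip fall on the same side of every train/test split, so a model is never evaluated on a trip it has partly seen. The following is a rough sketch of one way to build such folds with the standard library only; the `trip_id` key and both function names are hypothetical, and this is not the benchmark's own code.

```python
import random

def trip_level_folds(trip_ids, k=10, seed=0):
    """Shuffle the unique trips and assign each one to a fold in 0..k-1."""
    unique = sorted(set(trip_ids))
    random.Random(seed).shuffle(unique)
    return {trip: i % k for i, trip in enumerate(unique)}

def split_by_fold(segments, fold_of, test_fold):
    """Return (train, test) so that no trip contributes to both lists."""
    train = [s for s in segments if fold_of[s["trip_id"]] != test_fold]
    test = [s for s in segments if fold_of[s["trip_id"]] == test_fold]
    return train, test
```

Calling `split_by_fold` once per fold index yields the ten train/test partitions; any classifier can then be fitted on `train` and scored on `test`.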

The standardized feature set for inference comprises:

#	Name	Definition
1	Distance	Total distance travelled over the segment
2	Time	Elapsed time of the segment
3	Points	Number of GPS points in the segment
4	VCR (mean)	Mean velocity change rate
5	MVCR (max)	Maximum velocity change rate
6	MaxAccel	Maximum acceleration
7	AvgSpeed₁	Average speed (variant 1)
8	MinSpeed	Minimum speed
9	MaxSpeed	Maximum speed
10	AvgSpeed₂	Average speed (variant 2)

where all distances between consecutive GPS points are haversine distances.

Reported performance for the walking vs. cycling binary task is 89.3% accuracy at coarse resampling, rising to 100% accuracy at fine resampling (within the $\Delta t \in \{1,2,3,4,5\}$ min range), demonstrating the critical influence of temporal granularity on modal discrimination.

5. Mathematical Formulations

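All key feature computations and metric definitions are explicitly detailed in LaTeX in the dataset's documentation. For example, the $i$-th inter-point distance and associated velocity are given as:

$d_i = \mathrm{haversine}(p_i, p_{i+1}), \qquad v_i = \dfrac{d_i}{t_{i+1} - t_i}$

and the mean velocity change rate (VCR):

$\mathrm{VCR} = \dfrac{1}{n-2} \sum_{i=1}^{n-2} \dfrac{|v_{i+1} - v_i|}{v_i}$

where $p_i$ and $t_i$ are the position and timestamp of the $i$-th point and $n$ is the number of points in the segment.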

These mathematically precise formulations enable reproducible implementation and allow users to compare against alternative feature computation pipelines or extend the benchmark to additional kinematic descriptors.
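
As an illustration of how such descriptors might be implemented, the sketch below computes haversine distances, point-to-point speeds, and a mean velocity change rate. It assumes the common definition of VCR as $|v_{i+1} - v_i| / v_i$, a mean Earth radius of 6,371 km, and an input of `(t_seconds, lat, lon)` tuples; these choices and the function names are assumptions, not taken from the dataset's documentation.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # assumed mean Earth radius in meters

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two WGS84 points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def speeds_mps(points):
    """Speeds in m/s between consecutive (t_seconds, lat, lon) points."""
    out = []
    for (t0, la0, lo0), (t1, la1, lo1) in zip(points, points[1:]):
        dt = t1 - t0
        if dt > 0:  # skip duplicated or out-of-order timestamps
            out.append(haversine_m(la0, lo0, la1, lo1) / dt)
    return out

def mean_vcr(speeds):
    """Mean of |v[i+1] - v[i]| / v[i]; pairs with v[i] == 0 are skipped."""
    rates = [abs(b - a) / a for a, b in zip(speeds, speeds[1:]) if a > 0]
    return sum(rates) / len(rates) if rates else 0.0
```

Skipping zero-speed pairs is one way to avoid division by zero at stops; an alternative pipeline might add a small constant to the denominator instead.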

6. Real-world Case Study and Practical Guidance

To highlight practical impact and domain transfer, the authors trained their RF classifier on AV-GPS and evaluated it on NTT DOCOMO’s large-scale national trajectory repository, after mapping all segments to the AV-GPS feature space and resegmenting them at a fixed minute-scale interval. The pipeline (1) extracted the features listed in Section 4, (2) predicted the mode, and (3) aggregated city-level statistics, reinforcing walking's prevalence in urban cores and identifying elevated biking risk in suburban districts. Reported accuracy on the held-out portion of AV-GPS reaches 100% for walking/biking discrimination.

For practitioners, recommended best practices include discarding all samples with GPS-reported error $> 93.1$ m, using finer sampling intervals when feasible, and exploring feature augmentation (e.g., stop durations, turn angles, map matching). Cross-validation by volunteer is endorsed to accurately estimate model generalization capacity, and gradient boosting methods (e.g., XGBoost, LightGBM) are advised for incremental accuracy gains over baseline random forests. Additional extensions encompass multimodal sensor fusion and domain adaptation for applicability to novel geographies.
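
Of the suggested augmentations, turn angles are simple to derive from the existing latitude/longitude columns. The sketch below is one possible formulation, assuming the standard initial-bearing formula on a sphere and a signed heading change in $[-180, 180)$ degrees; it is not part of the published feature set, and the function names are hypothetical.

```python
import math

def bearing_deg(lat1, lon1, lat2, lon2):
    """Initial bearing from point 1 to point 2, in degrees within [0, 360)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dlmb = math.radians(lon2 - lon1)
    y = math.sin(dlmb) * math.cos(phi2)
    x = (math.cos(phi1) * math.sin(phi2)
         - math.sin(phi1) * math.cos(phi2) * math.cos(dlmb))
    return math.degrees(math.atan2(y, x)) % 360.0

def turn_angles_deg(points):
    """Signed heading change at each interior point of a (lat, lon) list.
    Positive values are clockwise (right) turns."""
    bearings = [bearing_deg(*a, *b) for a, b in zip(points, points[1:])]
    return [((b2 - b1 + 180.0) % 360.0) - 180.0
            for b1, b2 in zip(bearings, bearings[1:])]
```

Summary statistics of these angles (for example the mean absolute turn per segment) could then be appended to the feature vector; whether they improve accuracy on this dataset would need to be checked empirically.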

7. Access, Licensing, and Citation

The AV-GPS-Dataset is released at https://github.com/AV-GPS/AV-GPS-Dataset under CC BY-NC 4.0. Users must appropriately cite: Jinyu Chen, Haoran Zhang, Xuan Song & Ryosuke Shibasaki (2021), “An open GPS trajectory dataset and benchmark for travel mode detection,” Journal of Location Based Services (Chen et al., 2021).

This dataset addresses a previously unmet need for rigorously labeled, open GPS mobility trajectories, thus providing a foundation for methodological benchmarking, generalization studies, and practical deployment of travel-mode detection models.

References

Chen, J., Zhang, H., Song, X., & Shibasaki, R. (2021). An open GPS trajectory dataset and benchmark for travel mode detection. Journal of Location Based Services.
