High-Fidelity Electric Vehicle Dataset

Updated 19 September 2025

High-fidelity EV datasets are comprehensive time-series data capturing diverse electric vehicle metrics including battery, dynamics, and auxiliary signals.
They utilize advanced acquisition methods such as OBD-II and CAN bus interfacing to ensure high temporal resolution and data standardization, enabling accurate energy modeling and diagnostics.
The datasets support research applications in energy consumption modeling, eco-driving strategies, machine learning-based anomaly detection, and infrastructure planning.

A high-fidelity electric vehicle (EV) dataset is a data resource that captures detailed, temporally resolved measurements on the operation, energy consumption, dynamics, auxiliary usage, and often environmental or behavioral context of electric vehicles under real-world or high-resolution simulation conditions. These datasets are critical for data-driven modeling, machine learning applications, energy consumption analysis, eco-driving and routing research, infrastructure planning, and the development of advanced EV control, diagnostics, or anomaly detection systems. The defining attributes of such datasets are their temporal granularity, breadth of signals (including both traditional and electric-specific metrics), representativeness with respect to operational variability, and attention to privacy preservation.

1. Definition and Core Characteristics

A high-fidelity EV dataset is defined by its detailed coverage of electric vehicle operation through time-series data capturing both conventional and electric powertrain signals. Such a dataset typically includes battery-specific measurements such as state-of-charge (SOC, in %), voltage (V), and current (A), along with auxiliary power usage (air conditioning and heating, in W or kW) and high-precision geospatial (GPS) information. For plug-in hybrid electric vehicles (PHEVs), the data also includes engine-related metrics, enabling study of operational mode switching. Essential to fidelity are fixed and frequent sampling (1 Hz or finer), standardization of recorded signals, and precise timestamping.

The VED dataset (Oh et al., 2019) exemplifies these traits. It offers second-level reporting (3 s latency for GPS, 1 s for vehicle speed) of SOC, voltage, and current, coupled with auxiliary loads and GPS traces at 7-digit latitude/longitude precision, and it covers all operating conditions across a large, heterogeneous fleet. The dataset allows instantaneous power to be computed as $P(t) = V(t) \times I(t)$ and supports cumulative energy calculation and efficiency derivation over both spatial and temporal axes.
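As a minimal sketch of these two calculations (not taken from the VED tooling), the Python below computes per-sample power and a left Riemann sum of energy. The function names, the list-based inputs, and the sign convention (positive current meaning discharge) are assumptions for illustration; a given dataset may define the sign of current differently.

```python
from typing import List, Sequence


def instantaneous_power_w(voltage_v: Sequence[float],
                          current_a: Sequence[float]) -> List[float]:
    """Per-sample power P(t) = V(t) * I(t), in watts."""
    if len(voltage_v) != len(current_a):
        raise ValueError("voltage and current must have the same length")
    return [v * i for v, i in zip(voltage_v, current_a)]


def cumulative_energy_kwh(power_w: Sequence[float],
                          timestamps_s: Sequence[float]) -> float:
    """Left Riemann sum E = sum_i P_i * dt_i, converted from joules to kWh."""
    if len(power_w) != len(timestamps_s):
        raise ValueError("power and timestamps must have the same length")
    energy_j = 0.0
    for i in range(len(power_w) - 1):
        dt = timestamps_s[i + 1] - timestamps_s[i]  # seconds to next sample
        energy_j += power_w[i] * dt
    return energy_j / 3.6e6  # 1 kWh = 3.6e6 J


# Hypothetical readings: assumed positive current = discharge.
power = instantaneous_power_w([350.0, 350.0, 350.0], [10.0, 20.0, -5.0])
energy = cumulative_energy_kwh(power, [0.0, 1.0, 2.0])
```

Using the actual timestamp differences rather than a fixed step keeps the sum meaningful when samples are unevenly spaced.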

2. Data Acquisition and Processing Methodologies

High-fidelity EV datasets use direct vehicle-bus interfacing via OBD-II loggers or manufacturer-specific CAN bus connections to access raw controller data. Data collection typically spans a wide range of operational conditions—urban, suburban, and highway driving; variable weather and seasons; and driver-specific behavioral variance. Wide deployment (e.g., 383 vehicles in VED; hundreds in EVBattery (He et al., 2022)) across an extensive geographical area or, for simulation datasets, synthesized coverage of comprehensive networks (e.g., Greater Chicago in (Moawad et al., 2021)) is essential for representativeness.

Quality is further ensured through standardized protocol usage (e.g., SAE J1979, J1962), rigorous de-identification for privacy (random fogging, geo-fencing, spatial bounding), and multimodal data merging (combining vehicle telemetry, traffic, weather, and map/elevation data). Cleaning phases remove spurious, non-travel periods and outlier datapoints (e.g., negative or implausible energy values), and alignment algorithms overcome differences in sample rate or GPS noise via context-aware mapping (such as sliding windows for road-matching (Ayman et al., 2020)).
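The alignment and cleaning steps can be illustrated with a small sketch. It is not the procedure of any cited dataset: it assumes a 1 Hz speed channel and a slower GPS channel, carries the most recent GPS fix forward up to a hypothetical `max_age_s`, and then drops rows with no fix or an implausible speed. The row keys (`gps`, `speed_kmh`) and thresholds are illustrative assumptions.

```python
from bisect import bisect_right
from typing import Any, Dict, List, Optional, Sequence


def align_last_known(target_ts: Sequence[float],
                     source_ts: Sequence[float],
                     source_vals: Sequence[Any],
                     max_age_s: float = 5.0) -> List[Optional[Any]]:
    """For each target timestamp, return the most recent source value,
    or None if there is none or it is older than max_age_s.
    source_ts must be sorted in ascending order."""
    out: List[Optional[Any]] = []
    for t in target_ts:
        k = bisect_right(source_ts, t) - 1  # index of last source sample <= t
        if k >= 0 and t - source_ts[k] <= max_age_s:
            out.append(source_vals[k])
        else:
            out.append(None)
    return out


def drop_implausible(rows: List[Dict[str, Any]],
                     speed_max_kmh: float = 250.0) -> List[Dict[str, Any]]:
    """Keep rows that have a GPS fix and a speed within [0, speed_max_kmh]."""
    return [r for r in rows
            if r["gps"] is not None and 0.0 <= r["speed_kmh"] <= speed_max_kmh]


# Hypothetical 1 Hz speed samples and two GPS fixes three seconds apart.
speed_ts = [0.0, 1.0, 2.0, 3.0, 4.0]
speeds = [30.0, 31.0, -4.0, 33.0, 34.0]
fixes = align_last_known(speed_ts, [0.0, 3.0], [(42.28, -83.74), (42.29, -83.74)])
rows = [{"t": t, "speed_kmh": s, "gps": g} for t, s, g in zip(speed_ts, speeds, fixes)]
clean = drop_implausible(rows)
```

A bounded hold (rather than unlimited forward fill) avoids attaching a stale position to samples recorded long after the last fix.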

Following acquisition, datasets are often segmented by trip, road segment, or fixed-length window to support downstream analysis that requires uniformity (e.g., snippets of 128 consecutive samples for battery health studies (He et al., 2022)). Datasets are organized as CSV or hierarchical file structures, supporting analysis at multiple levels (trip, segment, window, etc.).
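A rough sketch of trip-wise and fixed-length segmentation follows. The 128-sample default mirrors the snippet length mentioned above; the function names, the `trip_id` key, and the choice to discard a trailing partial window are assumptions for illustration.

```python
from itertools import groupby
from typing import Any, Dict, List, Sequence


def split_by_trip(rows: Sequence[Dict[str, Any]],
                  key: str = "trip_id") -> List[List[Dict[str, Any]]]:
    """Group consecutive rows sharing the same trip identifier.
    Assumes rows are already ordered by trip and then by time."""
    return [list(group) for _, group in groupby(rows, key=lambda r: r[key])]


def fixed_length_windows(samples: Sequence[Any],
                         window: int = 128,
                         stride: int = 128) -> List[Sequence[Any]]:
    """Cut a sequence into windows of exactly `window` samples.
    A trailing remainder shorter than `window` is discarded."""
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    last_start = len(samples) - window
    return [samples[s:s + window] for s in range(0, last_start + 1, stride)]


# 300 hypothetical samples yield two non-overlapping 128-sample snippets.
snippets = fixed_length_windows(list(range(300)))
```

Setting `stride` smaller than `window` produces overlapping snippets, which increases sample count at the cost of correlated windows.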

3. Key Recorded Metrics and Data Structures

A comprehensive high-fidelity EV dataset includes the following core data elements:

Class	Typical Signals	Temporal Resolution
Powertrain	Battery SOC (%), Voltage (V), Current (A); For ICE mode: MAF, RPM, fuel	1–3 sec or finer
Aux Loads	AC Power (kW/W), Heater Power (W)	1–10 sec
Dynamics	Speed (km/h or m/s), Acceleration (m/s²), Motor Torque (N·m), RPM	1 sec
Location	Latitude/Longitude (WGS84, high-prec.), Trip Start/End Times	1–3 sec
Battery	Cell/pack temperature (°C), voltage, min/max cell voltages	1–10 sec or per charge
Charging	Power (kW), energy delivered (kWh), session time, charger ID/type	1–60 sec (if session)
Identifiers/Meta	Vehicle ID, anonymized IDs, manufacturer, session/trip index	per-event/trip

Data is frequently organized into time-aligned multi-channel arrays (e.g., $X \in \mathbb{R}^{T \times D}$ for T timestamps, D channels) or as trip-wise tensors, with supporting meta-information and labels (vehicle, session, or fault/health annotations).
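One possible way to build such arrays is sketched below with NumPy. The channel names, the NaN padding value, and the boolean mask layout are assumptions for illustration, not a format prescribed by any of the cited datasets.

```python
from typing import Dict, List, Sequence, Tuple

import numpy as np


def stack_channels(channels: Dict[str, Sequence[float]],
                   order: Sequence[str]) -> np.ndarray:
    """Stack equal-length 1-D channels into X with shape (T, D)."""
    cols = [np.asarray(channels[name], dtype=float) for name in order]
    if len({c.shape for c in cols}) != 1:
        raise ValueError("all channels must have the same length")
    return np.stack(cols, axis=1)


def pad_trips(trips: List[np.ndarray]) -> Tuple[np.ndarray, np.ndarray]:
    """Pad (T_i, D) trip arrays into a (N, T_max, D) tensor filled with NaN,
    plus a boolean mask of shape (N, T_max) marking real samples."""
    t_max = max(x.shape[0] for x in trips)
    d = trips[0].shape[1]
    batch = np.full((len(trips), t_max, d), np.nan)
    mask = np.zeros((len(trips), t_max), dtype=bool)
    for n, x in enumerate(trips):
        batch[n, :x.shape[0]] = x
        mask[n, :x.shape[0]] = True
    return batch, mask


# Hypothetical two-channel trip with T = 3 timestamps and D = 2 channels.
X = stack_channels({"soc_pct": [80, 79, 78], "voltage_v": [350, 349, 348]},
                   order=["soc_pct", "voltage_v"])
batch, mask = pad_trips([X, X[:2]])
```

Keeping an explicit mask lets downstream models ignore padded positions instead of treating them as measurements.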

4. Analytical Uses and Research Applications

The breadth of high-fidelity EV datasets enables multifaceted research:

Energy Consumption Modeling: Construction of predictive models (regression, DNN, or hybrid physics-ML) to estimate instantaneous or cumulative energy use as a function of speed, acceleration, auxiliary usage, ambient environment, driver, and routing context. Instantaneous power is formalized as $P(t) = V(t) \times I(t)$, and for route-level modeling, cumulative energy is computed as $E = \sum_{i} P_i \Delta t_i$.
Machine Learning and Driver Modeling: With second-by-second data (or finer), advanced ML architectures—including LSTM, Transformer, CNN-BiLSTM cascades—can model temporal dependencies, behavior, and auxiliary energy use under varying operational and environmental regimes (Moawad et al., 2021, Yahyaabadi et al., 11 Aug 2025).
Traffic Simulation Calibration: The rich spatio-temporal content enables calibration and validation of micro- or mesoscopic traffic models (e.g., via POLARIS or SUMO), using trip-level or link-level energy consumption as the measurement target for eco-routing, infrastructure effect studies, and emergent congestion impact analysis.
Eco-driving and Control Strategies: Case studies show quantifiable energy reduction via eco-routing (route selection) or control intervention (e.g., pre-emptive acceleration to catch a “green wave”). Quantitative outcomes include >30% fuel savings in optimal signal coordination scenarios (Oh et al., 2019).
Battery Health and Capacity Estimation: Datasets that record charge session snippets (multi-variate time series) and assign binary or continuous labels (anomaly, capacity) support both unsupervised anomaly detection (autoencoders, dynamic decoders) and regression tasks for prognostics (He et al., 2022).
Synthetic Data Generation: Deep generative models (e.g., DeepAR, N-BEATS, DeepTCN (Channegowda et al., 2023)) can be trained on available real data to augment sparse datasets with high-fidelity artificial samples, enabling richer model training and robustness against overfitting.
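To make the energy consumption modeling item concrete, here is a hedged sketch of a simple physics-motivated regression. It assumes power is approximately linear in the terms of a standard longitudinal road-load expression ($v a$, $v$, $v^3$) plus a constant auxiliary load, and fits the coefficients by least squares. The feature set, function names, and synthetic coefficients are illustrative assumptions, not a model from the cited papers.

```python
import numpy as np


def road_load_features(speed_ms, accel_ms2) -> np.ndarray:
    """Feature matrix [v*a, v, v^3, 1] for each sample.
    v*a ~ inertial term, v ~ rolling resistance, v^3 ~ aerodynamic drag,
    1 ~ constant auxiliary load."""
    v = np.asarray(speed_ms, dtype=float)
    a = np.asarray(accel_ms2, dtype=float)
    return np.column_stack([v * a, v, v ** 3, np.ones_like(v)])


def fit_power_model(speed_ms, accel_ms2, power_w) -> np.ndarray:
    """Least-squares coefficients mapping the features to measured power."""
    features = road_load_features(speed_ms, accel_ms2)
    coef, *_ = np.linalg.lstsq(features, np.asarray(power_w, dtype=float),
                               rcond=None)
    return coef


def predict_power(coef, speed_ms, accel_ms2) -> np.ndarray:
    """Predicted power in watts for new speed/acceleration samples."""
    return road_load_features(speed_ms, accel_ms2) @ coef


# Synthetic check with made-up coefficients (not measured vehicle data).
rng = np.random.default_rng(0)
v = rng.uniform(0.0, 30.0, size=200)
a = rng.uniform(-2.0, 2.0, size=200)
true_coef = np.array([1500.0, 150.0, 0.4, 500.0])
p = road_load_features(v, a) @ true_coef
fitted = fit_power_model(v, a, p)
```

A linear fit like this is only a baseline; the learned (DNN or hybrid) models mentioned above would replace or augment it when nonlinear effects such as temperature or regenerative braking limits matter.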

5. Case Studies, Key Results, and Insights

Analyses leveraging high-fidelity EV datasets consistently yield high-resolution insights into operational efficiency and the impacts of environment, technology, and behavior:

Environmental Impacts: Fuel/energy economy is significantly affected by speed (optimal MPG at 16–20 m/s), driving environment (highways outperform city driving by a factor of ~3), and season (summertime economy up to 25% higher than winter) (Oh et al., 2019).
EV-Specific Energy Use: Link-level energy efficiency results from post-processed PHEV trips align with regulatory metrics (e.g., single-trip efficiency at 138 MPGe in city driving (Oh et al., 2019)), with elevation and speed variance identified as dominant correlates for transit fleet energy consumption (Ayman et al., 2020).
Operational Patterns and Variability: Analysis of aggregate charging behavior demonstrates predictable temporal clusters (bimodal morning and evening arrivals (Hashmi et al., 2 Sep 2024)); most real-world sessions are under 4 hours and under 10 kWh, with joint distribution methods (Roulette Wheel Selection) able to replicate these statistics for scenario generation.
Battery Diagnostics: Combined real-world records across several manufacturers and years (e.g., 1.2 million charging snippets in EVBattery (He et al., 2022)) allow robust anomaly and regression modeling. Dynamic auto-decoders leveraging top-h% error aggregation overcome the rarity and vehicle-level nature of anomaly labels.
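The top-h% aggregation idea in the battery diagnostics item can be sketched as follows: given per-snippet reconstruction errors for one vehicle, average only the largest h% to obtain a vehicle-level score. This is an illustrative reading of the description above; the function name, the rounding rule, and the default `h` are assumptions, and the exact procedure in He et al. (2022) may differ.

```python
import math
from typing import Sequence


def top_h_percent_score(snippet_errors: Sequence[float], h: float = 10.0) -> float:
    """Mean of the largest h% of snippet errors (at least one snippet).
    Intended as a vehicle-level anomaly score when labels exist only
    per vehicle, not per snippet."""
    if not snippet_errors:
        raise ValueError("snippet_errors must not be empty")
    if not 0.0 < h <= 100.0:
        raise ValueError("h must be in (0, 100]")
    k = max(1, math.ceil(len(snippet_errors) * h / 100.0))
    top = sorted(snippet_errors, reverse=True)[:k]
    return sum(top) / k


# Hypothetical vehicle: mostly low errors with two anomalous snippets.
errors = [0.1] * 18 + [5.0, 7.0]
score = top_h_percent_score(errors, h=10.0)
```

Averaging only the top fraction keeps a few highly abnormal snippets from being diluted by the many normal ones recorded on the same vehicle.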

6. Privacy, Access, and Dataset Limitations

Ethical and practical constraints motivate de-identification and data sanitization pipelines. These include spatial fogging, geo-fencing to obscure fine location data, and removal of calibration features that could leak user identity—all while maintaining sufficient spatio-temporal fidelity for detailed longitudinal research. All usage is governed by standard academic citation and open-source licensing terms where applicable (e.g., VED at https://github.com/gsoh/VED).
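A simplified sketch of two such sanitization steps is given below: removing GPS points near a trip's start and end, and testing whether a point lies inside a bounding box. The fixed radius, function names, and box parameters are assumptions for illustration; published pipelines (for example, ones that randomize the fogging distance) are more elaborate.

```python
import math
from typing import List, Sequence, Tuple

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius
Point = Tuple[float, float]   # (latitude, longitude) in degrees


def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance between two WGS84 points, in metres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def fog_trip_ends(points: Sequence[Point], radius_m: float = 500.0) -> List[Point]:
    """Drop every point within radius_m of the trip's first or last point."""
    if not points:
        return []
    start, end = points[0], points[-1]
    return [p for p in points
            if haversine_m(*p, *start) > radius_m
            and haversine_m(*p, *end) > radius_m]


def inside_box(point: Point, south: float, west: float,
               north: float, east: float) -> bool:
    """True if the point lies inside the given latitude/longitude box."""
    lat, lon = point
    return south <= lat <= north and west <= lon <= east


# Hypothetical trip heading north in steps of 0.001 degrees (~111 m).
trip = [(42.0 + i * 0.001, -83.7) for i in range(21)]
fogged = fog_trip_ends(trip, radius_m=500.0)
```

Trimming both ends hides likely home and work locations while leaving the interior of the trip intact for energy analysis.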

Notable dataset limitations are:

Sampling Inconsistency: Smartphone-based OBD sources can introduce variable temporal resolutions and may experience data drop-outs or intermittent recording (Osonuga et al., 5 Mar 2024).
Label Scarcity: In battery anomaly datasets, vehicle-level labels can make instance-level supervised learning unreliable; semi-supervised or unsupervised methods with robust aggregation become essential (He et al., 2022).
OEM Heterogeneity: Variations in available OBD-II PIDs and proprietary signals can limit cross-fleet standardization, requiring detailed mapping or interpolation of missing signals.
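The sampling inconsistency limitation can be checked before analysis with a small diagnostic. The sketch below flags gaps larger than a tolerance times the nominal sampling interval and reports the fraction of expected samples that are present; the function names, the 1.5 tolerance, and the 1 s nominal interval are illustrative assumptions.

```python
from typing import List, Sequence, Tuple


def find_gaps(timestamps_s: Sequence[float],
              nominal_dt_s: float = 1.0,
              tolerance: float = 1.5) -> List[Tuple[float, float]]:
    """Return (previous, current) timestamp pairs whose spacing exceeds
    tolerance * nominal_dt_s, i.e. likely drop-outs."""
    gaps = []
    for prev, cur in zip(timestamps_s, timestamps_s[1:]):
        if cur - prev > nominal_dt_s * tolerance:
            gaps.append((prev, cur))
    return gaps


def coverage_ratio(timestamps_s: Sequence[float],
                   nominal_dt_s: float = 1.0) -> float:
    """Observed sample count divided by the count expected at the nominal
    rate over the recorded span, capped at 1.0."""
    if len(timestamps_s) < 2:
        return 1.0
    expected = (timestamps_s[-1] - timestamps_s[0]) / nominal_dt_s + 1
    return min(1.0, len(timestamps_s) / expected)


# Hypothetical log with a drop-out between t = 2 s and t = 6 s.
ts = [0.0, 1.0, 2.0, 6.0, 7.0, 8.0]
gaps = find_gaps(ts)
coverage = coverage_ratio(ts)
```

Trips with low coverage or long gaps can then be excluded or interpolated, depending on how sensitive the downstream analysis is to missing samples.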

7. Future Directions and Research Potential

Emerging lines of inquiry and dataset expansions include:

Larger and More Diverse Datasets: Improvements in fidelity through higher-frequency native CAN bus streaming, cross-OEM standardization, and expansion into new usage contexts (shared mobility, ride-hailing).
Physics-Informed Learning: Fusion of high-fidelity measurement with first-principles vehicle modeling (e.g., hybrid neural-physics models) to improve interpretability and robustness of predictions.
Grid and Socio-Technical Analysis: Integration with grid simulation and joint traffic-load studies (see GreenEVT (Nilsson et al., 2023)) to elucidate the combined dynamics of vehicle charging and electric infrastructure stresses, especially during heavy-load scenarios (e.g., evacuations).
Synthetic and Scenario Datasets: Continued development of stochastic and generative toolkits (such as FlexiGen (Cabral et al., 11 Nov 2024)), producing parametric scenario datasets for V2G/V1G demand response algorithm evaluation, with configurable user behavior and traffic perturbations.
Privacy-Aware Data Sharing: Advancement of synthetic and federated approaches to enable broader data availability without compromising participant privacy or commercial sensitivity.

High-fidelity electric vehicle datasets thus represent a foundational resource for advancing empirical automotive research, enabling comprehensive data-driven modeling, algorithm development, and the deployment of next-generation electric mobility solutions.