OpenACC Datasets: Car-following & Pragma-Loop
- OpenACC Dataset is a dual benchmark resource combining a car-following experiment database for ACC system analysis with a pragma-loop dataset for LLM fine-tuning.
- The car-following database provides precise, high-frequency trajectory data for empirical evaluation of commercial ACC systems in both real-world and controlled settings.
- The pragma-loop dataset aids automated GPU offloading in legacy C/C++ code via supervised fine-tuning, improving directive generation accuracy and performance portability.
OpenACC Dataset refers to two distinct, high-impact open-access resources: (1) the openACC car-following database developed within the European Commission Joint Research Centre’s openData initiative for empirical characterization of commercial Adaptive Cruise Control (ACC) systems (Makridis et al., 2020), and (2) the ACCeLLiuM OpenACC pragma-loop dataset, curated as a supervised fine-tuning corpus for LLMs tasked with automated generation of OpenACC GPU directives for data-parallel C/C++ code (Jhaveri et al., 20 Sep 2025). Both are benchmark datasets in their respective domains, supporting the empirical study of vehicle automation impacts and machine programming for parallel architectures.
1. openACC Car-Following Experiment Database
The openACC car-following experiments database provides transparent, reproducible longitudinal vehicle trajectory data captured in controlled and real-world traffic, explicitly designed for quantifying and comparing the operational properties of state-of-the-art commercial ACC systems. The initiative addresses the scarcity of open information on the heterogeneity and behavioral mechanisms of deployed ACC algorithms, supporting regulatory, microscopic modeling, and safety-oriented research (Makridis et al., 2020).
Experimental Campaigns and Data Acquisition
Three major campaigns (late 2018–2019) form the core of the database, involving 16 vehicles (11 ACC-equipped, 5 human-driven “leaders”):
| Campaign | Test Setting | Vehicles (Types) | Duration/Route |
|---|---|---|---|
| N.1 | Motorways | 2–3-car platoons; human leaders (Fiat 500X, Volvo XC40) | 2 days, 833 km |
| N.2 | Motorways | 5-car mixed platoons; 4 ACC models, 1 human (Mini Cooper) | 3 days, 6.3 h |
| N.3 | Proving ground | 5 high-end ACC-equipped (Tesla, BMW, Mercedes, Audi) | 2 days, 12.6 h, 5.7 km loop |
Each vehicle is instrumented for full-trajectory capture: position (lat, lon, alt, ENU), speed (), inter-vehicle spacing (), driver/ACC engagement, collected at 10 Hz. Positioning technology includes u-blox GNSS (N.1, N.2) and Oxford RT-Range S differential GNSS (N.3), enabling precision up to 2 cm and speed accuracy of 0.02 m/s. When available, CAN/OBD interfaces provide ACC mode and pedal data.
Data Structure, Access, and Licensing
The data is organized into per-campaign directories comprising run-level CSV files (time, position, kinematics, driver mode), vehicle/campaign/specification metadata, and logs of key events. Each entry includes vehicle specifications (make, model, engine, battery, antenna offset), campaign context (dates, special events), and post-processing notes. Data are released under CC-BY 4.0 and can be accessed without registration via direct download (SCP/rsync or web interface).
Key Metrics and Control Models
Researchers have access to ground-truth variables for modeling:
- Time-gap (headway) error:
where (desired spacing at time-headway ).
- Idealized ACC PD response:
- Empirical response time (via cross-correlation):
where .
Comparative Findings and Research Themes
Noteworthy observations include:
- Commercial ACCs maintain smooth accelerations inside m/s² but amplify speed fluctuations (i.e., string instability) under perturbation, unlike human drivers who dampen such waves.
- Empirical controller response times for ACCs (1.6–2.7 s) exceed typical human values (~1.1 s), contradicting assumptions of instantaneous response in simulation models.
- Factory time-headway presets cluster between 1.0–1.4 s (min) and up to 2.5–3.5 s (max), with substantial inter-make variation.
- Tractive energy consumption in ACC platoons is 2.7–20.5% higher than human equivalents, mainly due to overreaction and lack of downstream anticipation.
- Surrogate safety indicators such as minimum Time-to-Collision remain favorable ( s) at minimal settings but degrade with platoon length due to string instability.
This dataset enables research on regulatory requirements (e.g., minimal safe headways, standardizing response metrics), improvements to microsimulation traffic models (e.g., incorporation of ACC delay, inertia), evaluation of energy efficiency strategies (e.g., cooperative-ACC or V2V communication), and analysis of automated emergency braking as fleet mix evolves (Makridis et al., 2020).
2. ACCeLLiuM OpenACC Pragma–Loop Dataset
The ACCeLLiuM OpenACC dataset provides a curated benchmark for supervised fine-tuning of LLMs to generate OpenACC directives for data-parallel C/C++ loops, targeting automated augmentation of legacy code with performance-critical GPU offloading annotations (Jhaveri et al., 20 Sep 2025).
Dataset Construction and Structure
The dataset comprises 4033 unique (pragma, loop) pairs where each entry consists of a minimal user code example (e.g., a serial or nested for-loop fragment after a <TARGET_PRAGMA_LOCATION> marker) and its corresponding, human-written OpenACC pragma. Original clause order and formatting are retained. Stratified by pragma “complexity” (number of clauses), the collection is partitioned into 3223 training and 810 testing instances. Each example is serialized as a JSONL chat-style prompt.
Extraction Pipeline
A three-phase pipeline was developed:
- GitHub Mining: Queried for C/C++ files with OpenACC pragmas (
#pragma acc loopand variants), yielding 1509 files and ≈30,749 candidates. - Syntactic Filtering: Employed Tree-sitter to parse and associate pragmas with the following data-parallel
forloops. Exclusion criteria removed non-data-parallel/redundant code, reducing the corpus to 10,503 valid pairs. - Deduplication and Complexity Stratification: Eliminated duplicates and ensured a representative complexity spectrum (0–2 up to 11+ clauses per pragma), resulting in the final set of 4033 distinct examples.
Corpus Composition: Directives and Clauses
Directive-type and clause distributions for the full dataset are as follows:
| Directive | Count | Percentage |
|---|---|---|
| loop | 2,565 | 63.6 % |
| parallel | 1,262 | 31.3 % |
| kernels | 137 | 3.4 % |
| serial | 22 | 0.5 % |
| other | 47 | 1.2 % |
| Clause | Count | Percentage |
|---|---|---|
| copyin | 2,900 | 71.9 % |
| copyout | 2,650 | 65.7 % |
| present | 2,350 | 58.2 % |
| reduction | 1,600 | 39.7 % |
| collapse | 1,000 | 24.8 % |
| gang | 900 | 22.3 % |
| vector | 600 | 14.9 % |
| private | 400 | 9.9 % |
The clause-distribution vector is:
for {copyin, copyout, present, reduction, collapse, gang, vector, private} respectively.
Use in LLM Supervised Fine-Tuning and Benchmarking
The ACCeLLiuM dataset was used to fine-tune Llama 3.1 70B and CodeLlama 34B via QLoRA adapters (1×NVIDIA H100 80GB, 3 epochs, bf16, learning rate ). On the 810-example test split:
- Base LLMs yield exact-match and directive-type match accuracy.
- Fine-tuned models correctly identify the primary directive in of cases.
- Exact pragma (directive + clauses + clause order + variable) accuracy is .
- Normalized Levenshtein similarity is approximately 0.79; mean Jaccard clause overlap is 0.69.
- Syntactic validation with
compilation reports an successfully compilable rate for LLM-generated pragmas vs. for human references.
Applications include legacy codebase modernization, performance-tuning assistants, and educational tooling for directive-based parallelism. The entire bundle (dataset, code, evaluation, model weights) is MIT-licensed and available on GitHub.
3. Licensing and Open Science Principles
Both datasets are released under open licensing: openACC under CC-BY 4.0 (allowing academic, industrial, regulatory reuse with attribution), ACCeLLiuM under the MIT license (enabling modification and redistribution in both research and commercial settings). This compliance with openData and open-source policy frameworks facilitates reproducibility, extension (including contribution back of new data/metrics), and community-driven benchmarking (Makridis et al., 2020, Jhaveri et al., 20 Sep 2025).
4. Research Frontiers and Open Questions
The openACC vehicle data highlights research threads around ACC system design, regulatory requirements (e.g., string stability, minimum headways, emergency response), improved microscopic traffic model calibration (addressing nonzero ACC delays and vehicle inertia), and mixed traffic safety/energy analyses. ACCeLLiuM enables inquiry into automated pragma synthesis, guides performance portability research across heterogeneous architectures, and exposes clause selection and ordering as a learning problem in LLM-based code generation.
Research directions suggested by openACC data include standardizing ACC functional requirements for energy/safety, adapting car-following models to capture empirical controller delays, quantifying the fleet-mix impact on network stability, and extending datasets to encompass automated emergency braking activation logs. ACCeLLiuM points toward improving automated tooling for directive insertion based on code context, clause reasoning, and benchmarking future LLMs on pragma generation (Makridis et al., 2020, Jhaveri et al., 20 Sep 2025).
5. Data Access and Community Engagement
Both openACC and ACCeLLiuM are publicly accessible via their respective download portals:
- openACC vehicle dataset: https://jrcbox.jrc.ec.europa.eu/index.php/s/K5M7hiG4YTqAaV1
- ACCeLLiuM OpenACC dataset and models: https://github.com/uci-accellium/accellium
Neither dataset requires registration. Both encourage further contribution of data and derived results to foster reference benchmarks for vehicle automation and directive-driven code parallelization, supporting robust cross-group comparison and sustained open-science momentum (Makridis et al., 2020, Jhaveri et al., 20 Sep 2025).