OpenACC Datasets: Car-following & Pragma-Loop

Updated 31 January 2026

OpenACC Dataset is a dual benchmark resource combining a car-following experiment database for ACC system analysis with a pragma-loop dataset for LLM fine-tuning.
The car-following database provides precise, high-frequency trajectory data for empirical evaluation of commercial ACC systems in both real-world and controlled settings.
The pragma-loop dataset aids automated GPU offloading in legacy C/C++ code via supervised fine-tuning, improving directive generation accuracy and performance portability.

OpenACC Dataset refers to two distinct, high-impact open-access resources: (1) the openACC car-following database developed within the European Commission Joint Research Centre’s openData initiative for empirical characterization of commercial Adaptive Cruise Control (ACC) systems (Makridis et al., 2020), and (2) the ACCeLLiuM OpenACC pragma-loop dataset, curated as a supervised fine-tuning corpus for LLMs tasked with automated generation of OpenACC GPU directives for data-parallel C/C++ code (Jhaveri et al., 20 Sep 2025). Both are benchmark datasets in their respective domains, supporting the empirical study of vehicle automation impacts and machine programming for parallel architectures.

1. openACC Car-Following Experiment Database

The openACC car-following experiments database provides transparent, reproducible longitudinal vehicle trajectory data captured in controlled and real-world traffic, explicitly designed for quantifying and comparing the operational properties of state-of-the-art commercial ACC systems. The initiative addresses the scarcity of open information on the heterogeneity and behavioral mechanisms of deployed ACC algorithms, supporting regulatory, microscopic modeling, and safety-oriented research (Makridis et al., 2020).

Experimental Campaigns and Data Acquisition

Three major campaigns (late 2018–2019) form the core of the database, involving 16 vehicles (11 ACC-equipped, 5 human-driven “leaders”):

Campaign	Test Setting	Vehicles (Types)	Duration/Route
N.1	Motorways	2–3-car platoons; human leaders (Fiat 500X, Volvo XC40)	2 days, 833 km
N.2	Motorways	5-car mixed platoons; 4 ACC models, 1 human (Mini Cooper)	3 days, 6.3 h
N.3	Proving ground	5 high-end ACC-equipped (Tesla, BMW, Mercedes, Audi)	2 days, 12.6 h, 5.7 km loop

Each vehicle is instrumented for full-trajectory capture: position (lat, lon, alt, ENU), speed ( $v(t)$ ), inter-vehicle spacing ( $s(t)$ ), driver/ACC engagement, collected at 10 Hz. Positioning technology includes u-blox GNSS (N.1, N.2) and Oxford RT-Range S differential GNSS (N.3), enabling precision up to 2 cm and speed accuracy of 0.02 m/s. When available, CAN/OBD interfaces provide ACC mode and pedal data.

Data Structure, Access, and Licensing

The data is organized into per-campaign directories comprising run-level CSV files (time, position, kinematics, driver mode), vehicle/campaign/specification metadata, and logs of key events. Each entry includes vehicle specifications (make, model, engine, battery, antenna offset), campaign context (dates, special events), and post-processing notes. Data are released under CC-BY 4.0 and can be accessed without registration via direct download (SCP/rsync or web interface).

Key Metrics and Control Models

Researchers have access to ground-truth variables for modeling:

Time-gap (headway) error:

$E_{\mathrm{gap}} = \frac{1}{T} \int_0^T [s(t) - s^*(v(t))]^2\,\mathrm{d}t$

where $s^*(v) = h_{\mathrm{set}}\,v$ (desired spacing at time-headway $h_{\mathrm{set}}$ ).

Idealized ACC PD response:

$a(t) = K_p[s(t)-s^*(v(t))] + K_d\,\dot s(t)$

Empirical response time (via cross-correlation):

$\tau^* = \arg\max_\tau\mathrm{Corr}(\Delta v(t-\tau),\,a_\mathrm{follow}(t))$

where $\Delta v(t) = v_\mathrm{lead}(t) - v_\mathrm{follow}(t)$ .

Comparative Findings and Research Themes

Noteworthy observations include:

Commercial ACCs maintain smooth accelerations inside $\pm1$ m/s² but amplify speed fluctuations (i.e., string instability) under perturbation, unlike human drivers who dampen such waves.
Empirical controller response times $\tau^*$ for ACCs (1.6–2.7 s) exceed typical human values (~1.1 s), contradicting assumptions of instantaneous response in simulation models.
Factory time-headway presets cluster between 1.0–1.4 s (min) and up to 2.5–3.5 s (max), with substantial inter-make variation.
Tractive energy consumption in ACC platoons is 2.7–20.5% higher than human equivalents, mainly due to overreaction and lack of downstream anticipation.
Surrogate safety indicators such as minimum Time-to-Collision remain favorable ( $\geq 4$ s) at minimal settings but degrade with platoon length due to string instability.

This dataset enables research on regulatory requirements (e.g., minimal safe headways, standardizing response metrics), improvements to microsimulation traffic models (e.g., incorporation of ACC delay, inertia), evaluation of energy efficiency strategies (e.g., cooperative-ACC or V2V communication), and analysis of automated emergency braking as fleet mix evolves (Makridis et al., 2020).

2. ACCeLLiuM OpenACC Pragma–Loop Dataset

The ACCeLLiuM OpenACC dataset provides a curated benchmark for supervised fine-tuning of LLMs to generate OpenACC directives for data-parallel C/C++ loops, targeting automated augmentation of legacy code with performance-critical GPU offloading annotations (Jhaveri et al., 20 Sep 2025).

Dataset Construction and Structure

The dataset comprises 4033 unique (pragma, loop) pairs where each entry consists of a minimal user code example (e.g., a serial or nested for-loop fragment after a <TARGET_PRAGMA_LOCATION> marker) and its corresponding, human-written OpenACC pragma. Original clause order and formatting are retained. Stratified by pragma “complexity” (number of clauses), the collection is partitioned into 3223 training and 810 testing instances. Each example is serialized as a JSONL chat-style prompt.

Extraction Pipeline

A three-phase pipeline was developed:

GitHub Mining: Queried for C/C++ files with OpenACC pragmas (#pragma acc loop and variants), yielding 1509 files and ≈30,749 candidates.
Syntactic Filtering: Employed Tree-sitter to parse and associate pragmas with the following data-parallel for loops. Exclusion criteria removed non-data-parallel/redundant code, reducing the corpus to 10,503 valid pairs.
Deduplication and Complexity Stratification: Eliminated duplicates and ensured a representative complexity spectrum (0–2 up to 11+ clauses per pragma), resulting in the final set of 4033 distinct examples.

Corpus Composition: Directives and Clauses

Directive-type and clause distributions for the full dataset are as follows:

Directive	Count	Percentage
loop	2,565	63.6 %
parallel	1,262	31.3 %
kernels	137	3.4 %
serial	22	0.5 %
other	47	1.2 %

Clause	Count	Percentage
copyin	2,900	71.9 %
copyout	2,650	65.7 %
present	2,350	58.2 %
reduction	1,600	39.7 %
collapse	1,000	24.8 %
gang	900	22.3 %
vector	600	14.9 %
private	400	9.9 %

The clause-distribution vector is:

$\{0.719,\,0.657,\,0.582,\,0.397,\,0.248,\,0.223,\,0.149,\,0.099\}$

for {copyin, copyout, present, reduction, collapse, gang, vector, private} respectively.

Use in LLM Supervised Fine-Tuning and Benchmarking

The ACCeLLiuM dataset was used to fine-tune Llama 3.1 70B and CodeLlama 34B via QLoRA adapters (1×NVIDIA H100 80GB, 3 epochs, bf16, learning rate $6\times 10^{-5}$ ). On the 810-example test split:

Base LLMs yield $\sim0\%$ exact-match and $<20\%$ directive-type match accuracy.
Fine-tuned models correctly identify the primary directive in $87\%$ of cases.
Exact pragma (directive + clauses + clause order + variable) accuracy is $50\%$ .
Normalized Levenshtein similarity is approximately 0.79; mean Jaccard clause overlap is 0.69.
Syntactic validation with $-acc$ compilation reports an $81\%$ successfully compilable rate for LLM-generated pragmas vs. $89\%$ for human references.

Applications include legacy codebase modernization, performance-tuning assistants, and educational tooling for directive-based parallelism. The entire bundle (dataset, code, evaluation, model weights) is MIT-licensed and available on GitHub.

3. Licensing and Open Science Principles

Both datasets are released under open licensing: openACC under CC-BY 4.0 (allowing academic, industrial, regulatory reuse with attribution), ACCeLLiuM under the MIT license (enabling modification and redistribution in both research and commercial settings). This compliance with openData and open-source policy frameworks facilitates reproducibility, extension (including contribution back of new data/metrics), and community-driven benchmarking (Makridis et al., 2020, Jhaveri et al., 20 Sep 2025).

4. Research Frontiers and Open Questions

The openACC vehicle data highlights research threads around ACC system design, regulatory requirements (e.g., string stability, minimum headways, emergency response), improved microscopic traffic model calibration (addressing nonzero ACC delays and vehicle inertia), and mixed traffic safety/energy analyses. ACCeLLiuM enables inquiry into automated pragma synthesis, guides performance portability research across heterogeneous architectures, and exposes clause selection and ordering as a learning problem in LLM-based code generation.

Research directions suggested by openACC data include standardizing ACC functional requirements for energy/safety, adapting car-following models to capture empirical controller delays, quantifying the fleet-mix impact on network stability, and extending datasets to encompass automated emergency braking activation logs. ACCeLLiuM points toward improving automated tooling for directive insertion based on code context, clause reasoning, and benchmarking future LLMs on pragma generation (Makridis et al., 2020, Jhaveri et al., 20 Sep 2025).

5. Data Access and Community Engagement

Both openACC and ACCeLLiuM are publicly accessible via their respective download portals:

openACC vehicle dataset: https://jrcbox.jrc.ec.europa.eu/index.php/s/K5M7hiG4YTqAaV1
ACCeLLiuM OpenACC dataset and models: https://github.com/uci-accellium/accellium

Neither dataset requires registration. Both encourage further contribution of data and derived results to foster reference benchmarks for vehicle automation and directive-driven code parallelization, supporting robust cross-group comparison and sustained open-science momentum (Makridis et al., 2020, Jhaveri et al., 20 Sep 2025).

Markdown Report Issue Upgrade to Chat

References (2)

openACC. An open database of car-following experiments to study the properties of commercial ACC systems (2020)

ACCeLLiuM: Supervised Fine-Tuning for Automated OpenACC Pragma Generation (2025)

Topic to Video (Beta)

No one has generated a video about this topic yet.

Whiteboard

No one has generated a whiteboard explanation for this topic yet.

Follow Topic

Get notified by email when new papers are published related to OpenACC Dataset.