RO-Crate Bundle: Standardized Research Package
- An RO-Crate Bundle is a self-contained archive that packages all digital research artefacts with a central JSON-LD metadata file to support FAIR principles.
- It aggregates diverse items such as data, code, workflows, and provenance details using a minimal Schema.org-based JSON-LD structure for efficient lookup and traceability.
- Its design enhances reproducibility and interoperability across repositories, and it is widely adopted in fields such as bioinformatics, regulatory science, and cultural heritage.
An RO-Crate Bundle is a standards-driven, lightweight, self-contained archive for packaging all the digital artefacts that collectively constitute a research outcome, together with structured, machine- and human-readable metadata. The RO-Crate approach formalizes the organization, annotation, and dissemination of research data, code, workflows, documentation, provenance, and external references. Central to its design is a single JSON-LD metadata file adhering to a minimal Linked Data profile of Schema.org, which supports the FAIR (Findable, Accessible, Interoperable, Reusable) principles while remaining accessible to both automated tools and human users. RO-Crate and its specialized profiles, such as Workflow Run RO-Crate, are actively adopted across bioinformatics, regulatory science, cultural heritage, and other domains to improve reproducibility and interoperability in computational research (Soiland-Reyes et al., 2021, Leo et al., 2023).
1. Definition, Scope, and Purpose
An RO-Crate Bundle, also referred to as an RO-Crate archive or package, is an archive containing all digital objects (files, directories, datasets, source code, workflow definitions, reports, documentation, or external IRIs) pertinent to a specific research output. Unlike simple archival (e.g., zipping a directory), an RO-Crate Bundle must include a metadata descriptor file, ro-crate-metadata.json, that explicitly and formally describes:
- The bundle as a “root data entity,” typically of type schema:Dataset
- All internal data entities (files and directories), as well as any referenced external resources
- Contextual entities including people (schema:Person), organizations, licenses, places, and instruments
- Relations, including composition (hasPart), subject (about), annotations, and provenance attributes (e.g., datePublished, creator, softwareUsed)
This model is designed for both broad interoperability and streamlined consumption, allowing bundles to be published or deposited in generic and discipline-specific repositories (e.g., Zenodo, GitHub Pages, WorkflowHub, PARADISEC) and stored using various packaging conventions (ZIP, BagIt, OCFL, or version control such as Git) (Soiland-Reyes et al., 2021).
2. Metadata Model and Logical Structure
The core metadata model of RO-Crate uses JSON-LD, flattened into a single @graph array in which every entity carries its own @id. Because entities are never nested, a tool can index the array by @id once and then resolve any reference in O(1) time. The model selects a controlled subset of Schema.org/RDF terms and is guided by the principle of “just enough” Linked Data: simple enough for researchers to use directly, but sufficiently expressive for automated workflows and Linked Data systems.
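The lookup behaviour follows from indexing the flat @graph by @id. The following is a minimal sketch under that assumption; the two-entity graph and the helper names `build_index` and `resolve` are illustrative and not part of any RO-Crate library.

```python
# Hypothetical two-entity graph, used only for illustration.
metadata = {
    "@context": "https://w3id.org/ro/crate/1.1/context",
    "@graph": [
        {"@id": "./", "@type": "Dataset", "author": {"@id": "#alice"}},
        {"@id": "#alice", "@type": "Person", "name": "Alice Example"},
    ],
}


def build_index(doc):
    """Map each entity's @id to the entity itself (one pass over @graph)."""
    return {entity["@id"]: entity for entity in doc["@graph"]}


def resolve(index, reference):
    """Follow a {"@id": ...} reference; return None if it is not described."""
    return index.get(reference["@id"])


index = build_index(metadata)
author = resolve(index, index["./"]["author"])
print(author["name"])
```

Building the index costs one pass over the graph; each later dereference is a single dictionary access.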
Key Structural Elements
- @context: Specifies the RO-Crate mapping (typically https://w3id.org/ro/crate/1.1/context), assigning compact keys to IRIs.
- Root Data Entity: The bundle root, always typed as schema:Dataset.
- Data Entities: Files, code, workflows, or external references described via types such as schema:File, schema:MediaObject, schema:ComputationalWorkflow, etc.
- Contextual Entities: People (ORCID), organizations (ROR), licenses (SPDX/Creative Commons PID), and places (GeoNames).
- Core Properties: hasPart, about, mentions, creator/author, datePublished, license, contentLocation, conformsTo.
- Formal Semantics: The minimal requirements and entity relationships are defined using first-order logic.
Under these definitions, a minimal RO-Crate must describe its own metadata descriptor and the root data entity, and may describe zero or more data entities and zero or more contextual entities (Soiland-Reyes et al., 2021).
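The minimal requirement can be checked mechanically. The sketch below is a hand-written check, not a substitute for the formal definitions or for a validator: it locates the metadata descriptor, follows its `about` reference, and confirms the target is a Dataset. The function name `find_root` is an assumption made for illustration.

```python
def find_root(graph):
    """Return the root data entity, located via the metadata descriptor's `about`."""
    by_id = {entity["@id"]: entity for entity in graph}

    descriptor = by_id.get("ro-crate-metadata.json")
    if descriptor is None:
        raise ValueError("metadata descriptor is missing")

    root_id = descriptor.get("about", {}).get("@id")
    root = by_id.get(root_id)
    if root is None:
        raise ValueError("descriptor does not point at a described root entity")

    # @type may be a single string or a list of strings in JSON-LD.
    types = root.get("@type")
    types = types if isinstance(types, list) else [types]
    if "Dataset" not in types:
        raise ValueError("root data entity must be a Dataset")
    return root


# Smallest graph that passes: descriptor plus root, no data or contextual entities.
minimal_graph = [
    {"@id": "ro-crate-metadata.json", "@type": "CreativeWork", "about": {"@id": "./"}},
    {"@id": "./", "@type": "Dataset"},
]
print(find_root(minimal_graph)["@id"])
```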
3. Provenance, Persistent Identifiers, and Relations
RO-Crate Bundles encode relationships and provenance using explicit Schema.org predicates and IRIs. This practice fosters robust traceability and reliable provenance analysis.
- Entity Identification: All entities are addressable using @id: relative paths for bundled files, absolute IRIs or PIDs (e.g., ORCID, DOI, ROR, SPDX) for external resources.
- Provenance Attributes: Employ properties such as datePublished, creator, author, contributor, softwareUsed, contentLocation, and spatialCoverage.
- Annotations: Metadata such as name, description, keywords, and citation are included directly in the graph.
- Extended Provenance: Profiles like Workflow Run RO-Crate (WRROC) integrate additional terms for prospective and retrospective computational workflow provenance. These enable detailed linkage of CreateAction activities (execution), SoftwareApplication, workflow parameters, inputs, outputs, and relationships between workflow plan and execution instances (Leo et al., 2023).
- Alignment with W3C PROV: Entities and relations are directly mapped to PROV-O concepts via SKOS. For example:
| RO-Crate term | SKOS relation | PROV-O term |
|----------------------------|---------------|---------------------|
| schema:CreateAction | broaderMatch | prov:Activity |
| schema:Person | exactMatch | prov:Person |
| schema:SoftwareApplication | relatedMatch | prov:SoftwareAgent |
| schema:MediaObject | broaderMatch | prov:Entity |
| CreateAction → object | exactMatch | prov:used |
| CreateAction → result | closeMatch | prov:wasGeneratedBy |
- Formal Provenance Mappings in WRROC: The profile also states these mappings formally, over actions, entities, agents, and parameter connections.
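A tool that exports PROV could carry the SKOS alignment table as a plain lookup and decide how loose a match it is willing to accept. This is only a sketch: the dotted keys `CreateAction.object` and `CreateAction.result` are an ad hoc notation for the two property rows, and `prov_term` is a hypothetical helper.

```python
# SKOS alignments transcribed from the table: term -> (SKOS relation, PROV-O term).
PROV_ALIGNMENT = {
    "schema:CreateAction": ("broaderMatch", "prov:Activity"),
    "schema:Person": ("exactMatch", "prov:Person"),
    "schema:SoftwareApplication": ("relatedMatch", "prov:SoftwareAgent"),
    "schema:MediaObject": ("broaderMatch", "prov:Entity"),
    "CreateAction.object": ("exactMatch", "prov:used"),
    "CreateAction.result": ("closeMatch", "prov:wasGeneratedBy"),
}


def prov_term(term, accepted=("exactMatch", "closeMatch")):
    """Return the aligned PROV-O term, or None if the match is looser than accepted."""
    relation, target = PROV_ALIGNMENT.get(term, (None, None))
    return target if relation in accepted else None


print(prov_term("schema:Person"))
print(prov_term("schema:CreateAction"))
print(prov_term("schema:CreateAction", accepted=("exactMatch", "broaderMatch")))
```

Keeping the SKOS relation next to each target lets a caller distinguish an exact equivalence from a broader or merely related concept instead of treating every row as interchangeable.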
4. Archive Structure, Best Practices, and Example Manifest
Each RO-Crate Bundle possesses a predictable directory structure:
- ro-crate-metadata.json: Required, single JSON-LD file, containing the entire metadata graph.
- (Optional) ro-crate-preview.html: Automatically-generated HTML synopsis for human inspection.
- Payload Files: Data files, directories, scripts, documentation, and optionally external references, all declared in the manifest.
- Packaging Formats: Supported as plain directories, ZIP, BagIt, OCFL, or Git-managed repositories.
Best practices emphasize one @id per entity, no blank nodes, flattened and compacted JSON-LD, and the inclusion of only necessary Schema.org terms. Minimal bundles require name, description, datePublished, and license on the root dataset. Extensions are allowed via profiles but must be machine- and human-readable.
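Several of these practices are easy to lint. The sketch below checks only three of them (unique @id values, no blank nodes, and the four root properties named above); `lint_graph` is an illustrative name, and the check assumes the root data entity has the @id "./".

```python
REQUIRED_ROOT_PROPERTIES = ("name", "description", "datePublished", "license")


def lint_graph(graph, root_id="./"):
    """Return a list of human-readable problems; an empty list means no findings."""
    problems = []
    seen = set()
    for entity in graph:
        entity_id = entity.get("@id")
        # Blank nodes either lack an @id or use the "_:" prefix.
        if entity_id is None or entity_id.startswith("_:"):
            problems.append(f"blank node or missing @id: {entity!r}")
            continue
        if entity_id in seen:
            problems.append(f"duplicate @id: {entity_id}")
        seen.add(entity_id)
        if entity_id == root_id:
            for prop in REQUIRED_ROOT_PROPERTIES:
                if prop not in entity:
                    problems.append(f"root dataset lacks {prop}")
    return problems


# Hypothetical graph with three deliberate defects.
sample = [
    {"@id": "./", "@type": "Dataset", "name": "Demo", "description": "Demo crate",
     "datePublished": "2021-11-02"},
    {"@id": "survey.csv", "@type": "File"},
    {"@id": "survey.csv", "@type": "File"},
    {"@id": "_:b0", "@type": "Person"},
]
for problem in lint_graph(sample):
    print(problem)
```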
A representative minimal manifest:
{
"@context": "https://w3id.org/ro/crate/1.1/context",
"@graph": [
{
"@id": "ro-crate-metadata.json",
"@type": "CreativeWork",
"conformsTo": { "@id": "https://w3id.org/ro/crate/1.1" },
"about": { "@id": "./" }
},
{
"@id": "./",
"@type": "Dataset",
"name": "A simplified RO-Crate",
"description": "Example of a minimal research dataset bundle",
"datePublished": "2021-11-02T16:04:43Z",
"author": { "@id": "#alice" },
"license": { "@id": "https://spdx.org/licenses/CC-BY-4.0" },
"hasPart": [
{ "@id": "survey.csv" },
{ "@id": "https://example.com/pics/5707039334816454031_o.jpg" }
]
},
{ "@id": "survey.csv", "@type": "File", "name": "Survey responses", "author": { "@id": "#alice" } },
{ "@id": "https://example.com/pics/5707039334816454031_o.jpg", "@type": "File" },
{ "@id": "#alice", "@type": "Person", "name": "Alice Example" }
]
}
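Reading such a manifest back is equally simple. The sketch below parses a trimmed copy of the example and lists the root dataset's parts, treating any @id that carries a URI scheme as an external reference. That scheme test is a simplifying assumption, and `is_external` and `list_parts` are illustrative names.

```python
import json
from urllib.parse import urlparse

# Trimmed copy of the example manifest; only what the traversal needs.
MANIFEST = """
{
  "@context": "https://w3id.org/ro/crate/1.1/context",
  "@graph": [
    {"@id": "ro-crate-metadata.json", "@type": "CreativeWork", "about": {"@id": "./"}},
    {"@id": "./", "@type": "Dataset", "hasPart": [
      {"@id": "survey.csv"},
      {"@id": "https://example.com/pics/5707039334816454031_o.jpg"}
    ]}
  ]
}
"""


def is_external(entity_id):
    """Treat an @id with a URI scheme (https:, doi:, ...) as an external reference."""
    return bool(urlparse(entity_id).scheme)


def list_parts(doc, root_id="./"):
    """Return (id, 'local' | 'external') pairs for the root dataset's hasPart."""
    by_id = {entity["@id"]: entity for entity in doc["@graph"]}
    parts = by_id[root_id].get("hasPart", [])
    if isinstance(parts, dict):  # a single reference may appear without a list
        parts = [parts]
    return [
        (part["@id"], "external" if is_external(part["@id"]) else "local")
        for part in parts
    ]


for part_id, kind in list_parts(json.loads(MANIFEST)):
    print(kind, part_id)
```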
5. Extended Use: Workflow Provenance Bundles
Specialized profiles such as Workflow Run RO-Crate (WRROC) provide granular support for computational workflow plans and their execution provenance. WRROC distinguishes three levels:
- Process Run Crate: Describes single tool executions as black-box CreateAction events, with agent, instrument, and inputs/outputs.
- Workflow Run Crate: Adds a prospective (plan) side, with ComputationalWorkflow, explicit input/output parameters (FormalParameter), and links to runtime values (exampleOfWork).
- Provenance Run Crate: Adds fine-grained step-level detail, with each workflow step as HowToStep, executions as ControlAction, and parameter connections mapped explicitly.
These design patterns facilitate comprehensive, cross-system interoperability. For instance, both Galaxy and StreamFlow can emit WRROC bundles encoding the same run, enabling direct, automated comparison and re-execution (Leo et al., 2023).
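At the Process Run Crate level, a single execution might be recorded roughly as follows. All identifiers here (`#sort-tool`, `#run-1`, the file names) are hypothetical, the entities are reduced to the properties mentioned above, and `describe_runs` is an illustrative helper rather than part of any profile tooling.

```python
# Hypothetical graph fragment: one tool, one agent, one input, one output, one run.
run_graph = [
    {"@id": "#sort-tool", "@type": "SoftwareApplication", "name": "sort"},
    {"@id": "#alice", "@type": "Person", "name": "Alice Example"},
    {"@id": "unsorted.txt", "@type": "File"},
    {"@id": "sorted.txt", "@type": "File"},
    {
        "@id": "#run-1",
        "@type": "CreateAction",
        "instrument": {"@id": "#sort-tool"},   # what was executed
        "agent": {"@id": "#alice"},            # who ran it
        "object": [{"@id": "unsorted.txt"}],   # inputs
        "result": [{"@id": "sorted.txt"}],     # outputs
    },
]


def describe_runs(graph):
    """Summarize each CreateAction as tool name plus input and output ids."""
    by_id = {entity["@id"]: entity for entity in graph}
    runs = []
    for entity in graph:
        if entity.get("@type") == "CreateAction":
            runs.append({
                "tool": by_id[entity["instrument"]["@id"]]["name"],
                "inputs": [ref["@id"] for ref in entity.get("object", [])],
                "outputs": [ref["@id"] for ref in entity.get("result", [])],
            })
    return runs


print(describe_runs(run_graph))
```

Because the run is expressed only through references between flat entities, two engines that emit the same shape can be compared by comparing these summaries.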
6. Creation, Validation, Tooling, and Real-World Adoption
Authoring an RO-Crate typically involves:
- Assembling the resource folder (data, code, workflows, docs).
- Generating ro-crate-metadata.json using tools such as:
  - ro-crate-py (Python)
  - ro-crate-js (Node.js)
  - ro-crate-ruby (Ruby)
  - Describo (GUI)
- Adding contextual entities (ORCID, ROR, SPDX) and annotating all files/entities.
- Validating via built-in toolchain functionality or CheckMyCrate, especially for profile-conformance (e.g., Workflow Testing Profile).
- Generating a preview HTML file for human review.
- Packaging and depositing in relevant repositories; assigning a PID for citation.
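The first, second, and last steps can be sketched with the Python standard library alone. This is a hand-rolled illustration, not how ro-crate-py or the other listed tools work internally; `write_crate` and `zip_crate` are hypothetical names, contextual entities and validation are omitted, and file paths are written as-is without the URL-encoding a real tool would apply.

```python
import json
import zipfile
from datetime import date
from pathlib import Path

METADATA_FILE = "ro-crate-metadata.json"
CONTEXT = "https://w3id.org/ro/crate/1.1/context"


def write_crate(folder, name, description, license_id):
    """Describe every file under `folder` and write the metadata file into it."""
    folder = Path(folder)
    files = sorted(
        p.relative_to(folder).as_posix()
        for p in folder.rglob("*")
        if p.is_file() and p.name != METADATA_FILE
    )
    graph = [
        {
            "@id": METADATA_FILE,
            "@type": "CreativeWork",
            "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
            "about": {"@id": "./"},
        },
        {
            "@id": "./",
            "@type": "Dataset",
            "name": name,
            "description": description,
            "datePublished": date.today().isoformat(),
            "license": {"@id": license_id},
            "hasPart": [{"@id": f} for f in files],
        },
    ]
    graph.extend({"@id": f, "@type": "File"} for f in files)
    doc = {"@context": CONTEXT, "@graph": graph}
    (folder / METADATA_FILE).write_text(json.dumps(doc, indent=2), encoding="utf-8")
    return doc


def zip_crate(folder, archive_path):
    """Package the crate as a ZIP; `archive_path` should lie outside `folder`."""
    folder = Path(folder)
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(folder.rglob("*")):
            if path.is_file():
                archive.write(path, path.relative_to(folder).as_posix())
```

A call such as `write_crate("my-crate", "Demo", "Demo crate", "https://spdx.org/licenses/CC-BY-4.0")` followed by `zip_crate("my-crate", "my-crate.zip")` would produce an archive with the metadata file at its root, ready for annotation, validation, and deposit.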
Two design choices help crates scale: the flattened JSON-LD manifest stays simple to parse however many artefacts it describes, and large payloads can be referenced as external IRIs rather than embedded in the archive.
RO-Crate is widely implemented across:
- Workflow systems (Galaxy, COMPSs, StreamFlow, WfExS-backend, Sapporo, Autosubmit)
- Digital repositories (Zenodo, Dataverse, WorkflowHub, PARADISEC)
- Data management plan mapping (RDA maDMP)
- Regulatory and provenance tracking (FDA BioCompute Objects, CPM profiles)
- Institutional data platforms (Harvard Data Commons)
Representative case studies include large-scale genomics workflows (ELIXIR, EOSC-Life), digital image analysis pipeline comparison, cultural heritage archiving (PARADISEC), and distributed AI model training with provable provenance and reproducibility (Soiland-Reyes et al., 2021, Leo et al., 2023).
7. Significance and Impact
The formalism and adoption of RO-Crate Bundles mark a significant advance in the packaging and sharing of research artefacts. By combining minimal, profile-driven Linked Data semantics with a single JSON-LD manifest and predictable directory conventions, RO-Crate delivers practical machine-actionability, supporting repeatability, transparency, and cross-repository interoperability. The active development of specialized profiles, such as Workflow Run RO-Crate, demonstrates the model’s extensibility and capacity to meet evolving demands in computational provenance and data stewardship in heterogeneous scientific domains (Soiland-Reyes et al., 2021, Leo et al., 2023).