
Collaborative Research Protocol (OSP)

Updated 24 November 2025
  • Collaborative Research Protocol (OSP) is a comprehensive framework combining architectural designs, APIs, and metadata standards to support secure, reproducible, and scalable scientific collaborations.
  • It employs federated workflow orchestration, robust data versioning, and provenance tracking to enhance scientific credibility and traceability.
  • OSP integrates advanced security, privacy, and tool integration practices to lower discovery barriers and facilitate efficient data sharing on distributed computational platforms.

A Collaborative Research Protocol (OSP) denotes the comprehensive suite of architectural frameworks, workflow conventions, APIs, metadata standards, and reproducibility policies that collectively enable multidisciplinary research teams to jointly ingest, analyze, annotate, share, version, cite, and publish scientific data and derived artifacts on open science platforms. OSPs are designed to maximize scientific credibility, lower barriers to discovery, and ensure robust, reproducible outcomes across distributed computational resources. Implementations such as OSPREY for epidemic modeling (Collier et al., 2023), the Open Science Platform architecture for next-generation Dataverse (Sweeney et al., 2015), and TELOS Collaboration’s protocol for reproducible lattice data analysis (Bennett, 2 Apr 2025) illustrate core requirements and operational best practices for OSPs.

1. Architectural Components and Workflow Orchestration

A typical OSP integrates federated, algorithm-driven workflow fabrics spanning heterogeneous computational sites. High-level components include a Model Exploration module (“the algorithm”), a task distribution layer (e.g., funcX), relational task databases (e.g., EMEWS DB), execution pilots (Swift/T MPI, Python/R pools), and wide-area data sharing services (ProxyStore, Globus) (Collier et al., 2023). Within the Dataverse OSP, the architecture is tiered: Public Tier (UI/API), Operations Tier (policy enforcement, scheduling, provenance), and Data Tier (federated storage, secure views) (Sweeney et al., 2015). TELOS’s approach emphasizes deeply versioned git repositories, fine-grained workflow managers (Snakemake), and Conda environments for reproducibility (Bennett, 2 Apr 2025).

Workflow protocols decouple producers and consumers using durable SQL-based output queues, asynchronous APIs for “fast time-to-solution” algorithms, batch and threshold tuning for back-pressure management, and elastic execution pools that scale in response to algorithmic demand. Control messages traverse secure central services (OAuth2 + TLS), while code execution and data transfer are site-to-site to eliminate bottlenecks (Collier et al., 2023).
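The batch-and-threshold fetch pattern described above can be sketched against an ordinary SQL table standing in for the durable output queue. The schema, table name, and parameter values here are illustrative assumptions, not the actual EMEWS DB layout:

```python
import sqlite3

# In-memory stand-in for a durable SQL-backed task queue (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tasks (id INTEGER PRIMARY KEY, payload TEXT, "
    "status TEXT DEFAULT 'queued')"
)
conn.executemany("INSERT INTO tasks (payload) VALUES (?)",
                 [(f"task-{i}",) for i in range(10)])

B, T = 4, 2  # batch size and back-pressure threshold

def fetch_batch(conn, in_flight):
    """Fetch up to B queued tasks, but only when in-flight work is below T."""
    if in_flight >= T:
        return []  # back-pressure: the pool already has enough work
    rows = conn.execute(
        "SELECT id, payload FROM tasks WHERE status = 'queued' LIMIT ?", (B,)
    ).fetchall()
    conn.executemany("UPDATE tasks SET status = 'running' WHERE id = ?",
                     [(r[0],) for r in rows])
    return rows

batch = fetch_batch(conn, in_flight=0)
```

Because claimed tasks are marked `running` in the same durable store, a pool that disconnects and reconnects can resume without losing or duplicating work.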

2. Data Ingestion, Versioning, and Provenance

OSP protocols unify the collection of primary data (files, streams, instruments, external repositories) and construction of private or shared workspaces (“Dataverses,” “studies”) (Sweeney et al., 2015). Data ingestion endpoints expose RESTful APIs for file and stream uploads, automated metadata extraction (quantitative, structural, semantic), and privacy-level assignment via policy wizards (six-level taxonomy) (Sweeney et al., 2015).

Versioning is fundamental: each immutable snapshot receives a persistent identifier (Handle, DOI) and is linked into a derivation chain. In OSPREY, experiment code and payloads are tagged with git commit hashes; pipelines record origin timestamps, versions, and checksums; and all tasks and results persist in the database for durable provenance (Collier et al., 2023). TELOS further requires that final published CSVs be machine-readable and that every number quoted in manuscripts be defined in auto-generated LaTeX files to guarantee precision and auditability (Bennett, 2 Apr 2025).
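A minimal provenance record combining these elements might look like the following; the field names are illustrative assumptions, not the actual OSPREY database schema:

```python
import hashlib
import time

def provenance_record(payload: bytes, code_commit: str, origin: str) -> dict:
    """Build a minimal provenance record for one artifact: a content
    checksum, the git commit tag of the producing code, the origin stage,
    and a timestamp (field names are illustrative)."""
    return {
        "sha256": hashlib.sha256(payload).hexdigest(),
        "code_commit": code_commit,
        "origin": origin,
        "recorded_at": time.time(),
    }

rec = provenance_record(b"beta,0.25\n", code_commit="a1b2c3d", origin="ingest")
```

Persisting such records alongside each task result is what makes later lineage queries possible.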

The W3C PROV-DM model structures fine-grained provenance, capturing Entities (datasets, files), Activities (ingest, transform, analysis), and Relations (wasGeneratedBy, used, wasDerivedFrom). Provenance queries traverse the lineage from derived products to their origin datasets (Sweeney et al., 2015).
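A lineage query of this kind reduces to walking `wasDerivedFrom` edges from a derived product back to its origin datasets. The sketch below models only Entities and that one relation; a real deployment would use a full PROV serialization with Activities and the other relation types as well:

```python
from dataclasses import dataclass, field

@dataclass
class Entity:
    """A PROV-DM Entity with its wasDerivedFrom parents (toy model)."""
    id: str
    derived_from: list = field(default_factory=list)

def lineage(entity, seen=None):
    """Collect entity IDs by walking wasDerivedFrom edges back to origins."""
    seen = seen if seen is not None else []
    seen.append(entity.id)
    for parent in entity.derived_from:
        lineage(parent, seen)
    return seen

raw = Entity("raw-dataset")
cleaned = Entity("cleaned", derived_from=[raw])
figure = Entity("figure-3", derived_from=[cleaned])
```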

3. Security, Privacy, and Access Control

Security mechanisms in OSPs are explicit and multimodal. funcX endpoints and OSP user interfaces authenticate via OAuth2 tokens and Shibboleth, with support for two-factor authentication to ensure compliance with dataset privacy levels. Data and control traffic use HTTPS/TLS, and site-specific VPNs or SSH tunnels protect access to private database instances (Collier et al., 2023; Sweeney et al., 2015). OSP platforms enforce the Secure Views Model: for each request, access policy is defined by

$$\mathit{Allow}(u, D, V) \;=\; [\mathit{Clearance}(u) \ge \ell] \wedge [\mathit{DUA}(u, D) = \mathit{true}]$$

where $\ell$ is the policy level assigned to the dataset, $\mathit{Clearance}(u)$ is the user's privilege level, and $\mathit{DUA}(u, D)$ indicates completion of a data use agreement (Sweeney et al., 2015).
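The predicate translates directly into code. In this sketch, dicts stand in for the platform's clearance and DUA stores, which would be backed by real policy services:

```python
def allow(user, dataset, clearances, duas):
    """Secure Views check: grant access iff Clearance(u) >= the dataset's
    policy level AND the user has a completed data use agreement.
    The dict-based stores are illustrative stand-ins."""
    cleared = clearances.get(user, 0) >= dataset["level"]
    signed = duas.get((user, dataset["id"]), False)
    return cleared and signed

study = {"id": "study-7", "level": 4}
granted = allow("alice", study, clearances={"alice": 5},
                duas={("alice", "study-7"): True})
denied = allow("bob", study, clearances={"bob": 5}, duas={})
```

Note that both conditions are necessary: a sufficiently cleared user without a signed DUA is still denied.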

FERPA constraints, IRB/DUA templates, and automated privacy certification (Privacert-style) reinforce compliance for sensitive human subject data. Role-based access control is mapped to experiment IDs or user groups, with per-endpoint permissions enforced in the funcX service (Collier et al., 2023).

4. Reproducibility, Publication, and Licensing

Reproducibility is non-negotiable. TELOS defines “reproducible” data as those allowing another researcher to obtain identical results from the same datasets and analysis (Bennett, 2 Apr 2025). Protocols require persistent identifiers (DOIs) for all data and workflow releases, narrative and data under Creative Commons Attribution 4.0 (CC BY), and analysis/workflow code under GPL v3.0 or MIT as agreed (Bennett, 2 Apr 2025). The OSP citation conventions follow the Altman-King standard: Author(s), Year, “Title”, PersistentURL, UNF:Fingerprint, Version (Sweeney et al., 2015).
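A citation in that field order can be assembled mechanically; the exact punctuation below is an assumption for illustration, not the normative specification:

```python
def altman_king_citation(authors, year, title, url, unf, version):
    """Assemble a data citation in the Altman-King field order:
    Author(s), Year, "Title", PersistentURL, UNF:Fingerprint, Version.
    Punctuation details here are illustrative."""
    return f'{authors}, {year}, "{title}", {url}, UNF:{unf}, {version}'

cite = altman_king_citation("Doe, J.", 2024, "Example Study",
                            "https://doi.org/10.0000/example", "abc123=", "V2")
```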

The Universal Numeric Fingerprint (UNF) is calculated as a digest of normalized quantitative tables, $\mathrm{UNF}(D) = \mathrm{Base64}\bigl(\mathrm{SHA256}(\mathrm{norm}(D))\bigr)$, ensuring that any content-altering transformation produces a new fingerprint (Sweeney et al., 2015).
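A simplified sketch of this fingerprint follows. The normalization step here (canonical exponential formatting of floats, comma-joined rows) is a stand-in for the full UNF specification, which defines normalization in much more detail:

```python
import base64
import hashlib

def unf(rows):
    """UNF-style fingerprint: Base64(SHA256(norm(D))). The normalization
    used here is a simplified stand-in for the real UNF spec."""
    norm = "\n".join(
        ",".join(f"{v:+.9e}" if isinstance(v, float) else str(v) for v in row)
        for row in rows
    )
    return base64.b64encode(hashlib.sha256(norm.encode()).digest()).decode()

f1 = unf([(1, 2.0), (3, 4.0)])
f2 = unf([(1, 2.0), (3, 4.5)])  # content change -> different fingerprint
```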

Workflows and data releases are packaged for Zenodo (InvenioRDM). CSVs are kept separate, ZIP archives are used for raw logs (with their own README files), and HDF5 for repackaged data. Automated LaTeX definitions and deterministic RNG seeding from metadata further enforce reproducibility (Bennett, 2 Apr 2025).
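Deterministic seeding from metadata can be sketched as hashing a canonical serialization of the run's metadata into an integer seed; this illustrates the idea and is not the TELOS implementation:

```python
import hashlib
import json
import random

def seed_from_metadata(meta: dict) -> int:
    """Derive an RNG seed deterministically from run metadata, so reruns
    with identical metadata reproduce the same random stream (a sketch)."""
    canonical = json.dumps(meta, sort_keys=True).encode()  # key-order invariant
    return int.from_bytes(hashlib.sha256(canonical).digest()[:8], "big")

rng = random.Random(seed_from_metadata({"ensemble": "B1", "beta": 6.9}))
```

Sorting keys before hashing ensures that logically identical metadata dicts yield the same seed regardless of insertion order.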

5. Fault-Tolerance, Scalability, and Workflow Automation

OSP platforms are architected for scalable, fault-tolerant computation across distributed resources. Durable queues ensure task persistence across failures, automatic retry logic restores operation upon endpoint or pilot reconnection, and checkpointing is available for long-running stateful pilots (Collier et al., 2023). Elasticity is intrinsic: worker pools can be dynamically scaled with no interruption or restart required for the algorithm controller.
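The automatic-retry behavior can be sketched as a wrapper that re-runs a failed task with backoff; the function and parameter names are illustrative, not the OSPREY API:

```python
import time

def run_with_retries(task, fn, max_retries=3, backoff=0.0):
    """Re-run fn(task) up to max_retries times, sleeping between attempts,
    mirroring the retry-on-reconnect behavior described above (sketch)."""
    for attempt in range(1, max_retries + 1):
        try:
            return fn(task)
        except Exception:
            if attempt == max_retries:
                raise  # exhausted retries: surface the failure
            time.sleep(backoff * attempt)

calls = []
def flaky(task):
    """A task that fails twice (e.g., endpoint disconnect) then succeeds."""
    calls.append(task)
    if len(calls) < 3:
        raise RuntimeError("endpoint disconnected")
    return f"{task}: done"

result = run_with_retries("sim-42", flaky)
```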

Back-pressure is managed by batch size ($B$) and threshold ($T$) parameters in task querying: pools avoid both starvation and overfetching, stabilizing resource utilization under I/O variability (Collier et al., 2023). Formal scheduling invariants constrain pools:

$$\sum_{\substack{t \in T_r,\ \mathrm{status}(t) = \mathrm{running}}} 1 \;\leq\; C_r$$

where $T_r$ is the task set assigned to pool $r$ and $C_r$ is its worker capacity. Surrogate-driven prioritization solves

$$\max_{\sigma \in \Sigma} \sum_{t \in U} p_{\mathrm{new}}(t), \quad \text{s.t.} \quad p_{\mathrm{new}}(t) = f_{\mathrm{surrogate}}(\mathrm{history},\, t)$$

for ranking and steering behavior (Collier et al., 2023).
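A toy version of surrogate-driven prioritization ranks unevaluated tasks by a score predicted from history. The linear "surrogate" below (similarity to the best past parameter) is a placeholder for whatever model $f_{\mathrm{surrogate}}$ a deployment actually trains:

```python
def f_surrogate(history, task):
    """Toy surrogate: predict a task's value as negative distance to the
    best-scoring parameter seen so far (placeholder for a trained model)."""
    best = max(history, key=lambda h: h["score"])
    return -abs(task["param"] - best["param"])

def prioritize(history, unevaluated):
    """Rank unevaluated tasks by predicted value, highest first."""
    return sorted(unevaluated, key=lambda t: f_surrogate(history, t),
                  reverse=True)

history = [{"param": 0.2, "score": 1.0}, {"param": 0.8, "score": 3.5}]
queue = prioritize(history, [{"param": 0.1}, {"param": 0.75}, {"param": 0.5}])
```

The controller would then dispatch tasks from the front of `queue`, re-ranking as new results update the history.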

TELOS is advancing continuous integration via DVC and GitHub Actions, workflow automation via job coordinators (PLATESPINNER), and provenance tracking with W3C PROV schemas (Bennett, 2 Apr 2025).

6. API Specifications, Metadata, and Tool Integration

Protocols prescribe HTTP/REST APIs throughout, supporting chunked transfers, WebSockets, and HTTP long polls for large data. Data formats include CSV, TSV, XML, JSON (tabular); GraphML (graphs); TIFF/JPEG2000 (images); GeoJSON and Shapefiles (maps) (Sweeney et al., 2015). Metadata schemas span DDI, Dublin Core, MARC, PROV-XML, and OAI-PMH for harvesting and interoperability.

Tool integration uses language-agnostic APIs, e.g., R/Zelig, Hadoop, Python, MapD, and supports in-DB analytics with MapReduce-style or SQL kernels (Sweeney et al., 2015). Results are published with complete metadata, provenance, and landing pages indexed for searchability and citation.

7. Protocol Templates, Guidelines, and Best Practices

Templates and checklists are integral to protocol adoption. TELOS provides author.xml for INSPIRE metadata, workflow repository skeletons (.gitignore, .pre-commit-config.yaml, Snakefile, CITATION.cff), and comprehensive journal article checklists. Recommended practices include peer code review, test runs by independent collaborators, provenance embedding of commit IDs and parameters, deterministic random seed generation, and the use of continuous integration and artifact versioning systems (Bennett, 2 Apr 2025).

A plausible implication is that robust protocol adherence facilitates the extension of OSPs to diverse scientific domains by substituting appropriate data types, workflow schemas, and backend transport mechanisms while preserving secure, asynchronous, and reproducible orchestration.


By formalizing and operationalizing these components, Collaborative Research Protocols deliver the reproducibility, scalability, and organizational clarity required for modern open science collaborations across large-scale computational environments.
