Descriptor-Driven Workflows

Updated 1 April 2026

Descriptor-driven workflows are computational paradigms that externalize workflow logic, configuration, and metadata into structured, machine-readable descriptors, which keeps components modular and the system architecture explicit.
They leverage formal ontologies and descriptor schemas to map high-level specifications to executable components, supporting multi-layer descriptions, dynamic reconfiguration, and precise provenance tracking.
These workflows enable robust and scalable management of distributed tasks across diverse domains—from scientific computing to federated learning and audio synthesis—by decoupling control flow from hard-coded implementations.

A descriptor-driven workflow is a computational or procedural paradigm in which all workflow logic, structure, control flow, configuration, and metadata are externalized and governed by machine-readable descriptors. These descriptors, typically structured as JSON, YAML, RDF, or similar artifacts, specify the components, data dependencies, execution environments, and operational logic required to instantiate, adapt, and execute workflows. Application domains include scientific computing, federated learning, audio synthesis, data engineering, and atomically resolved microscopy. By shifting workflow orchestration from hard-coded logic to abstract, declarative descriptors, descriptor-driven systems achieve modularity, reconfigurability, interpretability, and, in many implementations, compliance with the FAIR (Findable, Accessible, Interoperable, Reusable) data principles (Veluvali et al., 2024, Guan et al., 3 Oct 2025, Fenoglio et al., 27 Nov 2025, Käfer et al., 2018, Devis et al., 2023, Barakati et al., 2024, Gschwind et al., 10 Oct 2025).

1. Formal Ontologies and Descriptor Schemas

At the core of descriptor-driven workflows lies a formal ontology or schema that defines the types of entities (e.g., Task, Data, Environment, Workflow, Model) and permissible relationships (e.g., hasInput, hasOutput, hasEnvironment, implements, abstracts). In the MaRDIFlow framework, the ontology $\mathcal{M} = (C,P,R)$ comprises classes $C$ (Task, Data, Environment, Workflow, Model, Descriptor), properties $P$ (hasInput, hasOutput, etc.), and relations $R$ (composition, substitution, provenance). Each Descriptor $D$ is a tuple

$D = (d_t, d_\text{data}, d_\text{env})$

where $d_t \in \text{Task}$, $d_\text{data} \in \text{Data}$, and $d_\text{env} \in \text{Environment}$ are connected via the corresponding property axioms. Practical execution leverages type- or schema-driven mappings from these descriptors to concrete workflow units and executable artifacts (Veluvali et al., 2024).
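The descriptor tuple can be illustrated with a small Python sketch. This is only one possible encoding, not MaRDIFlow's data model: the class layout, the `triples` helper, and the example values (`heat_solve`, `mesh.h5`, `solver:1.0`) are hypothetical, and only the property names `hasInput` and `hasEnvironment` come from the text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    name: str


@dataclass(frozen=True)
class Data:
    name: str


@dataclass(frozen=True)
class Environment:
    image: str


@dataclass(frozen=True)
class Descriptor:
    """Illustrative encoding of D = (d_t, d_data, d_env)."""
    task: Task
    data: Data
    env: Environment

    def __post_init__(self):
        # Enforce class membership: d_t in Task, d_data in Data, d_env in Environment.
        for value, cls in ((self.task, Task), (self.data, Data), (self.env, Environment)):
            if not isinstance(value, cls):
                raise TypeError(f"expected {cls.__name__}, got {type(value).__name__}")

    def triples(self):
        # Property assertions that connect the three components of the tuple.
        return [
            (self.task.name, "hasInput", self.data.name),
            (self.task.name, "hasEnvironment", self.env.image),
        ]


descriptor = Descriptor(Task("heat_solve"), Data("mesh.h5"), Environment("solver:1.0"))
for triple in descriptor.triples():
    print(triple)
```

Keeping the classes frozen makes each descriptor an immutable value, so it can be compared or used as a dictionary key without further work.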

In data-driven ETL pipelines, descriptor schemas define both operator sets (e.g., the 138 DataStage stage types) and allowed property schemas for each operator, facilitating classifier-augmented stage selection, edge construction, and property inference from natural-language or formal descriptions (Gschwind et al., 10 Oct 2025).

2. Descriptor-Driven Workflow Construction and Execution

Descriptor-driven orchestration operates by iteratively or recursively interpreting the descriptor artifact to generate, parameterize, and schedule units of work. Key stages include:

Descriptor Parsing and Mapping: Systems such as MaRDIFlow define formal mappings (including a mapping $f$) from descriptors to workflow components, which are parameterized by the JSON or RDF descriptor.
Multi-Layer Descriptions: Workflows are modeled at three levels: abstract (mathematical specification), concrete (code or binary implementation), and execution (container, cluster job, or direct HPC submission), with explicit transformations between layers (Veluvali et al., 2024).
Operational Semantics: In Linked Data contexts, RDF/OWL descriptors specify workflow control flow as class/property graphs. A rule-based engine (e.g., ASM4LD) drives execution by polling descriptors, deriving state transitions, and firing actuation rules based solely on descriptor state, supporting patterns such as sequence, parallel split, exclusive choice, and loops (Käfer et al., 2018).
Declarative Control Flow: Modern engines support conditional execution, loops, and branching constructs directly in descriptor syntax (YAML, JSON), allowing dynamic expansion of complex DAGs or cyclic workflows at runtime (Guan et al., 3 Oct 2025, Veluvali et al., 2024, Gschwind et al., 10 Oct 2025).
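The parsing and declarative control-flow stages above can be sketched as follows. This is a minimal illustration under stated assumptions, not the syntax of any cited system: the descriptor layout and the keys `parameters`, `tasks`, `needs`, and `when` are invented for the example. It relies only on the standard-library `json` and `graphlib` modules (Python 3.9 or later).

```python
import json
from graphlib import TopologicalSorter

# Hypothetical JSON descriptor: tasks, their dependencies, and an optional
# "when" flag that names a boolean parameter controlling whether the task runs.
DESCRIPTOR_TEXT = """
{
  "parameters": {"refine": false},
  "tasks": {
    "mesh":   {"needs": []},
    "refine": {"needs": ["mesh"], "when": "refine"},
    "solve":  {"needs": ["mesh", "refine"]},
    "report": {"needs": ["solve"]}
  }
}
"""


def is_active(spec, params):
    # A task without a "when" key always runs; otherwise the named flag decides.
    flag = spec.get("when")
    return flag is None or bool(params[flag])


def plan(descriptor):
    """Return an execution order derived purely from the descriptor."""
    params = descriptor["parameters"]
    tasks = descriptor["tasks"]
    active = {name for name, spec in tasks.items() if is_active(spec, params)}
    # Dependencies on skipped tasks are dropped so the remaining graph stays valid.
    graph = {
        name: [dep for dep in spec["needs"] if dep in active]
        for name, spec in tasks.items()
        if name in active
    }
    return list(TopologicalSorter(graph).static_order())


descriptor = json.loads(DESCRIPTOR_TEXT)
print(plan(descriptor))  # ['mesh', 'solve', 'report']

descriptor["parameters"]["refine"] = True
print(plan(descriptor))  # ['mesh', 'refine', 'solve', 'report']
```

Changing one flag in the descriptor changes the schedule without touching the scheduling code, which is the decoupling the section describes.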

3. Data, Metadata, and Provenance Management

Descriptors centralize all metadata for unit tasks, data artifacts, and execution environments. Key practices include:

Input/Output Signatures: Each Task descriptor contains explicit references to input and output data objects, often using schema.org, DCAT, or custom ontologies (Veluvali et al., 2024).
Environment Encapsulation: Execution environments are described in detail (e.g., container images, resource constraints), enabling exact re-instantiation and provenance tracking.
FAIR Compliance: Systems such as MaRDIFlow mint UUIDv4 identifiers for descriptors, provide REST/JSON-LD endpoints for access, and integrate schema-level metadata, licensing, and W3C PROV-compliant provenance chains to ensure workflows are discoverable, reproducible, and interoperable (Veluvali et al., 2024).
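The identifier and provenance practices above can be sketched with the standard library. This is an illustrative assumption, not MaRDIFlow's API: the record layout and the `mint_record` and `lineage` helpers are hypothetical, and the field names `generatedAtTime` and `wasDerivedFrom` are borrowed from PROV vocabulary terms without producing a conformant PROV serialization.

```python
import uuid
from datetime import datetime, timezone


def mint_record(descriptor, derived_from=None):
    """Wrap a descriptor in a record with a fresh UUIDv4 identifier."""
    return {
        "@id": f"urn:uuid:{uuid.uuid4()}",
        "generatedAtTime": datetime.now(timezone.utc).isoformat(),
        "wasDerivedFrom": derived_from,  # identifier of the parent record, if any
        "license": descriptor.get("license"),
        "descriptor": descriptor,
    }


def lineage(record_id, registry):
    """Follow wasDerivedFrom links back to the original record."""
    chain = []
    while record_id is not None:
        chain.append(record_id)
        record_id = registry[record_id]["wasDerivedFrom"]
    return chain


# Hypothetical in-memory registry standing in for a descriptor store.
registry = {}
v1 = mint_record({"task": "solve", "license": "MIT"})
registry[v1["@id"]] = v1
v2 = mint_record(
    {"task": "solve", "license": "MIT", "initial_temperature": 350},
    derived_from=v1["@id"],
)
registry[v2["@id"]] = v2

print(lineage(v2["@id"], registry))
```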

In federated learning (FLUX), client-side descriptors encode local distributional statistics (means and covariances of latent features, both marginal and conditional), enabling privacy-preserving clustering, test-time adaptation, and fine-grained provenance of deployed or adapted models (Fenoglio et al., 27 Nov 2025).
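A client-side descriptor of this kind can be approximated in a few lines of plain Python. This is a rough sketch, not the FLUX construction: it assumes latent features have already been extracted, reads "conditional" as per-label statistics, and uses the hypothetical helper names `mean_and_cov` and `client_descriptor`.

```python
from collections import defaultdict


def mean_and_cov(rows):
    """Sample mean vector and covariance matrix of a list of feature rows."""
    n = len(rows)
    dim = len(rows[0])
    mean = [sum(row[j] for row in rows) / n for j in range(dim)]
    denom = max(n - 1, 1)  # unbiased estimate; guards against a single-row group
    cov = [
        [
            sum((row[i] - mean[i]) * (row[j] - mean[j]) for row in rows) / denom
            for j in range(dim)
        ]
        for i in range(dim)
    ]
    return mean, cov


def client_descriptor(features, labels):
    """Summarize local data without shipping any individual sample."""
    by_label = defaultdict(list)
    for row, label in zip(features, labels):
        by_label[label].append(row)
    return {
        "marginal": mean_and_cov(features),
        "conditional": {label: mean_and_cov(rows) for label, rows in by_label.items()},
    }


features = [[0.0, 0.0], [2.0, 2.0], [4.0, 0.0], [6.0, 2.0]]
labels = [0, 0, 1, 1]
summary = client_descriptor(features, labels)
print(summary["marginal"][0])        # [3.0, 1.0]
print(summary["conditional"][0][0])  # [1.0, 1.0]
```

Only the aggregate statistics leave the client in this sketch; a server could then cluster clients by comparing these summaries.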

4. Adaptability, Live Reconfiguration, and Dynamic Properties

A distinguishing feature of descriptor-driven workflows is the ability to reconfigure workflows at runtime by modifying descriptors:

Live Reconfiguration: Editing a JSON configuration (e.g., changing "initial_temperature") or environment file immediately regenerates the corresponding workflow task, re-builds downstream dependencies, and triggers new computations or cluster submissions (Veluvali et al., 2024).
Conditional Logic and Loop Expansion: Conditional expressions and loop constructs in descriptors determine control flow. In iDDS, loop constructs (foreach/while) are evaluated and expanded by workflow agents, enabling dynamic workload scaling and adaptive resource allocation in response to environmental or data conditions (Guan et al., 3 Oct 2025).
Real-Time Control in Audio Synthesis: In deep audio models, musicians modulate descriptors (e.g., loudness, brightness, sharpness) via continuous controls ("knobs"). These values feed directly into neural network decoders and immediately change the synthesized output waveform, with adversarial training enforcing disentanglement of descriptors in the latent space (Devis et al., 2023).
Reward-Optimized Scientific Workflows: In STEM imaging workflows, parameter sweeps over descriptor definitions (e.g., patch sizes) and clustering settings are guided by explicit, physics-aligned reward functions. Multi-objective optimization maximizes domain wall straightness and continuity, and the workflows incorporate both unsupervised clustering and variational autoencoder architectures (Barakati et al., 2024).
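The conditional and loop expansion described in this section can be sketched as a small recursive interpreter. The node layout (`kind`, `var`, `items`, `body`, `flag`, `then`, `else`) is an assumption made for illustration and does not reproduce iDDS descriptor syntax.

```python
def expand(node, context=None):
    """Expand a nested descriptor into a flat list of concrete task names."""
    context = dict(context or {})
    kind = node.get("kind", "task")
    if kind == "task":
        # Task names may reference loop variables, e.g. "cluster_p{patch}".
        return [node["name"].format(**context)]
    if kind == "foreach":
        tasks = []
        for value in node["items"]:
            child_context = {**context, node["var"]: value}
            for child in node["body"]:
                tasks.extend(expand(child, child_context))
        return tasks
    if kind == "if":
        # The branch is chosen from descriptor state only, not from code.
        branch = node["then"] if context.get(node["flag"]) else node.get("else", [])
        return [task for child in branch for task in expand(child, context)]
    raise ValueError(f"unknown node kind: {kind!r}")


# Hypothetical sweep over patch sizes with an optional refinement step.
workflow = {
    "kind": "foreach",
    "var": "patch",
    "items": [8, 16],
    "body": [
        {"name": "cluster_p{patch}"},
        {"kind": "if", "flag": "refine", "then": [{"name": "refine_p{patch}"}]},
    ],
}

print(expand(workflow))
# ['cluster_p8', 'cluster_p16']
print(expand(workflow, {"refine": True}))
# ['cluster_p8', 'refine_p8', 'cluster_p16', 'refine_p16']
```

Editing `items` or flipping `refine` in the descriptor changes the expanded workload, which is the sense in which reconfiguration happens without code changes.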

5. Scalability, Modularity, and Performance

Descriptor-driven architectures yield high scalability and modularity in workflow management:

Massive DAG Expansion: Descriptor-centric agents can dynamically unroll hierarchical or cyclic graphs containing large numbers of tasks without full materialization upfront, enabling efficient management of large-scale scientific workflows (Rubin Observatory, ATLAS) (Guan et al., 3 Oct 2025).
Distributed Scheduling: Scheduling and data movement are driven by distributed metadata-aware agents that ingest descriptors, reason about dependencies, allocate resources, and react to system states in real time, ensuring fault-tolerant, stateless operation (Guan et al., 3 Oct 2025).
Efficiency Benchmarks: In RDF-driven IoT orchestration, average workflow completion time scales linearly with the number of workflow instances and controlled endpoints (Käfer et al., 2018). In federated learning, descriptor-driven clustering and model dispatch incur negligible overhead relative to baseline FedAvg, while robustly improving performance under extreme heterogeneity (Fenoglio et al., 27 Nov 2025).
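Expansion without full upfront materialization can be illustrated with Python generators. This sketch only demonstrates the general idea of lazy unrolling; the spec keys (`name`, `datasets`, `chunks_per_dataset`) and both helper functions are hypothetical and are not taken from iDDS.

```python
from itertools import islice


def lazy_tasks(spec):
    """Yield task identifiers one at a time instead of building the full list."""
    for dataset in range(spec["datasets"]):
        for chunk in range(spec["chunks_per_dataset"]):
            yield f"{spec['name']}-{dataset}-{chunk}"


def batches(tasks, size):
    """Group a (possibly huge) task stream into fixed-size submission batches."""
    iterator = iter(tasks)
    while batch := list(islice(iterator, size)):
        yield batch


# One million logical tasks, but only the first batch is ever held in memory here.
spec = {"name": "reprocess", "datasets": 1000, "chunks_per_dataset": 1000}
first_batch = next(batches(lazy_tasks(spec), 3))
print(first_batch)  # ['reprocess-0-0', 'reprocess-0-1', 'reprocess-0-2']
```

A scheduler built this way can pull the next batch only when capacity frees up, so memory use depends on the batch size and not on the size of the full graph.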

6. Domain-Specific Implementations and Extensions

Descriptor-driven designs have been realized in diverse domains:

System/Framework	Domain/Application	Descriptor Format(s)
MaRDIFlow (Veluvali et al., 2024)	Computational Science (FAIR)	JSON, INI, RDF, JSON-LD
iDDS (Guan et al., 3 Oct 2025)	Distributed/HPC Orchestration	YAML, JSON
FLUX (Fenoglio et al., 27 Nov 2025)	Federated Learning	Numeric vectors
ASM4LD (Käfer et al., 2018)	Linked Data/IoT	RDF/OWL
CAG (Gschwind et al., 10 Oct 2025)	ETL/NLP Pipeline Generation	Free text ⇒ structured
Deep Audio (Devis et al., 2023)	Audio Synthesis	Real-valued vector knobs
Reward-driven STEM (Barakati et al., 2024)	Unsupervised Imaging Analysis	Vector, clustering params

Systems generalize to new domains given a finite operator/task schema and formalized descriptor schema. Machine learning-based systems (CAG) further leverage natural language as a descriptor to synthesize structured, interpretable workflows (nodes, edges, properties), validated against strong type and semantic constraints (Gschwind et al., 10 Oct 2025).
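Validating a generated workflow against a finite operator schema might look like the sketch below. The three operators and their property types are invented for illustration (they are not DataStage stage types), and the `validate` helper is a hypothetical stand-in for the type and semantic checks described above.

```python
# Hypothetical operator schema: operator name -> allowed properties and their types.
OPERATOR_SCHEMA = {
    "read_csv": {"path": str},
    "filter": {"condition": str},
    "write_table": {"table": str, "overwrite": bool},
}


def validate(workflow, schema=OPERATOR_SCHEMA):
    """Return a list of human-readable errors; an empty list means the graph is valid."""
    errors = []
    nodes = workflow["nodes"]
    for node_id, node in nodes.items():
        allowed = schema.get(node["op"])
        if allowed is None:
            errors.append(f"{node_id}: unknown operator {node['op']!r}")
            continue
        for key, value in node.get("properties", {}).items():
            if key not in allowed:
                errors.append(f"{node_id}: unknown property {key!r}")
            elif not isinstance(value, allowed[key]):
                errors.append(f"{node_id}: property {key!r} should be {allowed[key].__name__}")
    for source, target in workflow["edges"]:
        for endpoint in (source, target):
            if endpoint not in nodes:
                errors.append(f"edge ({source}, {target}): unknown node {endpoint!r}")
    return errors


candidate = {
    "nodes": {
        "src": {"op": "read_csv", "properties": {"path": "orders.csv"}},
        "keep": {"op": "filter", "properties": {"condition": "amount > 0"}},
        "sink": {"op": "write_table", "properties": {"table": "orders", "overwrite": True}},
    },
    "edges": [("src", "keep"), ("keep", "sink")],
}
print(validate(candidate))  # []
```

Because the check runs on the structured output only, it applies equally to a hand-written descriptor and to one synthesized from natural language.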

7. Limitations and Open Challenges

Current descriptor-driven infrastructures face nontrivial constraints:

Expressivity: Rich control flow (arbitrary loops, deep conditionals) may exceed the representational power of flat JSON/YAML schemas; workarounds involve embedded scripts or planned DSLs (Veluvali et al., 2024, Guan et al., 3 Oct 2025).
Ontology Standardization: Incomplete or homebrew ontologies impede interoperability; ongoing work aims at broader alignment (IRIS-HEP, OpenMath) (Veluvali et al., 2024).
Usability: Manual descriptor editing remains a barrier; electronic lab notebook (ELN) interfaces and drag-and-drop assembly tools are under development (Veluvali et al., 2024).
Versioning and Provenance: Synchronizing all downstream components with upstream descriptor changes, and guaranteeing full provenance-aware rebuilds, remains a subject of active research—prototype systems rely on hash-derived triggers (Veluvali et al., 2024).
Evaluation Metrics: In workflow prediction, property-level recall and exact graph recovery still lag behind precision; optimization of LLM prompts and hybrid classifier–generator ensembles remains a productive direction (Gschwind et al., 10 Oct 2025).
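The hash-derived triggers mentioned under Versioning and Provenance can be sketched as content fingerprints that include upstream fingerprints, so that a change propagates to every dependent task. This is an assumed design for illustration, not the mechanism of any cited prototype; the task layout and the helpers `fingerprints` and `stale` are hypothetical, and the dependency graph is assumed to be acyclic.

```python
import copy
import hashlib
import json


def fingerprints(tasks):
    """Map each task name to a SHA-256 digest of its config and upstream digests."""
    cache = {}

    def digest(name):
        if name not in cache:
            spec = tasks[name]
            payload = json.dumps(
                {"config": spec["config"], "upstream": [digest(d) for d in spec["needs"]]},
                sort_keys=True,  # canonical key order so equal configs hash equally
            )
            cache[name] = hashlib.sha256(payload.encode("utf-8")).hexdigest()
        return cache[name]

    for name in tasks:
        digest(name)
    return cache


def stale(old, new):
    """Names of tasks whose fingerprint changed and which therefore need a rebuild."""
    return sorted(name for name in new if old.get(name) != new[name])


tasks = {
    "setup": {"config": {"initial_temperature": 300}, "needs": []},
    "solve": {"config": {"steps": 100}, "needs": ["setup"]},
    "plot": {"config": {"format": "png"}, "needs": ["solve"]},
    "docs": {"config": {"title": "notes"}, "needs": []},
}
before = fingerprints(tasks)

edited = copy.deepcopy(tasks)
edited["setup"]["config"]["initial_temperature"] = 350
after = fingerprints(edited)

print(stale(before, after))  # ['plot', 'setup', 'solve']
```

The unrelated `docs` task keeps its fingerprint, so only the affected chain is rebuilt.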

Descriptor-driven workflows—by making every aspect of workflow instantiation, execution, data flow, and adaptation explicit, machine-readable, and verifiable—form a unifying backbone for scalable, transparent, and adaptive computational science, data engineering, federated learning, and creative machine learning applications.