LLM Agents for Interactive Workflow Provenance: Reference Architecture and Evaluation Methodology (2509.13978v1)

Published 17 Sep 2025 in cs.DC, cs.AI, and cs.DB

Abstract: Modern scientific discovery increasingly relies on workflows that process data across the Edge, Cloud, and High Performance Computing (HPC) continuum. Comprehensive and in-depth analyses of these data are critical for hypothesis validation, anomaly detection, reproducibility, and impactful findings. Although workflow provenance techniques support such analyses, at large scale, the provenance data become complex and difficult to analyze. Existing systems depend on custom scripts, structured queries, or static dashboards, limiting data interaction. In this work, we introduce an evaluation methodology, reference architecture, and open-source implementation that leverages interactive LLM agents for runtime data analysis. Our approach uses a lightweight, metadata-driven design that translates natural language into structured provenance queries. Evaluations across LLaMA, GPT, Gemini, and Claude, covering diverse query classes and a real-world chemistry workflow, show that modular design, prompt tuning, and Retrieval-Augmented Generation (RAG) enable accurate and insightful LLM agent responses beyond recorded provenance.

Summary

  • The paper introduces a modular reference architecture enabling LLM agents to interactively query complex workflow provenance.
  • It details a domain-agnostic evaluation methodology with prompt engineering and RAG strategies, comparing performance across multiple LLMs.
  • The study demonstrates scalability and adaptability in real-world workflows while highlighting challenges in complex graph-based provenance queries.

LLM Agents for Interactive Workflow Provenance: Reference Architecture and Evaluation Methodology

Introduction and Motivation

This paper presents a comprehensive reference architecture and evaluation methodology for deploying LLM agents to enable interactive querying and analysis of workflow provenance data in distributed scientific computing environments. The motivation stems from the increasing complexity and scale of scientific workflows operating across the Edge, Cloud, and High Performance Computing (HPC) continuum, where provenance data—metadata describing the lineage, execution, and context of workflow tasks—becomes both voluminous and semantically intricate. Traditional approaches relying on custom scripts, structured queries, or static dashboards are insufficient for exploratory, real-time, and flexible data interaction. The proposed agentic architecture leverages LLMs, prompt engineering, and Retrieval-Augmented Generation (RAG) to bridge the gap between users and complex provenance databases, supporting natural language interaction, live monitoring, and advanced analytics (Figure 1).

Figure 1: Taxonomy of workflow provenance query characteristics used to define query classes.

Reference Architecture for Provenance-Aware LLM Agents

The architecture is modular and loosely coupled, designed for scalability and interoperability across heterogeneous infrastructures. Provenance capture is achieved via two mechanisms: non-intrusive observability adapters (e.g., RabbitMQ, MLflow, file systems) and direct code instrumentation (e.g., Python decorators for Dask, PyTorch). Provenance messages are buffered and streamed asynchronously to a central hub using publish-subscribe protocols (Redis, Kafka, Mofka), minimizing interference with HPC workloads. The architecture supports multiple backend DBMSs (MongoDB, LMDB, Neo4j) and exposes a language-agnostic Query API for programmatic, dashboard, or natural language access.
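
As a concrete illustration of the capture path, the sketch below shows a decorator-based instrumentation hook that publishes provenance messages over Redis pub-sub. The decorator name, channel, and message fields are assumptions for illustration; Flowcept's actual decorators and message format differ.

```python
import functools
import json
import time

import redis  # pip install redis; Kafka or Mofka would follow the same pattern

r = redis.Redis(host="localhost", port=6379)
PROVENANCE_CHANNEL = "provenance"  # hypothetical channel name

def capture_provenance(func):
    """Illustrative decorator: records a task's inputs, output, and timing,
    then publishes the provenance message over pub-sub so analysis happens
    off the critical path of the HPC workload."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        started_at = time.time()
        result = func(*args, **kwargs)
        message = {
            "task": func.__name__,
            "used": {f"arg{i}": repr(a) for i, a in enumerate(args)}
                    | {k: repr(v) for k, v in kwargs.items()},
            "generated": {"result": repr(result)},
            "started_at": started_at,
            "ended_at": time.time(),
        }
        r.publish(PROVENANCE_CHANNEL, json.dumps(message))  # fire-and-forget
        return result
    return wrapper

@capture_provenance
def simulate(step: int, temperature: float) -> float:
    return temperature * 0.99 ** step  # stand-in for a real workflow task
```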

A key innovation is the dynamic dataflow schema, incrementally inferred at runtime from live provenance streams, summarizing workflow structure, parameters, outputs, and semantic relationships. This schema is maintained in-memory by the agent's Context Manager and used to condition LLM prompts, enabling effective query translation and reasoning without requiring access to raw provenance records. This approach is particularly advantageous for privacy-preserving deployments and for workflows with rapidly evolving schemas (Figure 2).

Figure 2: Provenance AI Agent Design.
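
To make the schema idea concrete, here is a minimal sketch of a Context Manager that incrementally infers per-task field names and types from live provenance messages and renders a compact summary for prompt conditioning. The class and method names are assumptions, not the paper's code:

```python
from collections import defaultdict

class ContextManager:
    """Keeps an in-memory dataflow schema: for each task, the input/output
    fields observed so far and their inferred types."""

    def __init__(self):
        self.schema = defaultdict(lambda: {"inputs": {}, "outputs": {}})

    def observe(self, message: dict) -> None:
        """Update the schema from one provenance message; raw values are
        inspected for their type but never stored."""
        task = message["task"]
        for field, value in message.get("used", {}).items():
            self.schema[task]["inputs"][field] = type(value).__name__
        for field, value in message.get("generated", {}).items():
            self.schema[task]["outputs"][field] = type(value).__name__

    def to_prompt_context(self) -> str:
        """Compact textual schema used to condition LLM prompts without
        exposing any raw provenance records."""
        lines = []
        for task, io in sorted(self.schema.items()):
            ins = ", ".join(f"{k}:{t}" for k, t in io["inputs"].items())
            outs = ", ".join(f"{k}:{t}" for k, t in io["outputs"].items())
            lines.append(f"{task}({ins}) -> {outs}")
        return "\n".join(lines)

cm = ContextManager()
cm.observe({"task": "simulate",
            "used": {"step": 3, "temperature": 300.0},
            "generated": {"energy": -1.23}})
print(cm.to_prompt_context())  # simulate(step:int, temperature:float) -> energy:float
```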

Evaluation Methodology: Prompt Engineering and RAG Strategies

The evaluation methodology is domain-agnostic and system-independent, focusing on the design of RAG pipelines and prompt engineering to assess LLM performance across diverse provenance query classes. The process is iterative, comprising:

  • Query Set Definition: Curated natural language queries mapped to taxonomy-defined classes and expected answers.
  • Prompt Engineering: Techniques ranging from zero-shot to few-shot and chain-of-thought, with role-giving and domain-specific guidelines.
  • RAG Strategies: Context enrichment via schema, metadata, and representative domain values, balancing token consumption and semantic coverage.
  • Evaluation: LLM-as-a-judge and rule-based scoring, emphasizing functional equivalence and scalability (a minimal judge sketch follows Figure 3).
  • Experimental Runs and Refinement: Iterative improvement of prompts, RAG, and agent logic based on evaluation feedback (Figure 3).

    Figure 3: Evaluation methodology to iteratively improve the agent’s performance via prompt and context tuning.
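
As referenced in the list above, a minimal sketch of the LLM-as-a-judge scoring loop, assuming a generic `llm_call(prompt) -> str` function backed by a judge model such as GPT or Claude; the prompt wording and scoring scale are illustrative, not the paper's exact rubric:

```python
JUDGE_PROMPT = """You are grading a provenance query agent.
Question: {question}
Expected answer: {expected}
Agent answer: {actual}
Score 1.0 if the answers are functionally equivalent (same facts, any format),
0.5 if partially correct, and 0.0 otherwise. Reply with the number only."""

def judge(llm_call, question: str, expected: str, actual: str) -> float:
    reply = llm_call(JUDGE_PROMPT.format(
        question=question, expected=expected, actual=actual))
    try:
        return float(reply.strip())
    except ValueError:
        return 0.0  # an unparsable judge reply counts as a failed grade

def evaluate(llm_call, query_set, answer_fn, runs: int = 3) -> float:
    """Average score over several runs per query to mitigate LLM stochasticity.
    query_set is a list of (question, expected_answer) pairs; answer_fn is the
    agent under test."""
    scores = [judge(llm_call, q, expected, answer_fn(q))
              for q, expected in query_set
              for _ in range(runs)]
    return sum(scores) / len(scores)
```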

Experimental Results: Synthetic and Real-World Workflows

The agent was implemented atop the Flowcept infrastructure using the Python MCP SDK, with GUI and API interfaces; a minimal MCP sketch follows the workflow list below. Two workflows were used for evaluation:

  • Synthetic Math Workflow: Designed for rapid prototyping and prompt tuning, enabling controlled scaling and deterministic behavior.
  • Computational Chemistry Workflow: Real-world DFT-based analysis of bond dissociation energies, executed on the Frontier supercomputer, featuring complex nested schemas and domain-specific semantics (Figure 4).

    Figure 4: Use case workflows: (A) Synthetic math workflow; (B) Real computational chemistry workflow for Bond Dissociation Energy (BDE) analysis using Density Functional Theory (DFT).
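
As noted above, the agent is exposed through the Python MCP SDK. A minimal sketch of registering the agent as an MCP tool follows; the server name, tool signature, and stubbed body are assumptions, not Flowcept's actual wiring:

```python
# pip install mcp  (the official Python MCP SDK)
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("provenance-agent")  # hypothetical server name

@mcp.tool()
def query_provenance(question: str) -> str:
    """Answer a natural-language question about workflow provenance.
    In the real agent this would condition an LLM on the dataflow schema
    and run the generated query against the provenance DBMS; stubbed here."""
    return f"(stub) would answer: {question}"

if __name__ == "__main__":
    mcp.run()  # serves the tool over stdio so MCP clients (GUIs, APIs) can call it
```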

LLMs evaluated included LLaMA 3 (8B, 70B), GPT-4, Gemini 2.5, and Claude Opus 4. Prompts were incrementally enriched with context components (role, job, DataFrame format, few-shot examples, schema, domain values, guidelines); a sketch of this assembly follows Figure 6. Query accuracy was assessed using LLM-as-a-judge (GPT and Claude), with each query executed multiple times to mitigate stochasticity (Figure 5).

Figure 5: Scores assigned by two different judges.


Figure 6: Different LLMs' performance in different query classes.
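
A hedged sketch of the incremental context assembly described above: each component can be toggled independently so its accuracy gain can be weighed against its token cost. The component names mirror the paper's list, but the assembly code and example strings are illustrative:

```python
def build_prompt(question: str, components: dict[str, str], enabled: list[str]) -> str:
    """Concatenate only the enabled context components, in a fixed order,
    so each component's cost (tokens) and benefit (accuracy) can be
    measured in isolation."""
    order = ["role", "job", "dataframe_format", "few_shot_examples",
             "schema", "domain_values", "guidelines"]
    parts = [components[name] for name in order
             if name in enabled and name in components]
    parts.append(f"User query: {question}")
    return "\n\n".join(parts)

components = {
    "role": "You are a workflow provenance analyst.",
    "job": "Translate the user's question into a structured provenance query.",
    "schema": "simulate(step:int, temperature:float) -> energy:float",  # from the Context Manager
    "guidelines": "Use exact field names from the schema; never invent columns.",
}
prompt = build_prompt("Which task produced the lowest energy?",
                      components,
                      enabled=["role", "job", "schema", "guidelines"])
print(prompt)
```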

Key Findings

  • Contextual Enrichment: Adding query guidelines and few-shot examples yielded the largest performance gains with minimal token overhead. Schema and domain values improved semantic alignment but increased token usage.
  • Model Performance: GPT-4 and Claude Opus 4 achieved near-perfect scores; LLaMA and Gemini exhibited greater variability and lower accuracy, especially on OLAP queries involving graph-like reasoning.
  • Judge Bias: Each LLM-as-a-judge showed mild preference for its own outputs, but overall ranking trends were consistent, supporting reliability.
  • Scalability: The metadata-driven approach decouples LLM performance from data volume, enabling lightweight operation even for large-scale workflows.
  • Generalization: The agent generalized from synthetic to real-world workflows without domain-specific tuning, correctly or partially correctly answering over 80% of queries in the chemistry use case (Figure 7).

    Figure 7: Impact of contextual information components against performance and token consumption.

    Figure 8: Impact of contextual information components against performance and token consumption.

Live Interaction and Real-Time Analytics

A live demonstration on the Frontier supercomputer showcased the agent's ability to support real-time, natural language interaction with the chemistry workflow. The agent responded to queries with tables, plots, and summaries, inferring units and domain concepts, and supporting hypothesis validation and monitoring. While most queries were answered correctly, some edge cases (e.g., ambiguous atom counts, custom visualizations) revealed limitations in semantic inference and prompt adaptation (Figure 9).

Figure 9: Live interaction with the chemistry workflow. The user interacts in natural language and receives responses, including plots, tabular results, and summarized text.

Implications, Limitations, and Future Directions

Practical Implications

  • Accelerated Data-to-Insights: The agentic approach reduces the barrier to exploratory provenance analysis, anomaly detection, and monitoring, facilitating scientific discovery in complex Edge-Cloud-HPC (ECH) workflows.
  • Modularity and Extensibility: Separation of concerns enables easy integration of new tools, scaling, and adaptation to diverse workflows and provenance systems.
  • Privacy and Efficiency: Metadata-driven schema conditioning avoids raw data exposure and context window overflow, supporting secure and efficient deployment.

Theoretical Implications

  • Schema-Driven Reasoning: Dynamic dataflow schemas enable LLMs to reason over workflow structure and semantics, supporting generalization and adaptability.
  • Evaluation Methodology: LLM-as-a-judge provides scalable, nuanced assessment of agent performance, though human oversight remains necessary to mitigate bias and hallucination.

Limitations

  • Graph-Based Provenance Queries: Deep causal analysis over persistent databases remains challenging; the current DataFrame-centric logic is insufficient for multi-hop graph traversals (see the sketch after this list).
  • Semantic Quality Dependency: Agent performance depends on the intentionality and descriptiveness of workflow code; poor variable naming or lack of annotations can hinder inference.
  • No Universal LLM: No single model excels across all query classes, motivating research into adaptive LLM routing and ensemble methods.
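
To illustrate the multi-hop limitation, the sketch below computes transitive upstream lineage from a flat edge table: a single DataFrame filter finds only direct parents, and the full traversal needs an explicit fixpoint loop, which graph backends such as Neo4j express natively. The edge table and task names are invented for the example:

```python
import pandas as pd

# Illustrative edge table: each row says `child` consumed an output of `parent`.
edges = pd.DataFrame({
    "parent": ["ingest", "clean", "clean", "featurize"],
    "child":  ["clean", "featurize", "report", "train"],
})

def upstream_lineage(edges: pd.DataFrame, task: str) -> set[str]:
    """Transitive (multi-hop) ancestors of `task`. One DataFrame filter
    yields only direct parents; reaching the full lineage requires
    iterating until no new ancestors appear."""
    ancestors, frontier = set(), {task}
    while frontier:
        parents = set(edges.loc[edges["child"].isin(frontier), "parent"]) - ancestors
        ancestors |= parents
        frontier = parents
    return ancestors

print(upstream_lineage(edges, "train"))  # {'featurize', 'clean', 'ingest'}
```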

Future Work

  • Dynamic Semantic Enrichment: Automated inference of domain semantics from code and data to improve agent reasoning.
  • Feedback-Driven Prompt Tuning: Integration of auto-fixer agents for runtime correction and guideline adaptation.
  • Scalable Graph Querying: Extension of architecture to support complex graph traversals and causal analysis in provenance databases.
  • Extreme-Scale Workflows: Migration to high-performance in-memory buffers (e.g., Polars) for massive provenance data.

Conclusion

This work establishes a robust foundation for interactive, provenance-aware LLM agents in scientific workflows. By leveraging modular architecture, dynamic schema conditioning, and iterative prompt engineering, the agent enables accurate, scalable, and generalizable interaction with complex provenance data. The approach is validated across synthetic and real-world workflows, demonstrating high accuracy and adaptability. Open challenges remain in semantic enrichment, scalable graph querying, and adaptive LLM selection, but the proposed methodology and architecture provide a clear path forward for intelligent workflow analysis and accelerated scientific discovery.
