Agentic HPC Middleware
- Agentic HPC middleware is a system software paradigm that embeds AI agents into the orchestration layer to autonomously manage and optimize HPC resources in real time.
- The architecture leverages closed-loop decision-making, unified data-control abstraction, and federated infrastructures to dynamically adapt and scale complex workflows.
- It enables low-latency task management and robust fault tolerance, as evidenced by systems like Academy, RHAPSODY, and Claw-R1, which deliver high throughput across diverse hardware.
Agentic HPC Middleware refers to a class of system software that empowers autonomous, LLM-driven or multi-agent workflows to steer, orchestrate, and optimize high-performance computing (HPC) resources in real time. Unlike conventional middleware that focuses on job scheduling, resource isolation, and static data movement, agentic middleware incorporates closed-loop decision-making, adaptive task generation, and dynamic resource allocation, enabling the automation and scaling of complex scientific and industrial processes on supercomputers and federated infrastructures (Brewer et al., 2024).
1. Conceptual Foundations and Defining Properties
Agentic HPC middleware embeds AI “agents”—software entities capable of perception, action, and learning—directly into the orchestration layer of scientific workflows and simulation campaigns. The agents autonomously observe system state (simulation outputs, telemetry, intermediate data products), make runtime decisions (e.g., spawn new simulations, alter parameters, allocate new resources), and issue high-level control commands across heterogeneous compute and storage facilities (Brewer et al., 2024). Key defining properties include:
- Continuous adaptive control: Persistent AI-driven loops repeatedly analyze workflow and simulation state, pruning, generating, and scaling tasks as conditions change (Alsaadi et al., 23 Dec 2025).
- Unified data and control abstraction: AI agents operate seamlessly across disparate scientific tasks, treating both simulation and AI processes as composable, managed entities (Pauloski et al., 8 May 2025, Alsaadi et al., 23 Dec 2025).
- Real-time and asynchronous composition: Supports dynamic task launches, agent coordination, and results aggregation with low-latency, scalable communication protocols (Alsaadi et al., 23 Dec 2025, Pauloski et al., 8 May 2025).
- Federated, heterogeneous infrastructure: Integrates CPUs, GPUs, accelerators, burst resources, and edge or cloud components within a common substrate (Lopez et al., 16 Sep 2025).
This middleware paradigm is motivated by the increasing complexity, concurrency, and dynamism of modern AI-in-the-loop scientific workloads, and underpins domains from reinforcement learning to atomistic modeling and exascale molecular discovery (Wang et al., 8 Jun 2026, Dawson et al., 24 Apr 2026, Sinclair et al., 17 Dec 2025).
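The observe, decide, act loop described in this section can be sketched in a few lines. This is a minimal illustration and not the API of any system cited here: `run_simulation`, `decide`, and `steer` are hypothetical names, the "simulation" is a toy quadratic, and the "agent" is a simple rule standing in for an LLM or learned policy.

```python
from dataclasses import dataclass


@dataclass
class Observation:
    param: float
    score: float  # lower is better


def run_simulation(param: float) -> Observation:
    """Toy stand-in for an HPC simulation task (hypothetical)."""
    return Observation(param=param, score=(param - 3.0) ** 2)


def decide(history: list[Observation], step: float) -> list[float]:
    """Toy agent policy: propose new parameters around the best result so far."""
    best = min(history, key=lambda o: o.score)
    return [best.param - step, best.param + step]


def steer(initial: list[float], rounds: int = 20, step: float = 1.0) -> Observation:
    """Closed loop: observe results, decide on new tasks, act by launching them."""
    history = [run_simulation(p) for p in initial]
    for _ in range(rounds):
        best_before = min(o.score for o in history)
        candidates = decide(history, step)                  # decide
        results = [run_simulation(p) for p in candidates]   # act
        history.extend(results)                             # observe
        if min(o.score for o in results) >= best_before:
            step /= 2.0  # no improvement: narrow the search (adapt)
    return min(history, key=lambda o: o.score)


best = steer(initial=[0.0, 10.0])
print(best)
```

In a real deployment the `run_simulation` call would be an asynchronous task submission and the policy would consume telemetry rather than a single scalar, but the control structure is the same.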
2. Architectural Components and System Patterns
Agentic middleware architectures consist of several core components and interlocking subsystems, with notable exemplars including Academy (Pauloski et al., 8 May 2025), RHAPSODY (Alsaadi et al., 23 Dec 2025), Claw-R1 (Wang et al., 8 Jun 2026), and DALIA (Rodriguez-Sanchez et al., 24 Jan 2026). Generalized system patterns are summarized below.
| Component | Function | Example Systems |
|---|---|---|
| Agent Manager (Control Plane) | Hosts decision logic (LLM/RAG/iterative planners); manages agent and tool registry | Academy, RHAPSODY, DALIA |
| Execution Fabric (Data Plane) | Launches and oversees compute tasks—often via workflow engines (Parsl, Dragon) | Academy, RHAPSODY, Claw-R1 |
| Data Pool (Step Store) | Persistent, indexed storage for agentic data, interaction traces, rewards | Claw-R1 |
| Orchestrator | Coordinates phases (discovery, planning, execution), enforces global policy | DALIA, HEPTAPOD, RHAPSODY |
| Communication Layer | Supports RPC, message-passing, pub-sub, zero-copy data exchange | SmartRedis, ProxyStore, Globus |
| Scheduler Adapter | Maps logical tasks/agents to SLURM, PBS, Kubernetes, cloud-native backends | Academy, RHAPSODY, LARA-HPC |
The architectural patterns follow execution motifs that include ensemble steering, inverse design, adaptive pipelines, federated model training, and agent-driven digital replicas (Brewer et al., 2024). Notably, RHAPSODY composes runtimes to make tasks, services, and resources uniformly manageable and schedulable, supporting concurrent simulation, persistent inference services, and fine-grained AI feedback in a single allocation (Alsaadi et al., 23 Dec 2025). Claw-R1 introduces a Gateway Server and Data Pool—creating a data-centric decoupling between agent rollouts and RL training, optimized for HPC throughput (Wang et al., 8 Jun 2026).
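The Scheduler Adapter row in the table can be made concrete with a small sketch. The classes below are hypothetical and are not taken from Academy, RHAPSODY, or LARA-HPC; they only show one way a logical task description might be translated into backend-specific launch commands. The Slurm adapter builds an `sbatch` argument list without running it, and the local adapter is an assumed second backend included for contrast.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class LogicalTask:
    """Backend-neutral task description (hypothetical fields)."""
    name: str
    command: str
    nodes: int = 1
    gpus_per_node: int = 0
    walltime_minutes: int = 30


class SchedulerAdapter(Protocol):
    def render(self, task: LogicalTask) -> list[str]:
        """Return the argv that would submit the task to this backend."""
        ...


class SlurmAdapter:
    def render(self, task: LogicalTask) -> list[str]:
        hours, minutes = divmod(task.walltime_minutes, 60)
        argv = [
            "sbatch",
            f"--job-name={task.name}",
            f"--nodes={task.nodes}",
            f"--time={hours:02d}:{minutes:02d}:00",
        ]
        if task.gpus_per_node:
            argv.append(f"--gpus-per-node={task.gpus_per_node}")
        argv.append(f"--wrap={task.command}")
        return argv


class LocalAdapter:
    """Assumed fallback backend: run the command in a local shell."""
    def render(self, task: LogicalTask) -> list[str]:
        return ["sh", "-c", task.command]


task = LogicalTask("md-agent", "python run_md.py", nodes=2,
                   gpus_per_node=4, walltime_minutes=90)
slurm_argv = SlurmAdapter().render(task)
local_argv = LocalAdapter().render(task)
print(slurm_argv)
```

Keeping the adapter a pure function from task to argv makes it easy to test and to log exactly what was submitted.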
3. Data, Control, and Inter-Agent Protocols
Agentic middleware implements dataflow and control mechanisms designed for consistency, auditability, and asynchronous operation.
- Step-level or batch-level record models: Claw-R1 encodes each agent-environment interaction as a “StepRecord” with promptID, responseID, tokens, reward, trajectoryID, policyVersion, and rich metadata, stored using prefix-tree (trie) compaction for token efficiency (Wang et al., 8 Jun 2026).
- Federated discovery and registry: DALIA maintains a distributed directory of agent capabilities and execution resources, enforcing strict separation between capability registration, discovery, graph planning, and execution (Rodriguez-Sanchez et al., 24 Jan 2026).
- Message passing and RPC: Protocols include REST/gRPC for agent submission, task/result reporting, and batch pull (Claw-R1, MCP-based systems), as well as mailbox/queue-based messaging for distributed agent pools (Academy) (Pauloski et al., 8 May 2025, Pham et al., 9 Apr 2026).
- Schema-validated tool contracts: Tools and workflow steps are specified with JSON schemas for input/output contracts, enabling deterministic graph expansion, error diagnosis, and human-in-the-loop control (HEPTAPOD, Academy, DALIA) (Menzo et al., 17 Dec 2025, Pauloski et al., 8 May 2025, Rodriguez-Sanchez et al., 24 Jan 2026).
These protocols enable expressive agent-driven orchestration under strong typing and reproducibility constraints, with clear separation of planning and execution (DALIA, RHAPSODY). This design reduces the risk of hallucinated or invalid actions observed in prior agentic system failures (Rodriguez-Sanchez et al., 24 Jan 2026).
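A step-level record with prefix-tree compaction might look roughly like the following. The field names mirror the list above, but the types, the `TokenTrie` class, and the node-counting approach are assumptions made for illustration and do not reproduce Claw-R1's actual schema or storage layout.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass(frozen=True)
class StepRecord:
    # Field names follow the description above; types are assumptions.
    prompt_id: str
    response_id: str
    tokens: tuple[int, ...]
    reward: float
    trajectory_id: str
    policy_version: int
    metadata: dict[str, Any] = field(default_factory=dict)


class TokenTrie:
    """Stores token sequences so that shared prefixes are kept only once."""

    def __init__(self) -> None:
        self.root: dict[int, dict] = {}
        self.node_count = 0

    def insert(self, tokens: tuple[int, ...]) -> None:
        node = self.root
        for tok in tokens:
            if tok not in node:
                node[tok] = {}
                self.node_count += 1
            node = node[tok]


# Two steps of one trajectory that share the prefix (1, 2, 3).
a = StepRecord("p1", "r1", (1, 2, 3, 4), 0.0, "traj-7", policy_version=3)
b = StepRecord("p1", "r2", (1, 2, 3, 5, 6), 1.0, "traj-7", policy_version=3)

trie = TokenTrie()
for rec in (a, b):
    trie.insert(rec.tokens)

raw_tokens = len(a.tokens) + len(b.tokens)
print(f"raw tokens: {raw_tokens}, trie nodes: {trie.node_count}")
```

The saving grows with the length of the shared prefix, which is why this kind of compaction suits multi-step trajectories that repeatedly extend the same context.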
4. Scheduling, Resource Allocation, and Scaling Strategies
Resource management in agentic HPC middleware is characterized by multi-runtime scheduling, elastic scaling, heterogeneity awareness, and decentralized dynamic task distribution.
- Unified resource pools: Middleware such as RHAPSODY partitions compute pools to co-locate MPI simulations, GPU-resident inference services, and analysis tasks, and schedules and backfills work opportunistically across sub-pools (Alsaadi et al., 23 Dec 2025).
- Dynamic sharding and hashing: Claw-R1 uses trajectoryID/promptID sharding for scalable ingestion and batch serving in HPC settings, supporting ~10⁶–10⁷ steps/s ingestion in distributed deployments (Wang et al., 8 Jun 2026).
- Batch and microservice hybrids: Dual-stack architectures integrate cloud-native orchestrators (Kubernetes, FaaS), job schedulers (SLURM, PBS), and object stores (S3, PFS), while bridging batch jobs with on-demand services for agentic loops (Lopez et al., 16 Sep 2025).
- Tournament and planner–executor models: Systems like Academy and high-throughput materials screening frameworks distribute workloads using tournament-based agent competitions or hierarchical planner–executor-agent architectures, balancing parallel efficiency with robust fault tolerance and agent specialization (Sinclair et al., 17 Dec 2025, Pham et al., 9 Apr 2026).
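The ID-based sharding mentioned in the list above can be illustrated with a stable hash. This is a generic sketch rather than Claw-R1's implementation: `shard_for` and `route` are hypothetical helpers, and SHA-256 is an assumed choice made only because Python's built-in `hash` for strings is randomized per process.

```python
import hashlib
from collections import defaultdict


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard with a hash that is stable across processes."""
    if num_shards < 1:
        raise ValueError("num_shards must be at least 1")
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def route(steps: list[tuple[str, int]],
          num_shards: int) -> dict[int, list[tuple[str, int]]]:
    """Group (trajectory_id, step_index) pairs by destination shard."""
    shards: dict[int, list[tuple[str, int]]] = defaultdict(list)
    for trajectory_id, step_index in steps:
        shards[shard_for(trajectory_id, num_shards)].append(
            (trajectory_id, step_index))
    return dict(shards)


# 50 trajectories with 4 steps each, spread over 8 shards.
steps = [(f"traj-{t}", s) for t in range(50) for s in range(4)]
placement = route(steps, num_shards=8)
print({shard: len(items) for shard, items in sorted(placement.items())})
```

Sharding on the trajectory ID keeps every step of a trajectory on one shard, so a batch for training can be assembled without cross-shard joins.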
Parallel efficiency and middleware overheads are empirically characterized. For example, Academy achieves end-to-end middleware overheads of 3–5% at 256–512 node scale, with weak scaling efficiency E(256) ≈ 0.80 for MD agents and E(512) ≈ 0.68 for binder design (Sinclair et al., 17 Dec 2025). RHAPSODY demonstrates orchestration overhead below 5% and supports heterogeneity widths (concurrent task types) up to 22 on 1,024 nodes, maintaining real-time agent-to-HPC realization rates with <2 s lag for 49,000-agent workloads (Alsaadi et al., 23 Dec 2025, Pham et al., 9 Apr 2026).
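The text does not state how E(N) is computed. Assuming the conventional weak-scaling definition, in which the work per node is held fixed and T(N) is the time to solution on N nodes, the reported figures can be read as follows. The single-node baseline T(1) is an assumption; the cited studies may normalize against a different reference scale.

```latex
E(N) = \frac{T(1)}{T(N)}, \qquad
E(256) \approx 0.80 \;\Rightarrow\; T(256) \approx 1.25\, T(1), \qquad
E(512) \approx 0.68 \;\Rightarrow\; T(512) \approx 1.47\, T(1).
```

Under that reading, an efficiency of 0.80 corresponds to a run that takes about 25% longer than the baseline despite the per-node workload being unchanged.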
5. Performance, Fault Tolerance, and Robustness
Agentic middleware is engineered for both high throughput and real-time reaction loops. Performance attributes include:
- Low-latency orchestration: Communication protocols (e.g., Redis, SmartRedis, ProxyStore) deliver sub-millisecond data exchanges; cold starts for HPC FaaS functions are reported at roughly 25–45 ms, enabling sub-100 ms agentic loops (Lopez et al., 16 Sep 2025, Alsaadi et al., 23 Dec 2025).
- Scalable batch and event handling: Pull-based data serving, asynchronous message buses, and persistent service lifecycles maximize utilization and minimize blocking on resource contention (Wang et al., 8 Jun 2026).
- Resilience and recovery: Agent registrations, step records, and other long-lived state are persisted durably; failures trigger automatic checkpoint restarts (via Parsl/Globus Compute), backoffs, or re-planning cycles, validated with internal QA agents or validation engines (Academy, LARA-HPC) (Sinclair et al., 17 Dec 2025, Dawson et al., 24 Apr 2026).
- Auditable and reproducible execution: All task invocations, tool parameters, and outputs are logged with full provenance; deterministic task graph expansion (DALIA, HEPTAPOD) guarantees run-to-run traceability (Menzo et al., 17 Dec 2025, Rodriguez-Sanchez et al., 24 Jan 2026).
Metrics tracked include task throughput (up to 3.4k actions/s for Academy at 208 agents), action completion latency (∼385 μs for Academy vs. 1,186 μs for Dask), and sustained GPU utilization in jointly managed resource pools (>75%) (Pauloski et al., 8 May 2025, Lopez et al., 16 Sep 2025). Validation-driven approaches (LARA-HPC) reduce resource waste by filtering invalid workflows prior to submission, lowering HPC error rates from ∼40% to <5% (Dawson et al., 24 Apr 2026).
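The backoff-and-retry behavior listed under resilience can be sketched generically. The helper below is hypothetical and does not reflect how Parsl, Globus Compute, or Academy implement recovery; the choice of `RuntimeError` as the retryable failure and the doubling delay are assumptions. The `sleep` parameter is injectable so the example runs without actually waiting.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_backoff(task: Callable[[], T],
                     max_attempts: int = 4,
                     base_delay: float = 0.5,
                     sleep: Callable[[float], None] = time.sleep) -> T:
    """Retry a failing task, doubling the delay after each failure."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return task()
        except RuntimeError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the failure for re-planning
            sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")


# Simulated task that fails twice before succeeding.
attempts = {"count": 0}


def flaky_task() -> str:
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("node failure (simulated)")
    return "ok"


delays: list[float] = []
result = run_with_backoff(flaky_task, sleep=delays.append)
print(result, delays)
```

Re-raising after the final attempt leaves the decision about re-planning to the caller, which matches the separation of decision and execution logic discussed elsewhere in this article.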
6. Extensibility, Domain Specialization, and Best Practices
Agentic middleware is explicitly designed for extensibility, domain adaptation, and federated operation.
- Plugin-based agent and tool models: Tools can be added by containerizing a CLI entry point and registering I/O schemas; agent behaviors are subclassed; agent registration and discovery are schema-driven (Academy, DALIA) (Pauloski et al., 8 May 2025, Rodriguez-Sanchez et al., 24 Jan 2026).
- Templates and schema validation: All workflow steps and tool invocations are contractually specified, facilitating integration of domain- or simulation-specific "dry-run" validation (LARA-HPC) and run-card-driven reproducibility (HEPTAPOD) (Dawson et al., 24 Apr 2026, Menzo et al., 17 Dec 2025).
- Cross-facility federation: Middleware supports deployment of agents across multiple HPC centers and clouds, using interoperable directory, discovery, and control protocols (Academy, DALIA, RHAPSODY) (Pauloski et al., 8 May 2025, Rodriguez-Sanchez et al., 24 Jan 2026, Alsaadi et al., 23 Dec 2025).
- Best practices: Recommendations include strictly separating decision logic from execution logic, composing components through schema-validated RPC, embedding QA checks that catch exceptions, funneling I/O-intensive tasks through aggregation, and keeping a full audit trail with version capture for every workflow and configuration (Sinclair et al., 17 Dec 2025, Alsaadi et al., 23 Dec 2025, Wang et al., 8 Jun 2026).
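Schema-driven tool registration, as described in the list above, can be approximated with the standard library alone. The sketch below is hypothetical: `ToolRegistry`, `ContractError`, and the `scale_energy` tool are invented for illustration, and the type-map "schema" is a deliberately simplified stand-in for the JSON Schema contracts the cited systems are described as using.

```python
from typing import Any, Callable

Schema = dict[str, type]


class ContractError(ValueError):
    """Raised when a payload does not match its declared contract."""


def check(payload: dict[str, Any], schema: Schema, where: str) -> None:
    missing = schema.keys() - payload.keys()
    extra = payload.keys() - schema.keys()
    if missing or extra:
        raise ContractError(
            f"{where}: missing={sorted(missing)} extra={sorted(extra)}")
    for key, expected in schema.items():
        if not isinstance(payload[key], expected):
            raise ContractError(f"{where}: {key!r} should be {expected.__name__}")


class ToolRegistry:
    def __init__(self) -> None:
        self._tools: dict[str, tuple[Callable[..., dict], Schema, Schema]] = {}

    def register(self, name: str, func: Callable[..., dict],
                 input_schema: Schema, output_schema: Schema) -> None:
        self._tools[name] = (func, input_schema, output_schema)

    def invoke(self, name: str, payload: dict[str, Any]) -> dict[str, Any]:
        func, input_schema, output_schema = self._tools[name]
        check(payload, input_schema, f"{name} input")    # validate before running
        result = func(**payload)
        check(result, output_schema, f"{name} output")   # validate before returning
        return result


def scale_energy(energy: float, factor: float) -> dict[str, float]:
    return {"energy": energy * factor}


registry = ToolRegistry()
registry.register("scale_energy", scale_energy,
                  input_schema={"energy": float, "factor": float},
                  output_schema={"energy": float})

out = registry.invoke("scale_energy", {"energy": 1.5, "factor": 2.0})

try:
    registry.invoke("scale_energy", {"energy": "high", "factor": 2.0})
    rejected = ""
except ContractError as err:
    rejected = str(err)  # an invalid agent action is stopped before execution
print(out, rejected)
```

Validating on both sides of the call means a malformed request from an agent never reaches the compute resource, and a malformed tool result never reaches the planner.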
7. Open Challenges and Future Directions
Despite substantial progress, several research challenges persist:
- Unified data plane abstraction: Data management that is both heterogeneous and high-performance across scientific arrays, tensor stores, and large-batch and mini-batch formats is still lacking (Brewer et al., 2024).
- Dynamic, agent-aware scheduling: Hybrid runtime systems that blend HPC-level resource guarantees with highly reactive, deadline-aware agentic control loops are still being explored (Alsaadi et al., 23 Dec 2025, Lopez et al., 16 Sep 2025).
- Edge/federated extension: Robust, privacy-aware, cross-facility orchestration with adaptive protocols for device churn and variable network performance remains open (Brewer et al., 2024, Pauloski et al., 8 May 2025).
- Motif-driven benchmarking: There is a recognized lack of standard benchmarks exercising agentic execution motifs (e.g., steering, inverse design, adaptive pipeline), impeding systematic performance comparison (Brewer et al., 2024).
- Autonomic policy adaptation and security: Future middleware must integrate policy engines for online adaptation, strict multi-tenant isolation, and support for domain-specific compliance (Alsaadi et al., 23 Dec 2025, Lopez et al., 16 Sep 2025).
The continuing integration of agentic middleware into exascale platforms, dual cloud–HPC “AI Factory” stacks, federated research networks, and community “sandbox” frameworks will define the operational and architectural trajectory of autonomous scientific computing in the coming decade (Lopez et al., 16 Sep 2025, Alsaadi et al., 23 Dec 2025, Rodriguez-Sanchez et al., 24 Jan 2026, Pauloski et al., 8 May 2025).