
Agentic Serving Frameworks

Updated 3 March 2026
  • Agentic Serving Frameworks are specialized systems that define and execute LLM-driven workflows as directed acyclic graphs with clear control and data dependencies.
  • They utilize dedicated resource pooling, adaptive routing, and MILP-based, profile-guided optimization to balance cost, latency, and throughput across multi-stage pipelines.
  • ASFs integrate dynamic state management and policy-driven governance to ensure robust security, compliance with service-level objectives, and efficient multi-agent orchestration.

Agentic Serving Frameworks (ASFs) are specialized orchestration and runtime systems that support the scalable, robust, and efficient serving of LLM-driven agentic workflows—multi-stage computational graphs involving goal-directed, tool-integrating, often multi-agent pipelines. ASFs distinguish themselves from legacy serving and workflow engines by exposing explicit workflow structure, decoupling logical specification from execution details, and enabling cross-layer optimizations for throughput, latency, resource utilization, and service-level objectives (SLOs). Foundational design patterns include dedicated resource pooling per workflow stage, adaptive routing and scheduling, profile-guided optimization, decoupled state management, and policy-driven governance. These systems serve as the architectural backbone for modern agentic AI solutions across data analytics, enterprise automation, cloud platforms, and multi-modal applications.

1. Core Principles and Formal Definitions

ASFs formalize the serving of agentic workloads as directed acyclic graphs (DAGs) of operators—LLM calls, tool invocations, and decision points—with explicit edges marking data and control dependencies (Pagonas et al., 15 Oct 2025, Shen et al., 2 Sep 2025). For a workflow W = (V, E), vertices v ∈ V represent operators, and the edges E express the dependency structure. Stage isolation is an essential property: V is partitioned into disjoint “stages” S_1, …, S_k, each assigned an exclusive resource pool P_j, i.e., P_j ∩ P_j′ = ∅ for j ≠ j′ (Pagonas et al., 15 Oct 2025). Key ASF abstractions include:

  • Declarative specification: Application logic defined independently of hardware, model, or batch parameters (e.g., Murakkab’s DSL for DAGs) (Chaudhry et al., 22 Aug 2025).
  • Resource pools and scheduling: Dedicated worker sets per stage with per-operator admission and priority-based dispatch to enforce SLOs, minimize inter-stage contention, and optimize utility (Pagonas et al., 15 Oct 2025).
  • Separation of concerns: Workflow logic (control/data flow) is uncoupled from execution configuration (resource selection, scaling, admission), enabling cross-layer optimization (Chaudhry et al., 22 Aug 2025).
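These abstractions can be illustrated with a minimal sketch: a workflow DAG whose operators are grouped into stages, with stage isolation enforced by rejecting overlapping resource pools. All names here (`Operator`, `Workflow`, `assign_pool`) are invented for illustration, not taken from any cited framework.

```python
from dataclasses import dataclass, field

@dataclass
class Operator:
    name: str    # e.g., an LLM call, tool invocation, or decision point
    stage: str   # the stage this operator is assigned to

@dataclass
class Workflow:
    operators: dict = field(default_factory=dict)   # name -> Operator
    edges: list = field(default_factory=list)       # (src, dst) dependencies
    pools: dict = field(default_factory=dict)       # stage -> set of workers

    def add_op(self, name, stage):
        self.operators[name] = Operator(name, stage)

    def add_edge(self, src, dst):
        self.edges.append((src, dst))

    def assign_pool(self, stage, workers):
        # Enforce stage isolation: a worker may belong to at most one pool.
        for other, pool in self.pools.items():
            if other != stage and pool & workers:
                raise ValueError(f"workers {pool & workers} already in {other}")
        self.pools[stage] = set(workers)

wf = Workflow()
wf.add_op("plan", stage="S1")
wf.add_op("retrieve", stage="S2")
wf.add_op("answer", stage="S2")
wf.add_edge("plan", "retrieve")
wf.add_edge("retrieve", "answer")
wf.assign_pool("S1", {"gpu0"})
wf.assign_pool("S2", {"gpu1", "gpu2"})   # disjoint from S1's pool
```

Attempting to assign "gpu0" to a second stage would raise, mirroring the P_j ∩ P_j′ = ∅ invariant.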

The ASF runtime instantiates mappings f: V × Requests → (executor, hardware, parameters), subject to constraints over latency, accuracy, energy, and cost per request (Chaudhry et al., 22 Aug 2025). Profiling and optimization integrate per-task and per-model cost/accuracy data, typically encoded as MILP constraints for global or per-stage solving (Chaudhry et al., 22 Aug 2025, Dai et al., 26 Nov 2025).
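The profile-guided selection can be illustrated by a brute-force stand-in for the MILP solve: search per-operator model assignments for the cheapest configuration that meets latency and accuracy bounds. The profile numbers below are invented for the example.

```python
from itertools import product

# Hypothetical profiles: (operator, model) -> (latency_s, cost_usd, accuracy).
profiles = {
    ("plan", "small"):   (0.2, 0.001, 0.90),
    ("plan", "large"):   (0.8, 0.010, 0.97),
    ("answer", "small"): (0.3, 0.002, 0.85),
    ("answer", "large"): (1.0, 0.012, 0.96),
}

def choose_config(ops, models, slo_latency, min_accuracy):
    """Pick per-operator models minimizing cost under latency/accuracy bounds."""
    best, best_cost = None, float("inf")
    for assignment in product(models, repeat=len(ops)):
        lat = sum(profiles[(op, m)][0] for op, m in zip(ops, assignment))
        cost = sum(profiles[(op, m)][1] for op, m in zip(ops, assignment))
        acc = min(profiles[(op, m)][2] for op, m in zip(ops, assignment))
        if lat <= slo_latency and acc >= min_accuracy and cost < best_cost:
            best, best_cost = dict(zip(ops, assignment)), cost
    return best, best_cost

config, cost = choose_config(["plan", "answer"], ["small", "large"],
                             slo_latency=1.5, min_accuracy=0.9)
# Under these numbers: {"plan": "small", "answer": "large"}
```

A real MILP handles the exponential assignment space that exhaustive search cannot; the objective and constraints are structurally the same.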

2. Architectures and System Components

ASFs universally adopt layered, modular architectures to address heterogeneity, isolation, and flexibility in agentic serving. Representative architectural features include:

| Layer | Example Implementation | Key Responsibilities |
|---|---|---|
| Resource/Infra | GPU/CPU pools, network substrate (e.g., vLLM, Triton) | Hardware abstraction and provisioning |
| Execution/Task | Worker engines, event loops, autoscaling | Admission, parallel execution, fine-grained scaling |
| Orchestration/Workflow | Orchestrator (DAG or FSM executor), priority manager | Dependency scheduling, SLO slack computation |
| State and Memory | KV-caches, vector DBs (e.g., Pancake), managedDict | State sharing, caching, migration |
| Data/Control Plane | Shim APIs, set/reset protocol (SD-AS), metrics collectors | Real-time configuration, monitoring, closed-loop control |
| Optimization/Policy | MILP solvers, beam search, malleability interface | Routing, resource assignment, QoS control |
| Governance/Monitoring | Audit logs, policy enforcement (ELK, Prometheus) | Compliance, HITL, role-based access, auditing |

Notable instantiations:

  • Cortex: Stage-isolated pools, per-stage queueing and priority dispatch, two-tier KV-caching (Pagonas et al., 15 Oct 2025).
  • Aragog: One-time routing via accuracy-preserving config discovery, per-stage scheduling adaptive to runtime load (Dai et al., 26 Nov 2025).
  • Murakkab: Declarative workflow DAG, MILP for global resource/cost/latency optimization, epoch-based and local adaptive scaling (Chaudhry et al., 22 Aug 2025).
  • Pancake: Three-tier multi-agent memory system, hybrid CPU/GPU acceleration, shared-private vector store (Hu et al., 25 Feb 2026).
  • Nalar: Auto-generated Python futures for agent/tool calls, managed state layer, two-level adaptive control (Laju et al., 8 Jan 2026).
  • Software-Defined Agentic Serving: SDN-inspired plane split, programmable data-plane shim, intent-driven closed control loops (Agarwal et al., 6 Jan 2026).

ASF architectural blueprints consistently decouple workflow structure, state management, resource allocation, and high-level policy/intent.
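This decoupling can be sketched concretely: a logical workflow spec names only operators and dependencies, while a separate execution config binds each operator to model, hardware, and batch choices. All field names below are invented for illustration.

```python
# Logical specification: control/data flow only, no execution details.
workflow_spec = {
    "operators": ["extract", "summarize", "review"],
    "edges": [("extract", "summarize"), ("summarize", "review")],
}

# Execution configuration: chosen separately (e.g., by an optimizer).
execution_config = {
    "extract":   {"model": "small-llm", "hardware": "cpu",  "batch": 8},
    "summarize": {"model": "large-llm", "hardware": "gpu0", "batch": 2},
    "review":    {"model": "small-llm", "hardware": "gpu1", "batch": 4},
}

def bind(spec, config):
    """Join the logical spec with an execution config; fail on unbound operators."""
    unbound = [op for op in spec["operators"] if op not in config]
    if unbound:
        raise KeyError(f"no execution config for {unbound}")
    return {op: {"deps": [s for s, d in spec["edges"] if d == op], **config[op]}
            for op in spec["operators"]}

plan = bind(workflow_spec, execution_config)
```

Because the spec never mentions hardware or models, the optimizer can swap execution configs (rescale, re-route, change models) without touching application logic.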

3. Resource Management, Scheduling, and Optimization

ASFs implement advanced algorithms for resource allocation, task scheduling, and cost/latency-quality tradeoff management:

  • Resource Pooling and Stage Isolation: Resources (GPUs, inference servers) partitioned per stage to prevent KV-cache interference and cross-stage contention; isolated pools result in reduced memory footprint (e.g., 3.2 GB vs 5.1 GB), 10% p99 latency reduction, and 12% throughput gain (Pagonas et al., 15 Oct 2025).
  • MILP and Profile-Guided Optimization: Systems like Murakkab use declarative DAGs and offline model/workflow profiles to solve multi-objective MILPs (min energy/cost, meet SLOs), achieving up to 4.3× cost reduction under mixed workloads (Chaudhry et al., 22 Aug 2025).
  • Just-in-Time Model Routing: Aragog decomposes the configuration selection problem into offline identification of all accuracy-equivalent model assignments and a fast, beam-search–based per-stage scheduler, yielding up to 217% throughput gains and ≤1% accuracy drop (Dai et al., 26 Nov 2025).
  • Cross-Stage Pipelining and Batching: Halo and Nova maximize GPU/SM utilization via DAG-consolidated batching, adaptive phase-aware batch sizing, and cross-stage parallelization. Halo achieves up to 18.6× offline and 4.7× online throughput improvement (Shen et al., 2 Sep 2025), Nova attains 23.3% max E2E latency reduction (Xu et al., 25 Sep 2025).
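The SLO-aware per-stage dispatch described above can be sketched as a priority queue ordered by slack—time remaining to the deadline minus estimated remaining work. This is a simplification with invented slack arithmetic, not the cited schedulers.

```python
import heapq

class StageQueue:
    def __init__(self):
        self._heap = []
        self._seq = 0   # tie-breaker so heapq never compares request dicts

    def submit(self, request, deadline, est_remaining_work, now):
        slack = (deadline - now) - est_remaining_work
        heapq.heappush(self._heap, (slack, self._seq, request))
        self._seq += 1

    def dispatch(self):
        # Pop the request with the smallest slack (most urgent) first.
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]

q = StageQueue()
q.submit({"id": "a"}, deadline=10.0, est_remaining_work=2.0, now=0.0)  # slack 8
q.submit({"id": "b"}, deadline=5.0,  est_remaining_work=3.0, now=0.0)  # slack 2
q.submit({"id": "c"}, deadline=6.0,  est_remaining_work=1.0, now=0.0)  # slack 5
order = [q.dispatch()["id"] for _ in range(3)]   # most urgent first: b, c, a
```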

Table: Resource Efficiency Impact (Murakkab 24hr, Multi-workload) (Chaudhry et al., 22 Aug 2025)

| Policy | GPUs Used | Energy (MWh) | Cost (k$) |
|---|---|---|---|
| Static | 2,560 | 78.4 | 230.0 |
| Opt (per-wf) | 1,151 | 27.1 | 56.2 |
| Opt+Mult (global) | 908 | 21.6 | 46.5 |

These strategies collectively reduce cost, energy, and latency while satisfying diverse SLOs.

4. State, Memory, and Plan Caching

Dynamic state and memory management are critical for agentic workloads:

  • Multi-tier KV/Vector Caching: Two-level (engine-local + shared) caches optimize LLM decode throughput and support cache sharing across requests (Cortex, Pancake) (Pagonas et al., 15 Oct 2025, Hu et al., 25 Feb 2026).
  • Profiled and Predictive Memory: Pancake models per-agent memory access patterns as finite-state machines (FSMs) to drive proactive cache prefetch, sharply reducing ANN search overhead (up to 4.29× throughput over baselines) (Hu et al., 25 Feb 2026).
  • Plan and Template Caching: Agentic Plan Caching extracts and stores skeleton plan templates at test time, enabling lightweight adaptation and 46.62% serving cost reduction without significant accuracy degradation (Zhang et al., 17 Jun 2025).
  • Managed State Layer: Nalar decouples logical from physical state, enabling transparent migration and safe reuse for retries, and allowing load-balancing and efficient session management (Laju et al., 8 Jan 2026).

The convergence of cache-tiering, profile-driven search, and plan adaptation enables high concurrency, memory efficiency, and robust recovery in ASF deployments.
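The tiered caching above can be sketched as a two-tier store: a small engine-local tier backed by a larger shared tier, with promotion on shared hits. Capacities and eviction policy here are invented for the example.

```python
from collections import OrderedDict

class TwoTierCache:
    def __init__(self, local_cap=4, shared_cap=64):
        self.local = OrderedDict()    # small, per-engine tier
        self.shared = OrderedDict()   # larger tier shared across engines
        self.local_cap, self.shared_cap = local_cap, shared_cap

    def get(self, key):
        if key in self.local:
            self.local.move_to_end(key)      # LRU bump on local hit
            return self.local[key]
        if key in self.shared:
            value = self.shared[key]
            self._put_local(key, value)      # promote on shared hit
            return value
        return None

    def put(self, key, value):
        self._put_local(key, value)
        self.shared[key] = value
        self.shared.move_to_end(key)
        if len(self.shared) > self.shared_cap:
            self.shared.popitem(last=False)  # evict LRU shared entry

    def _put_local(self, key, value):
        self.local[key] = value
        self.local.move_to_end(key)
        if len(self.local) > self.local_cap:
            self.local.popitem(last=False)   # evict LRU local entry

cache = TwoTierCache(local_cap=2)
cache.put("plan:report", "template-A")
cache.put("plan:email", "template-B")
cache.put("plan:code", "template-C")   # evicts "plan:report" from local tier
```

A later `get("plan:report")` misses locally, hits the shared tier, and promotes the entry back into the local tier—the pattern behind engine-local + shared KV and plan caches.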

5. Security, Governance, and Safety

ASFs must address attack surfaces distinct from those of vanilla LLM deployments, owing to complex agent composition, tool use, and inter-agent messaging:

  • Task Decomposition and Policy Enforcement: Modularity enables insertion of policy checks, input sanitization, and schema validation at delegation and aggregation points (Nguyen et al., 16 Dec 2025).
  • Multi-Agent Hardening: Peer-to-peer validation among domain agents (e.g., in AutoGen), along with redundancy and orthogonal policy enforcement, increases attack refusal rates by ≈21.5 percentage points vs. hierarchical-only orchestration (CrewAI) (Nguyen et al., 16 Dec 2025).
  • Defensive Patterns and Failure Modes: Observed behaviors include not only explicit refusal and redirection, but also unsafe completions and hallucinated compliance; robust frameworks must monitor for such misbehaviors (Nguyen et al., 16 Dec 2025).
  • Governance and Monitoring: Layers for logging, auditing, and SLA enforcement, supported by real-time and periodic red-team evaluation, are central to continuous security hardening and compliance (Bandara et al., 27 Jan 2026, Nguyen et al., 16 Dec 2025).

Security and governance mechanisms are embedded orthogonally across the ASF stack, requiring integration both at orchestration and tool/plugin interfaces.
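Policy enforcement at a delegation point can be sketched as a wrapper that runs schema validation and input sanitization before any payload reaches a downstream tool or agent. The rules below are placeholders, not from the cited work.

```python
import re

# Hypothetical blocklist of injection/destructive patterns.
BLOCKLIST = [re.compile(p, re.IGNORECASE)
             for p in (r"ignore (all )?previous instructions", r"rm\s+-rf")]

def validate_schema(payload, required_fields):
    missing = [f for f in required_fields if f not in payload]
    if missing:
        raise ValueError(f"schema violation: missing {missing}")

def sanitize(text):
    for pattern in BLOCKLIST:
        if pattern.search(text):
            raise ValueError(f"policy violation: matched {pattern.pattern!r}")
    return text

def delegate(payload, tool):
    """Run policy checks, then hand the payload to the downstream tool."""
    validate_schema(payload, required_fields=["task", "input"])
    payload["input"] = sanitize(payload["input"])
    return tool(payload)

result = delegate({"task": "summarize", "input": "quarterly report text"},
                  tool=lambda p: f"handled:{p['task']}")
```

The same wrapper runs at aggregation points, so a compromised sub-agent's output is re-checked before it influences peers.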

6. Extensible Control and Adaptation

ASFs emphasize runtime adaptability and programmability, centralizing control, telemetry, and automation:

  • Intent-Driven, Software-Defined Control: A data/metrics/control-plane split exposes tunable primitives (e.g., batch size, stream mode, routing rules) that are adjusted in closed-loop response to telemetry, enabling dynamic operation under complex SLOs (Agarwal et al., 6 Jan 2026).
  • Adaptive Workflow and Data Plane: APIs for fine-grained policy and primitive installation permit real-time adaptation to latency breaches, load variation, and job prioritization without redeployment (Agarwal et al., 6 Jan 2026).
  • Policy-Driven Scheduling and Migration: Two-level control (component-level event-driven mechanisms + periodic global aggregation) yields orders-of-magnitude reductions in tail latency and supports large-scale futures management (130k+ active workflows) (Laju et al., 8 Jan 2026).
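The closed-loop adjustment of a tunable primitive can be sketched in a few lines: each epoch, a controller reads p99 latency telemetry and nudges the batch size toward the SLO target. Gains and bounds are invented for the example.

```python
def control_step(batch_size, p99_latency, slo, lo=1, hi=64):
    """One closed-loop iteration: shrink batches on SLO breach, grow otherwise."""
    if p99_latency > slo:
        batch_size = max(lo, batch_size // 2)   # back off aggressively on breach
    elif p99_latency < 0.8 * slo:
        batch_size = min(hi, batch_size + 4)    # grow cautiously when under target
    return batch_size

bs = 32
for p99 in [2.5, 2.5, 0.9, 0.9]:   # simulated telemetry, SLO = 2.0 s
    bs = control_step(bs, p99, slo=2.0)
```

Real controllers adjust many primitives at once (stream mode, routing rules, replica counts) and carry intent constraints, but the sense-decide-actuate loop is the same.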

Open research questions persist in formalizing malleability contracts (latitude for model swaps, workflow pruning), controlling speculative computation, and optimizing for composite QoS objectives (Pagonas et al., 15 Oct 2025, Dai et al., 26 Nov 2025).

7. Broader Context and Research Frontier

ASFs are recognized as instantiations of the agentic service computing (ASC) paradigm—the continuous engineering of autonomous, multi-agent, self-adaptive services orchestrated across design, deployment, operation, and evolution phases (Deng et al., 29 Sep 2025). ASF best practices mandate:

  • Declarative separation of logic and execution;
  • Policy-based, observability-rich governance;
  • Standardized communication protocols (A2A, CNP, ANP, MCP);
  • Robust safety guardrails and auditable escalation paths;
  • Continuous evaluation and adaptive retraining (“evolution-by-data”).

Emerging trends include large-scale memory sharing, proactive orchestration, resource efficiency (“green ASC”), and hybrid human–AI collaboration/oversight (Deng et al., 29 Sep 2025, Bandara et al., 27 Jan 2026). Benchmarking, open-interface standardization, and scalable design for multi-agent society coordination are active areas for ASF research and development.
