
CORTEX Prompt: Workflow-Aware Serving

Updated 10 November 2025
  • CORTEX Prompt is a workflow-aware serving platform that isolates each logical stage with dedicated resource pools and specialized scheduling.
  • It improves performance by reducing GPU KV-cache usage, doubling effective batch sizes, and lowering latency through SLO-aware prioritization.
  • It supports advanced strategies like malleable resource management and speculative execution to efficiently scale agentic workflows.

Cortex is a workflow-aware serving platform architected for agentic workloads that emphasizes strict stage isolation: it assigns dedicated resource pools and specialized scheduling to each logical stage—such as LLM calls or tool executions—within an agentic workflow. By isolating resources and employing sophisticated, per-stage scheduling and autoscaling, Cortex eliminates cross-stage compute and memory interference, substantially improves key–value (KV) cache utilization, and delivers both higher throughput and more predictable latency. This platform also lays the groundwork for future serving strategies appropriate to agent-native workloads, such as malleable resource management, speculative execution, and workflow-wide multi-tiered caching.

1. Architectural Principles: Stage Isolation and Workflow Structure

Cortex’s central design is stage isolation. Each stage of a multi-step agentic workflow (e.g., “SQL generator,” “SQL executor,” “SQL error fixer”) is hosted in its own homogeneous engine pool. This architecture avoids shared-pool effects where dissimilar prompt structures or cache footprints would otherwise compete for memory and compute resources.

Logical Components

  • Orchestrator: Receives client requests, loads a compiled operator-call graph (LLM/tool nodes), and tracks per-request SLO (service-level objective) slack.
  • Stage Pools: Each stage maintains:
    • A local priority queue (ordered by computed urgency)
    • A bank of homogeneous engines (e.g., identical GPU workers)
    • A mini-scheduler handling admission, load balancing, KV cache affinity, and local autoscaling
  • Engine-Allocation Layer: Enforces per-pool utilization targets (scaling triggers on queue length, utilization signals).

In a canonical NL2SQL workflow, Cortex separates:

  1. SQL generation (LLM inference)
  2. SQL execution (external database/tool)
  3. SQL error fixing (LLM inference), iterating until syntactic/semantic success or retry budget exhaustion.
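A minimal sketch of what such a compiled operator-call graph might look like for this three-stage loop; the node and field names are illustrative, not Cortex's actual API:

```python
from dataclasses import dataclass, field

@dataclass
class Operator:
    name: str
    kind: str                                   # "llm" or "tool"
    stage_pool: str                             # which isolated pool serves this operator
    next_on_success: list[str] = field(default_factory=list)
    next_on_failure: list[str] = field(default_factory=list)

# NL2SQL workflow: generate -> execute -> (fix -> execute)* until the query
# succeeds or the retry budget is exhausted.
NL2SQL_GRAPH = {
    "sql_generator": Operator("sql_generator", "llm", "pool_gen",
                              next_on_success=["sql_executor"]),
    "sql_executor": Operator("sql_executor", "tool", "pool_db",
                             next_on_success=[],            # workflow complete
                             next_on_failure=["sql_error_fixer"]),
    "sql_error_fixer": Operator("sql_error_fixer", "llm", "pool_fix",
                                next_on_success=["sql_executor"]),
}
```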

2. Resource Allocation, Scheduling, and Prioritization

Cortex’s resource management and scheduling strategies draw on classic queueing theory, SLO-aware heuristics, and explicit cache affinity models.

Per-Stage Capacity Sizing

  • Implied M/M/c model: for stage $k$ with arrival rate $\lambda_k$ and service rate $\mu_k$ (tokens/sec or queries/sec), the number of engines $c_k$ is chosen so the Erlang-C tail queueing probability stays below a target threshold: $\min\{c_k : \mathrm{ErlangC}(c_k, \rho_k) \leq \delta_k\}$, where $\rho_k = \lambda_k/\mu_k$.
  • In practice, Cortex measures the empirical 95th-percentile ($p_{95}$) latency of each stage and adjusts the replica count up or down until the observed $p_{95}$ aligns with SLO-derived budgets.
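A minimal sketch of this sizing rule in Python, assuming a hypothetical `size_stage` helper; the source does not specify Cortex's actual capacity planner:

```python
import math

def erlang_c(c: int, rho: float) -> float:
    """Probability that an arriving task must queue in an M/M/c system
    with offered load rho = lambda / mu (Erlang-C formula)."""
    if rho >= c:
        return 1.0  # unstable regime: the queue grows without bound
    inv_b = sum((rho ** n) / math.factorial(n) for n in range(c))
    top = (rho ** c) / (math.factorial(c) * (1 - rho / c))
    return top / (inv_b + top)

def size_stage(arrival_rate: float, service_rate: float, delta: float) -> int:
    """Smallest engine count c_k with ErlangC(c_k, rho_k) <= delta_k."""
    rho = arrival_rate / service_rate
    c = max(1, math.ceil(rho))
    while erlang_c(c, rho) > delta:
        c += 1
    return c

# e.g. 12 req/s arriving, each engine serves 2 req/s, 5% queueing tolerance
print(size_stage(arrival_rate=12.0, service_rate=2.0, delta=0.05))
```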

SLO-Aware Prioritization

  • Each request $i$ tracks its remaining slack $s_i$ (time before its deadline).
  • For each candidate operator $j$, the orchestrator estimates the expected service time $t_{ij}$ and the selectivity $\sigma_j$ (the probability that the workflow continues past $j$).
  • Priority key: $P_{ij} = s_i / (\sigma_j \cdot t_{ij})$ (higher $P_{ij}$ = more urgent).
  • Stage schedulers dequeue the highest-$P_{ij}$ tasks first, subject to admission-control limits.
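A small sketch of how such a per-stage priority queue might be maintained, using Python's heapq (a min-heap, so the negated key is stored); the task fields are illustrative:

```python
from dataclasses import dataclass, field
import heapq

@dataclass(order=True)
class Task:
    neg_priority: float                      # heapq is a min-heap, so store -P_ij
    request_id: str = field(compare=False)
    operator: str = field(compare=False)

def priority_key(slack_s: float, selectivity: float, est_service_s: float) -> float:
    """P_ij = s_i / (sigma_j * t_ij); higher values are dequeued first."""
    return slack_s / (selectivity * est_service_s)

queue: list[Task] = []
p = priority_key(slack_s=1.2, selectivity=0.7, est_service_s=0.4)
heapq.heappush(queue, Task(neg_priority=-p, request_id="r42", operator="sql_error_fixer"))
most_urgent = heapq.heappop(queue)           # highest-P_ij task comes out first
```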

Load Balancing and KV-Cache Affinity

  • Upon dispatch, schedulers rank replicas by:
    1. Whether the replica holds a warm KV cache for the current prompt template.
    2. The number of in-flight tasks (fewest first).
  • Idle, cache-warm engines are favored, even for lower-priority jobs, to reduce cold start penalties.
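A hedged sketch of this two-level ranking; the Replica fields and pick_replica helper are assumptions, not Cortex's actual dispatch API:

```python
from dataclasses import dataclass

@dataclass
class Replica:
    name: str
    warm_templates: set[str]     # prompt templates with resident KV cache
    in_flight: int

def pick_replica(replicas: list[Replica], template: str) -> Replica:
    """Prefer cache-warm replicas, then break ties by lowest load."""
    return min(
        replicas,
        key=lambda r: (template not in r.warm_templates, r.in_flight),
    )

replicas = [
    Replica("gpu-0", {"sql_gen"}, in_flight=3),
    Replica("gpu-1", {"sql_fix"}, in_flight=1),
]
print(pick_replica(replicas, "sql_gen").name)   # gpu-0: warm cache wins despite higher load
```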

Dynamic Scaling

  • Monitors queue length $Q_k(t)$ and utilization $U_k(t)$.
  • If $Q_k(t) > Q_\mathrm{high}$ or $U_k(t) > U_\mathrm{target}$, the system scales out (up to a per-stage cap $C_\mathrm{max}$).
  • If $Q_k(t) < Q_\mathrm{low}$ and $U_k(t) < U_\mathrm{target,low}$, the system scales in.
  • Thresholds are customizable per stage to reflect criticality to SLOs.
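Expressed as code, the scale-out/scale-in rule might look like the following; the threshold values are illustrative defaults, not figures reported for Cortex:

```python
def scaling_decision(queue_len: int, utilization: float, replicas: int, *,
                     q_high: int = 32, q_low: int = 4,
                     u_target: float = 0.85, u_target_low: float = 0.40,
                     c_max: int = 16, c_min: int = 1) -> int:
    """Return the desired replica count for one stage pool.
    Thresholds are per-stage tunables in Cortex; the defaults here are hypothetical."""
    if (queue_len > q_high or utilization > u_target) and replicas < c_max:
        return replicas + 1          # scale out
    if queue_len < q_low and utilization < u_target_low and replicas > c_min:
        return replicas - 1          # scale in
    return replicas                  # hold steady
```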

3. Stage Isolation: Impact on KV-Cache and Performance

A principal advantage of Cortex’s stage isolation is efficient GPU KV-cache utilization:

  • Shared Pool: Each GPU must host contexts for all prompt variants (e.g., both SQL generation and error-fixing), leading to an aggregate context size ≈ 8 GB/GPU.
  • Stage-Isolated Pools: each GPU holds only the context for its assigned stage (≈4 GB per GPU per role), with roles partitioned across different replicas; resident KV memory per GPU is thereby reduced by ~40%.
  • Consequences:
    • Effective GPU batch size doubles (from 4 to 8 sequences).
    • Beam width increases without host spillover.
    • In NL2SQL loops, end-to-end throughput rises by 25%.
    • 95th-percentile latency falls by 15–20 ms in benchmark workloads.

This demonstrates that strict memory partitioning, rather than opportunistic KV context sharing, is crucial under diverse agentic workloads.
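To make the arithmetic concrete, a back-of-envelope sketch; the per-GPU KV budget and per-sequence footprint below are hypothetical, chosen only to be consistent with the figures reported above:

```python
# Hypothetical per-GPU KV-cache budget and per-sequence decode footprint; the
# source gives only the 8 GB vs 4 GB context residency and the 4 -> 8 batch sizes.
kv_budget_gb = 12.0
per_sequence_kv_gb = 1.0

shared_context_gb = 8.0      # all prompt variants resident on every GPU
isolated_context_gb = 4.0    # only the stage's own context resident

shared_batch = (kv_budget_gb - shared_context_gb) / per_sequence_kv_gb      # 4 sequences
isolated_batch = (kv_budget_gb - isolated_context_gb) / per_sequence_kv_gb  # 8 sequences
print(shared_batch, isolated_batch)
```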

4. Multi-Tiered Cache and Workflow Scheduling Paradigms

Cortex outlines both implemented and forward-looking multi-tiered caching and scheduling:

Multi-Tiered Agentic State Cache

  • Tier 1: Engine-local KV cache (in-GPU, prompt activations).
  • Tier 2: Workflow-wide shared, in-memory publish/subscribe cache (e.g., for intermediate SQL results or schema embeddings).
  • Tier 3: Persistent backing store (long-term, cold-state objects).
  • The design enables “publish” and “subscribe” primitives for agents, allowing subsequent tasks to retrieve agentic state and avoid recomputation, particularly in concurrent or batched workflows.
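A minimal sketch of the publish/subscribe primitives over tiers 2 and 3 (tier 1 lives inside the inference engine and is not modelled here); class and method names are illustrative:

```python
from typing import Any, Optional

class AgentStateCache:
    """Sketch of the workflow-wide state cache behind publish/subscribe."""

    def __init__(self) -> None:
        self.shared: dict[str, Any] = {}       # Tier 2: workflow-wide in-memory cache
        self.persistent: dict[str, Any] = {}   # Tier 3: stand-in for a durable store

    def publish(self, key: str, value: Any, durable: bool = False) -> None:
        self.shared[key] = value
        if durable:
            self.persistent[key] = value

    def subscribe(self, key: str) -> Optional[Any]:
        # Hot tier first, then fall back to the persistent store.
        if key in self.shared:
            return self.shared[key]
        return self.persistent.get(key)

cache = AgentStateCache()
cache.publish("schema:orders", {"columns": ["id", "total"]}, durable=True)
print(cache.subscribe("schema:orders"))   # later tasks reuse state instead of recomputing
```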

Workflow Scheduling: Descriptive Pseudocode

The orchestrator and schedulers operate as follows:

Main Loop:

  • For each request $R$:
    • Compile the call graph $G(R)$ and initialize slack $s_R$ to the user's SLO.
    • Estimate per-operator selectivity $\sigma$.
  • While $R$ is incomplete and $s_R > 0$:
    • Enqueue ready operators with priority $P = s_R / (\sigma \cdot \hat t)$.
    • Update $s_R \gets s_R - \text{elapsed time}$.

Stage-Pool Scheduler:

  • If local queue nonempty and engine idle:
    • Dequeue the highest-$P$ task and dispatch it to the least-loaded, cache-warm replica.
  • If queue or utilization cross thresholds:
    • Trigger autoscaler, scaling out/in as needed.
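Putting the orchestrator loop into runnable form, a compact Python sketch; every helper name here (compile_call_graph, ready_operators, enqueue, wait_for_progress) is hypothetical, standing in for machinery the source describes only in prose:

```python
import time

def serve_request(request, orchestrator, stage_pools):
    """Orchestrator main loop (sketch); helper methods are illustrative."""
    graph = orchestrator.compile_call_graph(request)    # operator-call graph G(R)
    slack = request.slo_seconds                         # s_R
    while not graph.complete() and slack > 0:
        started = time.monotonic()
        for op in graph.ready_operators():
            # P = s_R / (sigma * t_hat): more urgent work is dequeued first
            priority = slack / (op.selectivity * op.est_service_s)
            stage_pools[op.stage].enqueue(priority, op)
        graph.wait_for_progress()                       # stage pools execute operators
        slack -= time.monotonic() - started             # s_R <- s_R - elapsed
```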

5. Agent-Native Extensions: Malleability and Speculation

Cortex proposes future extensions for agent-oriented serving:

Malleable Resource Management

  • Each workflow stage exposes tunable “malleability envelope” parameters (model size, parallelism, retry budget).
  • When slack shrinks or cluster load spikes, the orchestrator can downshift a stage within its envelope (e.g., a smaller model, reduced parallelism, or a tighter retry budget).
  • Conversely, under light load, the orchestrator can allocate heavier resources (multiple GPUs, broader batches) to low-criticality branches.
  • Requires a uniform service-manifest interface declaring each stage’s malleability and cost profile.
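What such a service manifest might look like, written as a Python dict purely for illustration; the source says only that each stage declares its malleability and cost profile, so the field names and values are assumptions:

```python
# Hypothetical manifest for one stage; none of these field names are specified by Cortex.
sql_fixer_manifest = {
    "stage": "sql_error_fixer",
    "malleability": {
        "model_size": ["7B", "13B", "70B"],       # allowed model variants
        "tensor_parallelism": [1, 2, 4],          # allowed GPU parallelism degrees
        "retry_budget": {"min": 1, "max": 5},
    },
    "cost_profile": {
        "gpu_seconds_per_call": {"7B": 0.4, "13B": 0.9, "70B": 3.2},
    },
}
```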

Speculative Execution of Workflow Branches

  • Orchestrator predicts likely next actions and pre-warms (or even pre-executes) associated stage pools (e.g., top-K SQL repairs).
  • Correct speculation leads to latency savings up to the speculated step’s execution time; incorrect speculations are bounded by admission and speculative-budget controls.
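A sketch of speculation bounded by a budget; predict_next_operators, prewarm, and the GPU-seconds budget unit are assumptions used only for illustration:

```python
def speculate(orchestrator, request, stage_pools, top_k: int = 3,
              budget_gpu_s: float = 2.0) -> None:
    """Pre-warm (or pre-execute) the top-K predicted next operators while
    staying within a speculative budget; all helper names are illustrative."""
    spent = 0.0
    for op, prob in orchestrator.predict_next_operators(request, k=top_k):
        cost = op.est_service_s
        if spent + cost > budget_gpu_s:
            break                                # speculative budget exhausted
        stage_pools[op.stage].prewarm(op)        # e.g. load KV cache for the template
        spent += cost
```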

A plausible implication is that, as agentic workflows become more complex—e.g., with branching, retries, and multi-tool chains—such malleability and speculation will become essential for both efficiency and consistent SLO adherence.

6. Performance Outcomes and Future Directions

Cortex demonstrates that agentic workflow structure should determine not only control flow but also physical resource partitioning, prioritization, and cache residency.

The reported improvements, by optimization:

  • Stage isolation: ~40% KV-cache reduction per GPU; 2× effective batch size; +25% NL2SQL throughput; 15–20 ms lower p95 latency.
  • SLO-aware prioritization: more predictable tail latency.
  • Malleability/speculation: consistency under SLO pressure.

Cortex’s “prompt” to the community is to architect serving systems around workflow-call graphs, isolate each logical stage into its own compute/memory pool, and adopt stage-specific, SLO-aware allocation and speculative, malleable scheduling. By doing so, systems practitioners can deliver predictable, scalable, and efficient serving of increasingly complex agentic workloads, enabling new serving paradigms for agent-native ML architectures. The model supports evolutionary extensions toward shared-state multi-tiered caching, more flexible resource sharing, and speculative pre-execution throughout agent workflow graphs (Pagonas et al., 15 Oct 2025).
