CORTEX Prompt: Workflow-Aware Serving
- CORTEX Prompt is a workflow-aware serving platform that isolates each logical stage with dedicated resource pools and specialized scheduling.
- It improves performance by reducing GPU KV-cache usage, doubling effective batch sizes, and lowering latency through SLO-aware prioritization.
- It supports advanced strategies like malleable resource management and speculative execution to efficiently scale agentic workflows.
Cortex is a workflow-aware serving platform architected for agentic workloads that emphasizes strict stage isolation: it assigns dedicated resource pools and specialized scheduling to each logical stage—such as LLM calls or tool executions—within an agentic workflow. By isolating resources and employing sophisticated, per-stage scheduling and autoscaling, Cortex eliminates cross-stage compute and memory interference, substantially improves key–value (KV) cache utilization, and delivers both higher throughput and more predictable latency. This platform also lays the groundwork for future serving strategies appropriate to agent-native workloads, such as malleable resource management, speculative execution, and workflow-wide multi-tiered caching.
1. Architectural Principles: Stage Isolation and Workflow Structure
Cortex’s central design is stage isolation. Each stage of a multi-step agentic workflow (e.g., “SQL generator,” “SQL executor,” “SQL error fixer”) is hosted in its own homogeneous engine pool. This architecture avoids shared-pool effects where dissimilar prompt structures or cache footprints would otherwise compete for memory and compute resources.
Logical Components
- Orchestrator: Receives client requests, loads a compiled operator-call graph (LLM/tool nodes), and tracks per-request SLO (service-level objective) slack.
- Stage Pools: Each stage maintains:
- A local priority queue (ordered by computed urgency)
- A bank of homogeneous engines (e.g., identical GPU workers)
- A mini-scheduler handling admission, load balancing, KV cache affinity, and local autoscaling
- Engine-Allocation Layer: Enforces per-pool utilization targets (scaling triggers on queue length, utilization signals).
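As a concrete illustration of these components, here is a minimal Python sketch; the class and field names (`Engine`, `Task`, `StagePool`) are illustrative assumptions rather than Cortex's actual API:

```python
import heapq
from dataclasses import dataclass, field

@dataclass
class Engine:
    """One homogeneous worker (e.g., a GPU replica) inside a stage pool."""
    engine_id: int
    in_flight: int = 0                                  # tasks currently running on this engine
    warm_templates: set = field(default_factory=set)    # prompt templates with a warm KV prefix

@dataclass(order=True)
class Task:
    """A unit of work queued at a stage; the heap pops the smallest sort_key first."""
    sort_key: float                                     # e.g., negative priority (see Section 2)
    request_id: str = field(compare=False)
    prompt_template: str = field(compare=False)

@dataclass
class StagePool:
    """Dedicated pool for one logical stage (e.g., 'SQL generator')."""
    name: str
    engines: list                                       # bank of homogeneous engines
    queue: list = field(default_factory=list)           # local priority queue (binary heap)

    def enqueue(self, task: Task) -> None:
        heapq.heappush(self.queue, task)

    def idle_engines(self) -> list:
        return [e for e in self.engines if e.in_flight == 0]
```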
In a canonical NL2SQL workflow, Cortex separates:
- SQL generation (LLM inference)
- SQL execution (external database/tool)
- SQL error fixing (LLM inference), iterating until syntactic/semantic success or retry budget exhaustion.
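The corresponding compiled operator-call graph could be written roughly as follows; the node names and dictionary layout are illustrative, and the retry budget is an assumed value:

```python
# Illustrative operator-call graph for the NL2SQL workflow described above.
# "llm" nodes run on LLM stage pools; "tool" nodes call external systems.
NL2SQL_GRAPH = {
    "sql_generator":   {"kind": "llm",  "next": ["sql_executor"]},
    "sql_executor":    {"kind": "tool", "next": ["done", "sql_error_fixer"]},  # branch on success/failure
    "sql_error_fixer": {"kind": "llm",  "next": ["sql_executor"]},             # loop back until success
    "done":            {"kind": "sink", "next": []},
}
MAX_RETRIES = 3  # assumed retry budget for the fix-and-re-execute loop
```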
2. Resource Allocation, Scheduling, and Prioritization
Cortex’s resource management and scheduling strategies draw on classic queueing theory, SLO-aware heuristics, and explicit cache affinity models.
Per-Stage Capacity Sizing
- Implied M/M/c model: For stage $i$, let $\lambda_i$ denote the arrival rate and $\mu_i$ the per-engine service rate (tokens/sec or queries/sec). The number of engines $c_i$ is chosen so that the Erlang-C probability of queueing stays below a target threshold: $P_{\text{wait}}(c_i, \lambda_i, \mu_i) \le \epsilon_i$, where $\rho_i = \lambda_i / (c_i \mu_i) < 1$.
- In practice, Cortex measures the empirical 95th-percentile ($p_{95}$) latency for each stage and adjusts the replica count up or down until the observed $p_{95}$ aligns with SLO-derived budgets (see the sizing sketch below).
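A minimal capacity-sizing sketch under these M/M/c assumptions. The Erlang-C formula is standard queueing theory; the stage rates and threshold in the example are placeholders, not values from the paper:

```python
import math

def erlang_c(c: int, lam: float, mu: float) -> float:
    """Probability an arriving job must wait in an M/M/c queue (Erlang-C)."""
    a = lam / mu                      # offered load in Erlangs
    if c <= a:                        # unstable configuration: queue grows without bound
        return 1.0
    rho = a / c
    top = (a ** c / math.factorial(c)) / (1.0 - rho)
    bottom = sum(a ** k / math.factorial(k) for k in range(c)) + top
    return top / bottom

def size_stage(lam: float, mu: float, eps: float = 0.05, c_max: int = 64) -> int:
    """Smallest engine count c such that the Erlang-C wait probability is below eps."""
    for c in range(1, c_max + 1):
        if erlang_c(c, lam, mu) <= eps:
            return c
    return c_max

# Example: a stage seeing 12 requests/sec, each engine serving ~3 requests/sec.
print(size_stage(lam=12.0, mu=3.0, eps=0.05))   # first stable, within-threshold engine count
```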
SLO-Aware Prioritization
- Each request $r$ tracks its remaining slack $\text{slack}_r$ (time before its deadline).
- For each candidate operator $o$, the orchestrator estimates its expected service time $\hat{t}_o$ and selectivity $s_o$ (the probability that the workflow continues past $o$).
- Priority Key: $k_o = s_o \hat{t}_o / \text{slack}_r$ (higher = more urgent).
- Stage schedulers dequeue the highest-$k_o$ tasks first, subject to admission-control limits (a minimal sketch follows this list).
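A minimal sketch of this prioritization, assuming the slack-based key reconstructed above; the function name and numeric examples are illustrative:

```python
import heapq

def priority_key(selectivity: float, expected_service_s: float, slack_s: float) -> float:
    """Urgency of an operator: more expected work relative to remaining slack => more urgent."""
    slack_s = max(slack_s, 1e-3)       # avoid division by zero when a deadline is imminent
    return (selectivity * expected_service_s) / slack_s

# Stage-local priority queue: heapq is a min-heap, so push negative keys
# to dequeue the most urgent (highest-key) task first.
queue: list = []
heapq.heappush(queue, (-priority_key(0.9, 2.0, 10.0), "req-1:sql_generator"))
heapq.heappush(queue, (-priority_key(0.5, 0.3, 30.0), "req-2:sql_generator"))
_, most_urgent = heapq.heappop(queue)
print(most_urgent)                      # req-1 has the larger key, so it runs first
```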
Load Balancing and KV-Cache Affinity
- Upon dispatch, schedulers rank replicas by:
- Whether the replica holds a warm KV cache for the current prompt template.
- The number of in-flight tasks (fewest first).
Idle, cache-warm engines are favored, even for lower-priority jobs, to reduce cold-start penalties (see the ranking sketch below).
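This ranking can be expressed as a single sort, as in the sketch below; the replica fields (`warm_templates`, `in_flight`) follow the earlier illustrative `Engine` sketch:

```python
def pick_replica(replicas, prompt_template: str):
    """Prefer replicas with a warm KV cache for this template, then the least-loaded one.

    Each replica is assumed to expose `warm_templates` (templates with resident KV
    prefixes) and `in_flight` (number of running tasks), as in the earlier sketch.
    """
    return min(
        replicas,
        key=lambda r: (prompt_template not in r.warm_templates,  # False (0) sorts before True (1)
                       r.in_flight),
    )
```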
Dynamic Scaling
- Monitors each stage's queue length $q_i$ and utilization $u_i$.
- If $q_i > Q_i^{\max}$ or $u_i > U_i^{\max}$, the system scales out (up to a per-stage cap $c_i^{\max}$).
- If $q_i < Q_i^{\min}$ and $u_i < U_i^{\min}$, the system scales in.
- Thresholds are customizable per stage to reflect each stage's criticality to the SLO; a threshold-based sketch follows.
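A threshold-based autoscaling decision might look like the following sketch; the threshold values and the one-replica step size are placeholders:

```python
def autoscale_decision(queue_len: int, utilization: float, replicas: int,
                       q_max: int = 32, u_max: float = 0.85,
                       q_min: int = 2, u_min: float = 0.30,
                       cap: int = 16, floor: int = 1) -> int:
    """Return the new replica count for one stage given queue/utilization signals."""
    if (queue_len > q_max or utilization > u_max) and replicas < cap:
        return replicas + 1          # scale out, bounded by the per-stage cap
    if queue_len < q_min and utilization < u_min and replicas > floor:
        return replicas - 1          # scale in, but keep a minimum warm replica
    return replicas                  # hold steady between the thresholds

# Example: a hot stage under load grows by one replica.
print(autoscale_decision(queue_len=50, utilization=0.92, replicas=4))   # -> 5
```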
3. Stage Isolation: Impact on KV-Cache and Performance
A principal advantage of Cortex’s stage isolation is efficient GPU KV-cache utilization:
- Shared Pool: Each GPU must host contexts for all prompt variants (e.g., both SQL generation and error-fixing), leading to an aggregate context size ≈ 8 GB/GPU.
- Stage-Isolated Pools: Each GPU holds only the context for its assigned stage (≈4 GB per GPU per role), with roles partitioned across different replicas; total resident memory is thereby reduced by ~40%.
- Consequences:
- Effective GPU batch size doubles (from 4 to 8 sequences).
- Beam width increases without host spillover.
- In NL2SQL loops, end-to-end throughput rises by 25%.
- 95th-percentile latency falls by 15–20 ms in benchmark workloads.
This demonstrates that strict memory partitioning, rather than opportunistic KV context sharing, is crucial under diverse agentic workloads.
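To make the batch-size arithmetic explicit, here is a hedged back-of-envelope sketch: the 8 GB and 4 GB resident contexts come from the comparison above, while the total KV budget and per-sequence KV footprint are assumed values chosen only to reproduce the 4-to-8-sequence doubling:

```python
def effective_batch(gpu_kv_budget_gb: float, resident_context_gb: float,
                    per_sequence_kv_gb: float) -> int:
    """Sequences that fit after reserving memory for resident prompt contexts."""
    free_gb = max(gpu_kv_budget_gb - resident_context_gb, 0.0)
    return int(free_gb // per_sequence_kv_gb)

# Assumed: a 12 GB KV budget and ~1 GB of KV per in-flight sequence.
shared_pool   = effective_batch(12.0, resident_context_gb=8.0, per_sequence_kv_gb=1.0)  # -> 4
isolated_pool = effective_batch(12.0, resident_context_gb=4.0, per_sequence_kv_gb=1.0)  # -> 8
print(shared_pool, isolated_pool)
```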
4. Multi-Tiered Cache and Workflow Scheduling Paradigms
Cortex outlines both implemented and forward-looking multi-tiered caching and scheduling:
Multi-Tiered Agentic State Cache
- Tier 1: Engine-local KV cache (in-GPU, prompt activations).
- Tier 2: Workflow-wide shared, in-memory publish/subscribe cache (e.g., for intermediate SQL results or schema embeddings).
- Tier 3: Persistent backing store (long-term, cold-state objects).
- The design enables “publish” and “subscribe” primitives for agents, allowing subsequent tasks to retrieve agentic state and avoid recomputation, particularly in concurrent or batched workflows.
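A toy sketch of the publish/subscribe primitives over Tiers 2 and 3; the class and method names (`AgentStateCache`, `publish`, `subscribe`) are assumptions, and the persistent tier is stubbed with a dictionary:

```python
import collections

class AgentStateCache:
    """Toy two-tier agentic state cache: in-memory dict (Tier 2) over a stub persistent store (Tier 3).

    Tier 1 (the engine-local KV cache) lives inside the serving engine and is not modeled here.
    """
    def __init__(self, persistent_store=None):
        self.memory = {}                                  # Tier 2: workflow-wide shared memory
        self.persistent = persistent_store or {}          # Tier 3: stand-in for a durable store
        self.subscribers = collections.defaultdict(list)  # key -> callbacks waiting for that state

    def publish(self, key: str, value) -> None:
        """Make an intermediate result (e.g., a SQL result set) visible to later stages."""
        self.memory[key] = value
        self.persistent[key] = value
        for callback in self.subscribers.pop(key, []):
            callback(value)                               # wake tasks waiting on this state

    def subscribe(self, key: str, callback) -> None:
        """Invoke `callback` with the value now if cached, or later when it is published."""
        if key in self.memory:
            callback(self.memory[key])
        elif key in self.persistent:
            self.memory[key] = self.persistent[key]       # promote cold state into Tier 2
            callback(self.memory[key])
        else:
            self.subscribers[key].append(callback)

# Example: an error-fixer stage reuses the executor's output without recomputation.
cache = AgentStateCache()
cache.subscribe("req-1/sql_result", lambda v: print("fixer got:", v))
cache.publish("req-1/sql_result", {"error": "syntax error near 'FORM'"})
```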
Workflow Scheduling: Descriptive Pseudocode
The orchestrator and schedulers operate as follows:
Main Loop:
- For each request $r$:
- Compile its call graph $G_r$ and initialize $\text{slack}_r$ to the user SLO.
- Estimate the per-operator selectivity $s_o$.
- While $G_r$ is incomplete and $\text{slack}_r > 0$:
- Enqueue ready operators with priority $k_o$.
- Update $\text{slack}_r$ by subtracting the elapsed time.
Stage-Pool Scheduler:
- If local queue nonempty and engine idle:
- Dequeue the highest-$k_o$ task and dispatch it to the least-loaded, cache-warm replica.
- If queue or utilization cross thresholds:
- Trigger autoscaler, scaling out/in as needed.
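A runnable, single-threaded rendering of this pseudocode; the function signature, the way `execute` reports the successor taken, and the priority key all follow the illustrative reconstructions above rather than Cortex's actual interfaces:

```python
import heapq, time

def run_workflow(graph: dict, entry: str, slack_s: float,
                 selectivity: dict, service_est: dict, execute) -> bool:
    """Toy rendering of the orchestrator + stage-scheduler loop above.

    `graph[op]` lists successors, `selectivity[op]` is the probability the workflow
    continues past op, `service_est[op]` is its expected service time in seconds, and
    `execute(op)` runs the operator and returns the successor actually taken (or None).
    """
    ready = [entry]
    while ready and slack_s > 0:
        # Enqueue ready operators with priority k_o = s_o * t_o / slack.
        queue = []
        for op in ready:
            k = selectivity.get(op, 1.0) * service_est.get(op, 1.0) / max(slack_s, 1e-3)
            heapq.heappush(queue, (-k, op))        # min-heap: negate so the highest k pops first
        ready, started = [], time.monotonic()
        while queue:                               # stage-pool side: most urgent task first
            _, op = heapq.heappop(queue)
            nxt = execute(op)                      # real system: dispatch to a cache-warm replica
            if nxt is not None:
                ready.append(nxt)
        slack_s -= time.monotonic() - started      # charge elapsed time against the slack
    return not ready                               # True if the workflow completed within slack
```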
5. Agent-Native Extensions: Malleability and Speculation
Cortex proposes future extensions for agent-oriented serving:
Malleable Resource Management
- Each workflow stage exposes tunable “malleability envelope” parameters (model size, parallelism, retry budget).
- When slack shrinks or cluster load spikes:
- Downshift to lightweight LLMs (e.g., swap 70B for 7B).
- Prune speculative branches or reduce retry loops.
- Limit chain-of-thought (truncate decoding).
- Conversely, under light load, orchestrator can allocate heavier resources (multi-GPUs, broader batches) for low-criticality branches.
- Requires a uniform service-manifest interface declaring each stage’s malleability and cost profile.
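One way such a manifest could look, as a sketch; the envelope fields, model names, and downshift thresholds are placeholders aligned with the knobs listed above:

```python
from dataclasses import dataclass

@dataclass
class MalleabilityEnvelope:
    """Per-stage knobs the orchestrator may turn under SLO pressure (illustrative fields)."""
    model_options: tuple = ("llm-70b", "llm-7b")   # ordered heavy -> light (placeholder names)
    max_retries: int = 3
    max_decode_tokens: int = 1024                  # cap on chain-of-thought / decoding length
    max_parallelism: int = 4                       # replicas or tensor-parallel degree available

def downshift(env: MalleabilityEnvelope, slack_s: float, cluster_load: float) -> dict:
    """Pick a cheaper operating point when slack is tight or the cluster is hot."""
    pressured = slack_s < 5.0 or cluster_load > 0.85          # thresholds are placeholders
    return {
        "model": env.model_options[-1] if pressured else env.model_options[0],
        "retries": 1 if pressured else env.max_retries,
        "decode_tokens": 256 if pressured else env.max_decode_tokens,
    }

print(downshift(MalleabilityEnvelope(), slack_s=2.0, cluster_load=0.6))  # tight slack -> light model
```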
Speculative Execution of Workflow Branches
- Orchestrator predicts likely next actions and pre-warms (or even pre-executes) associated stage pools (e.g., top-K SQL repairs).
- Correct speculation leads to latency savings up to the speculated step’s execution time; incorrect speculations are bounded by admission and speculative-budget controls.
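A sketch of budget-bounded speculation: pre-execute up to a budget of the top-K predicted repairs in the background and discard whatever the authoritative outcome does not need. The predictor, executor, and budget policy here are stand-ins:

```python
from concurrent.futures import ThreadPoolExecutor

def speculate_repairs(predict_top_k, run_repair, k: int, budget: int) -> dict:
    """Pre-execute up to `budget` of the top-k predicted SQL repairs in the background.

    `predict_top_k(k)` returns candidate repair prompts; `run_repair(prompt)` executes one.
    Returns a dict mapping each speculated prompt to its in-flight future.
    """
    candidates = predict_top_k(k)[:budget]          # admission control caps speculative work
    pool = ThreadPoolExecutor(max_workers=max(budget, 1))
    return {prompt: pool.submit(run_repair, prompt) for prompt in candidates}

def resolve(speculations: dict, actual_prompt: str):
    """If the real next step was speculated, reuse its result; otherwise drop the rest."""
    hit = speculations.get(actual_prompt)
    for prompt, fut in speculations.items():
        if prompt != actual_prompt:
            fut.cancel()                            # bound wasted work on mis-speculation
    return hit.result() if hit is not None else None
```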
A plausible implication is that, as agentic workflows become more complex—e.g., with branching, retries, and multi-tool chains—such malleability and speculation will become essential for both efficiency and consistent SLO adherence.
6. Performance Outcomes and Future Directions
Cortex demonstrates that agentic workflow structure should determine not only control flow but also physical resource partitioning, prioritization, and cache residency.
| Optimization | Metric Improvement |
|---|---|
| Stage isolation | ~40% KV-cache reduction per GPU; 2× effective batch size; +25% NL2SQL throughput; 15–20 ms lower p95 latency |
| SLO-aware prioritization | More predictable tail latency |
| Malleability/speculation | Consistent behavior under SLO pressure |
Cortex’s “prompt” to the community is to architect serving systems around workflow-call graphs, isolate each logical stage into its own compute/memory pool, and adopt stage-specific, SLO-aware allocation and speculative, malleable scheduling. By doing so, systems practitioners can deliver predictable, scalable, and efficient serving of increasingly complex agentic workloads, enabling new serving paradigms for agent-native ML architectures. The model supports evolutionary extensions toward shared-state multi-tiered caching, more flexible resource sharing, and speculative pre-execution throughout agent workflow graphs (Pagonas et al., 15 Oct 2025).