
Scepsy: Serving Multi-LLM Workflows on GPUs

Updated 4 July 2026
  • Scepsy is a serving system for agentic workflows that orchestrates multiple LLMs by dynamically assigning fractional GPU shares.
  • It aggregates per-LLM statistics into an Aggregate LLM Pipeline to estimate latency contributions and maximize throughput.
  • The system jointly optimizes GPU fractions, tensor parallelism degrees, and replica counts to meet target workflow throughput while minimizing latency.

Scepsy is a serving system for agentic workflows (applications that orchestrate multiple LLMs and tools to solve complex tasks), designed to run on GPU clusters. It is a control-plane and placement system that sits between agentic frameworks such as LangChain, LangGraph, Autogen, and Camel on one side and LLM engines such as vLLM and SGLang on the other. Its stated goal: given a target workflow throughput, decide how to slice and place GPUs across all LLMs involved so that the throughput target is met and end-to-end workflow latency is minimized, without requiring a fixed agent framework or a static workflow graph (Wagenländer et al., 16 Apr 2026).

1. System role and problem setting

Scepsy targets workflows that use multiple heterogeneous LLMs, have dynamic, data-dependent control flow, and often contain more LLM components than there are GPUs, leading to GPU oversubscription (Wagenländer et al., 16 Apr 2026). The motivating cases include branches, loops, fan-out, recursion, and tool calls, which together with variable generation lengths make execution paths difficult to predict. The system is explicitly designed for settings with two coupled trade-offs: adding replicas increases throughput but reduces the GPU resources available to each replica, while raising the tensor parallelism degree reduces latency but consumes more GPU capacity per replica.

The system is positioned against two broad classes of existing approaches. One class optimizes each LLM independently, through autoscaling or multiplexing, while ignoring cross-LLM coupling in workflows. The other class assumes a specific workflow programming model and focuses on request-level scheduling, leaving GPU allocation to users (Wagenländer et al., 16 Apr 2026). Scepsy instead performs joint allocation across many LLMs in the same workflow, including fractional GPU shares, tensor parallelism degrees, and replica counts.

A central design constraint is framework agnosticism. Scepsy cannot assume a fixed graph or DSL, because workflows are written in many frameworks and sometimes express dynamic control flow in normal code. This leads directly to its statistical workflow model rather than an explicit graph-based one.

2. Statistical workflow model and stable execution shares

Scepsy intentionally does not require a static graph representation. A workflow $w$ is treated as a stream of workflow-level requests arriving at rate $\lambda_w$, where each request generates a set of LLM-level requests to various LLMs $m \in \mathcal{M}$ with arbitrary control flow (Wagenländer et al., 16 Apr 2026). Instead of modeling control flow explicitly as a DAG or recursive graph, Scepsy extracts per-LLM aggregate statistics over many workflow executions:

  • average number of invocations per workflow request for LLM $m$: $n_m$
  • average request-level parallelism for LLM $m$: $p_m$

These statistics are derived from traces: invocations are counted per workflow request, and parallelism is inferred from overlapping request timestamps. Because $n_m$ and $p_m$ are measured empirically, branching is averaged into $n_m$, loops and recursion are captured in $n_m$, and parallel calls or fan-out are captured in $p_m$ (Wagenländer et al., 16 Apr 2026). Tools appear in traces but are discarded in the performance model.
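The trace-derived statistics can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `LLMCall` record and both helper functions are hypothetical, and the definition of parallelism used here (summed call durations divided by the wall-clock time during which at least one call to that LLM is in flight) is an assumption about how "overlapping timestamps" could be turned into a number.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class LLMCall:
    workflow_id: str  # ID of the workflow-level request that issued the call
    model: str        # model ID recorded in the trace
    start: float      # request start timestamp (seconds)
    end: float        # response end timestamp (seconds)


def busy_time(intervals):
    """Length of the union of (start, end) intervals."""
    total, cur_start, cur_end = 0.0, None, None
    for s, e in sorted(intervals):
        if cur_end is None or s > cur_end:
            if cur_end is not None:
                total += cur_end - cur_start
            cur_start, cur_end = s, e
        else:
            cur_end = max(cur_end, e)
    if cur_end is not None:
        total += cur_end - cur_start
    return total


def workflow_stats(calls):
    """Return {model: (avg calls per workflow request, avg parallelism)}."""
    per_wf = defaultdict(lambda: defaultdict(list))
    for c in calls:
        per_wf[c.workflow_id][c.model].append((c.start, c.end))
    num_workflows = len(per_wf)
    counts = defaultdict(int)
    work = defaultdict(float)  # summed call durations per model
    busy = defaultdict(float)  # wall-clock time with at least one call in flight
    for by_model in per_wf.values():
        for model, ivs in by_model.items():
            counts[model] += len(ivs)
            work[model] += sum(e - s for s, e in ivs)
            busy[model] += busy_time(ivs)
    return {
        m: (counts[m] / num_workflows, work[m] / busy[m] if busy[m] > 0 else 1.0)
        for m in counts
    }
```

Averaging over all traced workflow requests is what folds branches and loops into the invocation count: a branch taken half the time contributes half a call on average.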

The induced LLM-level arrival rate is

$$\lambda_m = n_m \cdot \lambda_w$$

which encodes the average number of calls to LLM $m$ generated by each workflow request (Wagenländer et al., 16 Apr 2026).

The key empirical insight is that end-to-end workflow latency is highly variable across requests, but the fraction of total execution time spent on each LLM is much more stable across requests. In the paper’s beam search example, absolute LLM-level and workflow-level latencies vary widely, while the relative share of time per LLM per request is markedly more concentrated; the paper describes these “relative distributions” as “up to [factor lost in extraction] more stable” (Wagenländer et al., 16 Apr 2026). The stated interpretation is that workflows exhibit structured coupling between LLMs: for example, in beam search, generator expansions are followed by scoring via the verifier, so both LLMs scale in workload together as the search tree grows.

This suggests that exact control-flow reconstruction is not necessary for allocation decisions, provided that aggregate per-LLM shares of total execution time and invocation counts remain stable enough under the target workload.
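One way to check this stability claim on a trace is to compare the spread of absolute per-LLM time with the spread of per-LLM time shares. The sketch below uses the coefficient of variation as the spread measure, which is an assumption (the paper's stability metric is not given here), and the helper names are hypothetical.

```python
from statistics import mean, pstdev


def cv(values):
    """Coefficient of variation: population standard deviation over mean."""
    return pstdev(values) / mean(values)


def share_stability(per_request_times):
    """per_request_times: list of {model: seconds spent on that model}.

    Returns {model: (cv of absolute time, cv of time share)}.
    """
    models = list(per_request_times[0])
    result = {}
    for m in models:
        absolute = [r[m] for r in per_request_times]
        shares = [r[m] / sum(r.values()) for r in per_request_times]
        result[m] = (cv(absolute), cv(shares))
    return result


# Made-up illustrative numbers, not measurements from the paper.
toy = [
    {"generator": 1.0, "verifier": 2.0},
    {"generator": 2.0, "verifier": 4.2},
    {"generator": 4.0, "verifier": 7.6},
]
stability = share_stability(toy)
```

In the toy input both LLMs grow together from request to request, so absolute times spread widely while each LLM's share of the total stays close to constant.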

3. Aggregate LLM Pipeline abstraction

The Aggregate LLM Pipeline is Scepsy’s main abstraction. For each workflow $w$, Scepsy constructs a conceptual pipeline whose stages correspond to unique LLMs used in the workflow. The abstraction aggregates dynamic control-flow behavior into per-LLM statistics $(n_m, p_m)$ and per-LLM performance profiles $L_m$ (latency) and $T_m^{\max}$ (maximum throughput) under various GPU configurations, then expresses each LLM’s contribution in workflow-level units (Wagenländer et al., 16 Apr 2026).

The workflow-level latency at arrival rate $\lambda_w$ is approximated as

$$L_w(\lambda_w) \approx \sum_{m \in \mathcal{M}} \frac{n_m}{p_m} \, L_m(\lambda_m), \qquad \lambda_m = n_m \cdot \lambda_w$$

where $L_m(\lambda_m)$ is the average per-request latency of LLM $m$ at arrival rate $\lambda_m$, and the factor $n_m / p_m$ converts LLM-level latency into workflow-level latency contribution by treating $n_m$ calls with average parallelism $p_m$ as $n_m / p_m$ sequential units of work (Wagenländer et al., 16 Apr 2026).

The maximum sustainable workflow throughput is approximated as

$$T_w^{\max} \approx \min_{m \in \mathcal{M}} \frac{T_m^{\max}}{n_m}$$

where $T_m^{\max}$ is the maximum LLM-level throughput for model $m$ under a particular GPU allocation (Wagenländer et al., 16 Apr 2026). Since each workflow request consumes $n_m$ LLM calls on average, the bottleneck LLM-level throughput is divided by $n_m$.
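The two approximations above reduce to a few lines of arithmetic. The following sketch assumes each stage carries its average invocation count, its average parallelism, a profiled latency function of the LLM-level arrival rate, and a profiled maximum throughput. The `Stage` type and all numbers are hypothetical and only illustrate the sum-of-contributions latency and the bottleneck throughput.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Stage:
    n: float                              # avg calls per workflow request
    p: float                              # avg request-level parallelism
    latency_at: Callable[[float], float]  # avg latency (s) at an LLM-level rate (req/s)
    max_throughput: float                 # max sustainable LLM-level rate (req/s)


def workflow_latency(stages: Dict[str, Stage], workflow_rate: float) -> float:
    """Sum over stages of (n / p) * latency at the induced LLM-level rate."""
    total = 0.0
    for s in stages.values():
        llm_rate = s.n * workflow_rate  # induced LLM-level arrival rate
        total += (s.n / s.p) * s.latency_at(llm_rate)
    return total


def max_workflow_throughput(stages: Dict[str, Stage]) -> float:
    """The bottleneck stage, expressed in workflow requests per second."""
    return min(s.max_throughput / s.n for s in stages.values())


# Made-up profiles for a two-LLM workflow.
toy_stages = {
    "generator": Stage(n=8, p=4, latency_at=lambda r: 0.5 + 0.01 * r, max_throughput=40),
    "verifier": Stage(n=8, p=8, latency_at=lambda r: 0.2 + 0.02 * r, max_throughput=24),
}
```

With these toy values, a workflow rate of 2 req/s induces 16 req/s on each LLM; the generator contributes 2 × 0.66 s and the verifier 1 × 0.52 s, and the verifier caps workflow throughput at 24 / 8 = 3 req/s.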

Several properties follow directly from the abstraction. The ordering of stages does not matter: total latency is a sum of contributions, and throughput is limited by the bottleneck LLM regardless of where it sits in the pipeline. Tool time is ignored unless the model is extended to include it, because tool calls are described as typically taking a few milliseconds, which is negligible compared to LLM invocations. The pipeline also assumes steady state, additivity of latencies, independent per-LLM profiles once an allocation is fixed, approximately linear scaling of throughput with replica count, and per-replica latency that is independent of replica count (Wagenländer et al., 16 Apr 2026).

A plausible implication is that the Aggregate LLM Pipeline functions as a compact surrogate model: prediction reduces to table lookups plus simple arithmetic rather than explicit simulation of branching, recursion, and fan-out.

4. Profiling, search, and allocation optimization

Scepsy’s profiling pipeline has four steps. First, it deploys the workflow as-is and sends a representative stream of workflow-level requests. It inserts an HTTP proxy in front of each LLM engine’s completion API and records model ID, request payload, timestamps, and workflow request ID for each LLM request (Wagenländer et al., 16 Apr 2026). These traces are used to derive structural statistics and LLM request contents rather than direct performance curves.
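A minimal sketch of the per-request record such a proxy might keep is shown below. The `TraceRecord` fields mirror the four items listed above; the `record_call` helper, the way the workflow request ID reaches the proxy, and the JSON-lines log format are all assumptions for illustration. The `model` key is the field used by OpenAI-compatible completion payloads.

```python
import json
import time
from dataclasses import asdict, dataclass


@dataclass
class TraceRecord:
    workflow_request_id: str
    model_id: str
    payload: dict
    start_ts: float
    end_ts: float


def record_call(log, workflow_request_id, payload, forward, clock=time.time):
    """Forward one completion request and append a JSON trace line to `log`.

    `forward` is any callable that sends the payload to the engine and
    returns its response; `log` is any object with an `append` method.
    """
    start = clock()
    response = forward(payload)
    end = clock()
    record = TraceRecord(
        workflow_request_id=workflow_request_id,
        model_id=payload.get("model", "unknown"),
        payload=payload,
        start_ts=start,
        end_ts=end,
    )
    log.append(json.dumps(asdict(record)))
    return response
```

Keeping the full payload is what allows the later replay step to resend realistic requests to each LLM in isolation.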

Second, it computes $n_m$ and $p_m$ from the execution traces and discards tools and non-LLM components at this step. Third, for each distinct LLM, it replays all extracted LLM-level requests independently from the rest of the workflow, over multiple arrival rates from low load to saturation, and under multiple tensor parallelism degrees $t_m$ as allowed by hardware (Wagenländer et al., 16 Apr 2026). The replay produces average latency, percentile latencies such as 50th and 95th, and maximum sustainable throughput $T_m^{\max}$. Replica count $r_m$ is not swept in profiling; instead, latency is assumed independent of the number of replicas and throughput scales approximately linearly with replica count.
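The replica assumption means a single-replica profile can be reused for any replica count. The sketch below assumes a profile stored as sorted (arrival rate, average latency) points, linear interpolation between points, and even load splitting across replicas; the function names and the interpolation choice are hypothetical.

```python
from bisect import bisect_left


def interp_latency(points, rate):
    """Linearly interpolate latency from sorted (rate, latency) profile points."""
    rates = [r for r, _ in points]
    if rate <= rates[0]:
        return points[0][1]
    if rate > rates[-1]:
        raise ValueError("rate exceeds the profiled saturation point")
    i = bisect_left(rates, rate)
    (r0, l0), (r1, l1) = points[i - 1], points[i]
    return l0 + (l1 - l0) * (rate - r0) / (r1 - r0)


def scaled_profile(points, max_throughput, replicas, rate):
    """Latency and max throughput for `replicas` copies of a profiled engine.

    Each replica is assumed to see rate / replicas, and total throughput is
    assumed to scale linearly with the replica count.
    """
    per_replica_rate = rate / replicas
    return interp_latency(points, per_replica_rate), max_throughput * replicas
```

For example, with points `[(1, 0.2), (5, 0.4), (10, 1.0)]` and two replicas at 6 req/s, each replica sees 3 req/s, so the looked-up latency is 0.3 s.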

Fourth, Scepsy synthesizes the Aggregate LLM Pipeline and uses it as the performance model queried by the GPU scheduler (Wagenländer et al., 16 Apr 2026).

The optimization problem is then: given a target workflow throughput, a GPU cluster with fixed capacity and topology, and allowed tensor parallelism degrees, search for per-LLM GPU fraction shares $f_m$, tensor parallelism degree $t_m$, and replica count $r_m$ such that the throughput constraint is met and end-to-end latency is minimized. Scepsy searches over fractional GPU shares, tensor parallelism degrees, and replica counts jointly rather than treating them as separate decisions (Wagenländer et al., 16 Apr 2026).

The search space is large, so the scheduler uses a three-stage heuristic search. It first enumerates GPU fractions across LLMs, with pruning by latency-ratio ordering, minimum fraction per LLM, upper bounds from remaining capacity, contiguous fraction allocation, and limiting tensor parallelism degree to what the hardware interconnect can support without excessive communication (Wagenländer et al., 16 Apr 2026). It then packs fractions onto physical GPUs contiguously and with reduced fragmentation. Finally, it resolves feasible combinations of tensor parallelism degree and replica count subject to divisibility and topology constraints.
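To make the shape of the problem concrete, here is a deliberately naive exhaustive search over a tiny instance. It is not the three-stage heuristic described above: it has no pruning and no physical packing, and it treats each LLM's profile as a hypothetical function from (slots per replica, tensor parallelism degree) to (latency, maximum rate per replica). It only shows the joint choice of shares, tensor parallelism, and replicas under a throughput constraint with divisibility checks.

```python
def compositions(total, parts, minimum):
    """Yield every way to split `total` slots into `parts` shares >= minimum."""
    if parts == 1:
        if total >= minimum:
            yield (total,)
        return
    for first in range(minimum, total - minimum * (parts - 1) + 1):
        for rest in compositions(total - first, parts - 1, minimum):
            yield (first,) + rest


def search(models, total_slots, target_rate, tp_degrees=(1, 2), min_slots=1):
    """models: {name: (n, p, profile)}.

    Returns (latency, {name: (slots, tp, replicas)}) or None if infeasible.
    """
    names = list(models)
    best = None
    for split in compositions(total_slots, len(names), min_slots):
        latency, config, feasible = 0.0, {}, True
        for name, slots in zip(names, split):
            n, p, profile = models[name]
            options = []
            for tp in tp_degrees:
                for replicas in range(1, slots + 1):
                    if slots % replicas or (slots // replicas) % tp:
                        continue  # divisibility constraint
                    lat, rate = profile(slots // replicas, tp)
                    if rate * replicas / n >= target_rate:
                        options.append(((n / p) * lat, tp, replicas))
            if not options:
                feasible = False
                break
            # Latency is additive, so each LLM's best option is independent.
            contribution, tp, replicas = min(options)
            latency += contribution
            config[name] = (slots, tp, replicas)
        if feasible and (best is None or latency < best[0]):
            best = (latency, config)
    return best
```

Even this toy version grows combinatorially with the number of LLMs and slots, which is the motivation for pruning the enumeration rather than exploring it exhaustively.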

The paper reports that search time remains below approximately 35 s with 16 GPUs and 10 fractions per GPU, below approximately 70 s with 3 LLMs and up to 128 GPUs in their tests, and up to approximately 1 s with 3 LLMs, 16 GPUs, and varying fractions per GPU (Wagenländer et al., 16 Apr 2026). These numbers are presented as practical for offline allocation and reconfiguration.

5. Placement, oversubscription, and runtime architecture

Scepsy explicitly supports GPU oversubscription through fractional GPU allocation and co-location. Each physical GPU is conceptually divided into a number of equal fractions, and each LLM replica is assigned one or more fractions, possibly sharing a GPU with other LLMs (Wagenländer et al., 16 Apr 2026). This is motivated by cases where giving a whole GPU per LLM would waste capacity, such as an embedding model that requires only a small fraction of a GPU.

Once an allocation has been selected, Scepsy uses a hierarchical, most-constrained-first placement heuristic. At the inter-node level, it prioritizes tensor-parallel models, placing larger tensor parallelism degrees and larger total GPU fractions first. Candidate nodes are evaluated by a balance score based on the difference between the largest and smallest per-GPU fractions after placement, with ties broken by preferring nodes with smaller remaining capacity so that larger contiguous domains remain available later (Wagenländer et al., 16 Apr 2026). After placing tensor-parallel models, non-tensor-parallel models are placed across remaining nodes, still prioritizing larger allocations first.
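The inter-node ordering and scoring can be sketched as below. The data layout (per-GPU used slots per node), the choice of the least-loaded GPUs within a candidate node, and the function name are assumptions; only the ordering rule, the balance score, and the tie-break come from the description above.

```python
def place(models, nodes, slots_per_gpu):
    """Greedy most-constrained-first placement.

    models: {name: (tp_degree, slots needed on each of the tp GPUs)}
    nodes:  {node: [used slots per GPU]}  (updated in place)
    Returns {name: (node, [gpu indices])}.
    """
    # Larger tensor parallelism first, then larger total allocation.
    order = sorted(
        models,
        key=lambda m: (models[m][0], models[m][0] * models[m][1]),
        reverse=True,
    )
    placement = {}
    for name in order:
        tp, need = models[name]
        candidates = []
        for node, used in nodes.items():
            # Assumption: try the `tp` least-loaded GPUs of the node.
            gpus = sorted(range(len(used)), key=lambda g: used[g])[:tp]
            if len(gpus) < tp or any(used[g] + need > slots_per_gpu for g in gpus):
                continue
            after = list(used)
            for g in gpus:
                after[g] += need
            balance = max(after) - min(after)  # largest minus smallest per-GPU load
            remaining = slots_per_gpu * len(after) - sum(after)
            candidates.append((balance, remaining, node, gpus, after))
        if not candidates:
            raise RuntimeError(f"no node can host {name}")
        # Lowest balance score wins; ties go to the node with less remaining capacity.
        _, _, node, gpus, after = min(candidates)
        nodes[node] = after
        placement[name] = (node, sorted(gpus))
    return placement
```

Placing the tensor-parallel models first matters because they need several GPUs on the same node at once, which becomes harder to find as small allocations fill the cluster.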

At the intra-node level, tensor-parallel slices are placed across GPUs in a NVLink domain as required, while smaller fractions are packed onto already-occupied GPUs first in order to maximize packing density and minimize fragmentation (Wagenländer et al., 16 Apr 2026). The system then generates Kubernetes deployment manifests encoding node placement and per-pod GPU index or fraction assignments via an extended NVIDIA device plugin.
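The intra-node packing preference for small fractions can be illustrated with a best-fit rule that favors GPUs already in use. This is a sketch under assumptions: the function name, the slot-based capacity model, and the exact tie-breaking are hypothetical, and tensor-parallel slices and NVLink domains are left out.

```python
def pack_fractions(requests, used, slots_per_gpu):
    """Pack single-GPU fractions onto one node's GPUs.

    requests: {model: slots needed}; used: slots already taken per GPU.
    Returns ({model: gpu index}, updated per-GPU usage).
    """
    used = list(used)
    assignment = {}
    # Larger requests first, so small ones can fill the leftover gaps.
    for model, need in sorted(requests.items(), key=lambda kv: kv[1], reverse=True):
        fits = [g for g in range(len(used)) if used[g] + need <= slots_per_gpu]
        if not fits:
            raise RuntimeError(f"no GPU has {need} free slots for {model}")
        # Occupied GPUs sort before empty ones; among them, prefer the tightest fit.
        gpu = min(fits, key=lambda g: (used[g] == 0, slots_per_gpu - used[g] - need))
        used[gpu] += need
        assignment[model] = gpu
    return assignment, used
```

Filling occupied GPUs before opening empty ones keeps whole GPUs free, which preserves room for later allocations that need large contiguous capacity.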

Runtime enforcement uses NVIDIA MPS so that each LLM process gets fixed shares of SMs and memory; this is described as spatial multiplexing rather than heavy context-switching temporal multiplexing (Wagenländer et al., 16 Apr 2026). Scepsy itself is mainly a control-plane system. In the offline or setup phase, it traces workflows, profiles each LLM, builds Aggregate LLM Pipelines, runs the scheduler, generates Kubernetes deployment manifests and device plugin configurations, and deploys vLLM or SGLang engines together with SGLang routers. In the data plane, requests are sent to the SGLang router or workflow front-end, and the underlying engine handles batching, prefix caching, and KV-cache-aware behavior (Wagenländer et al., 16 Apr 2026).
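As a rough illustration of the SM-share part of this enforcement, NVIDIA MPS reads the `CUDA_MPS_ACTIVE_THREAD_PERCENTAGE` environment variable of a client process to cap the portion of SMs it may use. The helper below is hypothetical and only builds such an environment; how Scepsy's device plugin actually configures MPS, and how it caps memory, is not specified here and is not covered by this sketch.

```python
import os


def mps_env(sm_share, base_env=None):
    """Return an environment for an engine process limited to `sm_share` of the SMs.

    sm_share is a fraction in (0, 1], for example 0.25 for a quarter of a GPU.
    """
    if not 0 < sm_share <= 1:
        raise ValueError("sm_share must be in (0, 1]")
    env = dict(os.environ if base_env is None else base_env)
    # MPS client setting: upper bound on the percentage of SMs this process may use.
    env["CUDA_MPS_ACTIVE_THREAD_PERCENTAGE"] = str(round(sm_share * 100))
    return env
```

The returned dictionary could be passed as the `env` argument of `subprocess.Popen` when launching an engine process on a GPU where the MPS control daemon is running.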

A common misconception would be to treat Scepsy as an online autoscaler. The paper instead describes it as an offline optimizer: allocations are computed for a given target throughput and then deployed, and it does not continuously re-optimize under short-term load fluctuations (Wagenländer et al., 16 Apr 2026).

6. Evaluation, limitations, and research context

The evaluation uses an on-premises cluster with 4 machines and 4 GPUs per machine, for 16 GPUs total. Each machine has an AMD EPYC 7402P CPU and 4 NVIDIA RTX A6000 GPUs connected by PCIe 4.0, with two pairs of GPUs within a node connected by 3rd-gen NVLink; nodes are interconnected with 100 Gbps InfiniBand. The software stack includes MicroK8s v1.32, vLLM 0.17, vLLM 0.2.2 for Ayo comparability, and the SGLang model gateway as router (Wagenländer et al., 16 Apr 2026).

The workloads are a RAG + reranker workflow with e5-base-v2 and LLaMA-3-8B, a beam search reasoning workflow with LLaMA-3.2-1B and LLaMA-3.1-8B PRM, and concurrent combined workloads (Wagenländer et al., 16 Apr 2026). Baselines are Kubernetes autoscaler, Aegaeon, and Ayo.

| Comparison | Throughput improvement | Latency reduction |
|---|---|---|
| vs Kubernetes autoscaler | up to 2.4× | up to 27× |
| vs Aegaeon | up to 7.3× | up to 14.1× |
| vs Ayo | up to 8.2× | not summarized as a single maximum |

For beam search versus Kubernetes autoscaler, the reported throughput improvement is up to 2.4× on 4 GPUs, 1.5× on 8 GPUs, and 1.8× on 16 GPUs, with latency reductions of 1.4×–7.6× on 4 GPUs, 1.3×–5.2× on 8 GPUs, and 1.7×–10.4× on 16 GPUs (Wagenländer et al., 16 Apr 2026). For RAG + reranker, the throughput improvement is 1.5×, 1.2×, and 1.2× for 4, 8, and 16 GPUs respectively, while latency reductions are 1.9×–14.9× on 4 GPUs, 1.2×–7.2× on 8 GPUs, and up to 27× on 16 GPUs. The reported explanation is that Kubernetes autoscaler oscillates with dynamic workflow behavior and does not reason about tensor parallelism or cross-LLM coupling.

Against Aegaeon, Scepsy reports beam-search throughput improvements of 7.3× on 4 GPUs and 6.8× on 8 GPUs, with latency improvements of 14.1× and 12.9×; for RAG + reranker, the reported throughput improvements are 2.5× and 2.9×, and the latency improvements are 3.4× and 4.5× (Wagenländer et al., 16 Apr 2026). The paper attributes this to Aegaeon lacking prefix caching and incurring pre-decode or prefill swapping overhead, while Scepsy benefits from pipeline-level optimization together with standard vLLM features.

Against Ayo, the paper states that Ayo can sometimes slightly beat Scepsy at certain points in low-throughput, latency-bound regions, but that Ayo saturates throughput as load grows while Scepsy continues to scale (Wagenländer et al., 16 Apr 2026). The reported throughput advantage for Scepsy is 3.2× on 4 GPUs and 8.2× on 8 GPUs for beam search, and 2.1× on 4 GPUs and 2.4× on 8 GPUs for RAG + reranker. The stated interpretation is that request-level scheduling is not enough when there are multiple heterogeneous LLMs and limited GPUs; GPU allocation and tensor-parallelism or replica decisions dominate throughput scaling.

The ablation study attributes the largest throughput improvement in RAG + reranker to co-location through fractional GPUs, and the primary latency reduction in beam search to running the large verifier with multi-GPU tensor parallelism (Wagenländer et al., 16 Apr 2026). When both optimizations are disabled, the system suffers higher latency and lower throughput.

The paper also states several limitations. The Aggregate LLM Pipeline assumes stationary workloads with stable $n_m$, $p_m$, and per-LLM profiles. If workflows fan out across different LLMs in parallel, the sum-of-contributions model may overestimate latency by modeling them as serial. If a single LLM is reused in different roles with bimodal workload patterns, a single profile may be inaccurate. Tool execution time is ignored, so workloads dominated by long-running tools would be mis-modeled. The evaluation is limited to 16 GPUs and models up to 8B scale, and the system does not provide continuous online adaptation (Wagenländer et al., 16 Apr 2026).
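The cross-LLM fan-out limitation is easy to see with two stages that run concurrently. In the toy sketch below each LLM is called once per workflow request with no intra-LLM parallelism; the function names and latencies are made up for illustration.

```python
def serial_estimate(stage_latencies):
    """Sum-of-contributions estimate: stages are treated as if they ran back to back."""
    return sum(stage_latencies.values())


def concurrent_latency(stage_latencies):
    """Latency if all stages start together and the request waits for the slowest one."""
    return max(stage_latencies.values())


toy_latencies = {"llm_a": 1.0, "llm_b": 1.5}  # seconds, illustrative only
overestimate = serial_estimate(toy_latencies) - concurrent_latency(toy_latencies)
```

Here the additive model predicts 2.5 s while the concurrent execution finishes in 1.5 s, so the estimate is 1.0 s too high.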

In relation to prior systems, Scepsy is described as distinct from single-LLM serving engines such as vLLM, Sarathi-Serve, NanoFlow, and PodAttention; distributed LLM systems such as DistServe, LoongServe, and Mooncake; multi-LLM multiplexing systems such as AlpaServe, MuxServe, Prism, and Aegaeon; and agentic workflow systems such as Parrot, Ayo, TokenCake, Autellix, and JITServe (Wagenländer et al., 16 Apr 2026). Its claimed novelty is the combination of framework-agnostic tracing, workflow-aware GPU allocation, a continuous resource model with fractional GPUs plus tensor parallelism and replicas, and the Aggregate LLM Pipeline as a simple but effective abstraction for arbitrary agentic workflows.
