Goodput-Optimized LLM Serving
- Goodput-optimized LLM serving is a method that maximizes user-visible token delivery by optimizing the entire end-to-end inference pipeline under realistic hardware and concurrency constraints.
- It integrates workload-aware scheduling, dynamic resource allocation, and low-level memory and batching optimizations to achieve significant latency reduction and throughput gains.
- Empirical results show multi-fold improvements in both end-to-end latency and aggregate capacity, outperforming traditional throughput- or latency-centric strategies.
A goodput-optimized LLM serving system is designed to maximize the delivery of user-visible, SLO-compliant tokens or requests per unit time, under realistic concurrency and hardware constraints, by addressing bottlenecks across the entire end-to-end inference pipeline. This approach integrates workload-aware scheduling, system co-design, dynamic resource allocation, and low-level memory optimizations to ensure that useful tokens—those meeting strict service-level objectives (SLOs)—are delivered at maximal sustainable rates, as opposed to mere aggregate throughput. Contemporary research demonstrates that goodput optimization outperforms throughput- or latency-centric strategies by explicitly modeling practical constraints such as gateway overhead, model-parallelism trade-offs, GPU and KV-cache utilization, and variable request lengths, yielding substantial (often multi-fold) improvements in both user-perceived latency and aggregate service capacity (Yao et al., 2024, Liao et al., 26 Nov 2025, Zhong et al., 2024, Hu et al., 6 Jun 2025).
1. Definition of Goodput and Core Metrics
Goodput, in the context of LLM serving, is formally defined as the effective rate of user-visible output (either tokens per second or SLO-satisfied requests per second), net of all pipeline, network, and engine overheads:
$$G = \frac{N_{\text{tokens}}}{\Delta t}$$
where $N_{\text{tokens}}$ is the number of generated tokens and $\Delta t$ is the wall-clock interval.
Standard LLM serving systems further subdivide latency into:
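- End-to-End Latency (Avg Latency): $L_{\text{e2e}} = t_{\text{out}} - t_{\text{arr}}$, from request arrival at the gateway ($t_{\text{arr}}$) to the first token's arrival or full-output delivery ($t_{\text{out}}$).
- Engine Latency: Model inference portion, $L_{\text{engine}}$.
- Gateway Latency: Pre- and post-inference, $L_{\text{gateway}} = L_{\text{e2e}} - L_{\text{engine}}$.
- Streaming-Specific Metrics:
  - Time to First Token (TTFT): $\text{TTFT} = t_1 - t_{\text{arr}}$, where $t_1$ is the delivery time of the first output token.
  - Time Between Tokens (TBT): $\text{TBT}_i = t_{i+1} - t_i$, the gap between consecutive output tokens.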
Goodput under SLOs is the total tokens (or complete requests) delivered within all per-token deadlines (TTFT, TBT, and time per output token, TPOT) and per-request deadlines, divided by elapsed time (Yao et al., 2024, Wang et al., 2024, Liao et al., 26 Nov 2025). Only work that meets application-defined deadlines is counted as "good" work.
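The difference between raw throughput and SLO-constrained goodput can be made concrete with a small Python sketch. The `RequestTrace` type, its field names, and the SLO thresholds are illustrative assumptions, and the sketch uses one possible counting rule (all tokens of a request count only if that request meets both its TTFT and TBT deadlines); other definitions count at per-token granularity.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RequestTrace:
    """Timestamps in seconds for one streamed request (hypothetical record layout)."""
    arrival: float              # request reaches the gateway
    token_times: List[float]    # wall-clock delivery time of each output token


def ttft(trace: RequestTrace) -> float:
    """Time to first token: first token delivery minus gateway arrival."""
    return trace.token_times[0] - trace.arrival


def max_tbt(trace: RequestTrace) -> float:
    """Largest gap between consecutive tokens (0.0 for single-token outputs)."""
    gaps = [b - a for a, b in zip(trace.token_times, trace.token_times[1:])]
    return max(gaps, default=0.0)


def goodput_tokens_per_s(traces, window_s, ttft_slo_s, tbt_slo_s):
    """Tokens from SLO-compliant requests divided by the measurement window."""
    good_tokens = sum(
        len(t.token_times)
        for t in traces
        if t.token_times and ttft(t) <= ttft_slo_s and max_tbt(t) <= tbt_slo_s
    )
    return good_tokens / window_s


traces = [
    RequestTrace(arrival=0.0, token_times=[0.10, 0.12, 0.14, 0.16]),  # meets both SLOs
    RequestTrace(arrival=0.0, token_times=[0.50, 0.52, 0.54]),        # TTFT too slow
    RequestTrace(arrival=0.2, token_times=[0.30, 0.32, 0.60]),        # one TBT gap too large
]
raw_throughput = sum(len(t.token_times) for t in traces) / 1.0
goodput = goodput_tokens_per_s(traces, window_s=1.0, ttft_slo_s=0.2, tbt_slo_s=0.05)
print(raw_throughput, goodput)
```

In this toy window, ten tokens are delivered but only four come from a request that met both deadlines, so goodput is well below raw throughput.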
2. End-to-End Pipeline Structure and Bottlenecks
Modern LLM serving is composed of two major system layers:
A. Gateway Layer (Replica Router)
- Responsibilities: HTTP/gRPC ingress, authentication and rate limiting, payload (JSON/protobuf) parsing, replica assignment, metrics.
- At low concurrency, overhead is insignificant; at high QPS (≥ 64), CPU-bound bottlenecks (Python GIL, serialization, authentication) and network IO can saturate the gateway, inflating tail latency and suppressing goodput.
B. Inference Engine Layer (Per-Replica)
- Responsibilities: Request batching, token generation (attention, MLP), key-value (KV) cache management, GPU resource orchestration.
- Bottlenecks:
  - GPU under-utilization due to suboptimal or highly variable batch sizes.
  - KV-cache fragmentation: inefficient memory allocation can sharply limit maximal batch sizes and introduce allocation thrashing.
  - Inferior model parallelism selection (tensor parallelism, expert parallelism, or hybrid), leading to imbalanced compute and bandwidth utilization.
  - Overheads from (de-)serialization and inter-GPU communication for distributed models.
This pipeline reflects the insight that inefficiencies in the gateway or data preprocessing (not just within the LLM engine itself) directly throttle goodput under sustained load (Yao et al., 2024).
3. Systemic and Algorithmic Optimizations for Goodput
3.1 Gateway Optimizations
- Rust/Tokio/Axum + gRPC (Tonic & Protobuf): Migration away from Python/FastAPI eliminates GIL-imposed contention, allowing asynchronous, multi-core scaling. End-to-end connection handshake and serialization per request drop from ≈1 ms to ≈100 μs, halving gateway overhead (Yao et al., 2024).
- Task Pipelining: Tokio's async runtime lets CPU-bound tasks (auth, JSON parsing, rate limiting) run concurrently with network IO.
- Batching and Connection Pooling: By pooling gRPC channels, the setup cost is amortized, benefiting bursts of incoming requests.
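The amortization argument behind connection pooling can be illustrated with a toy model. The sketch below is plain Python, not the Tonic or gRPC API: `FakeChannel` and `ChannelPool` are hypothetical stand-ins, and the setup counter simply models a handshake cost that is paid once per channel rather than once per request.

```python
import itertools


class FakeChannel:
    """Hypothetical stand-in for a gRPC channel; counts how often setup is paid."""
    setups = 0

    def __init__(self, target: str):
        FakeChannel.setups += 1      # models the handshake cost, paid once per channel
        self.target = target

    def call(self, payload: str) -> str:
        return f"{self.target}:{payload}"


class ChannelPool:
    """Fixed-size pool that hands out pre-opened channels round-robin."""

    def __init__(self, target: str, size: int):
        self._channels = [FakeChannel(target) for _ in range(size)]
        self._next = itertools.cycle(self._channels)

    def call(self, payload: str) -> str:
        return next(self._next).call(payload)


pool = ChannelPool("replica-0", size=4)
responses = [pool.call(f"req{i}") for i in range(100)]
print(FakeChannel.setups, len(responses))
```

A burst of 100 requests pays for four channel setups instead of 100, which is the effect the pooling bullet describes.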
3.2 Inference Engine Optimizations
- Parallelism Tuning (TP vs. EP vs. Hybrid):
  - At low concurrency, high-degree tensor parallelism (e.g., TP₈) minimizes single-request latency.
  - At high concurrency, hybrid configurations (e.g., EP₂–TP₄) may sacrifice per-request latency for higher aggregate throughput, enabling more replicas. Throughput gains of 1.5× are observed for Mixture-of-Experts architectures (Yao et al., 2024).
- Adaptive Quantization (FP16/FP8): Reduces both model footprint and memory-bandwidth demand, delivering 20–30% lower inference latency with negligible accuracy loss (Yao et al., 2024).
- PagedAttention + FlashAttention: PagedAttention achieves fine-grained KV-cache block allocation with minimal waste; FlashAttention fuses matmul and softmax, raising stream throughput by 1.3× and preventing out-of-memory errors for long contexts (Yao et al., 2024, Kwon et al., 2023).
- Continuous Batching with MaxUtil Scheduler: Dynamically groups in-flight requests into maximally sized batches under memory constraints. This yields up to 30% higher GPU utilization at peak QPS, but may incur queue backpressure if not tuned.
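The interaction between block-granular KV-cache accounting and continuous batching can be sketched as a toy scheduler loop. This is a simplified illustration under stated assumptions, not the scheduler of ScaleLLM or vLLM: the block size, the `Request` fields, the greedy admission rule, and the preempt-newest policy are all hypothetical, and prefill cost is not modeled.

```python
from collections import deque
from dataclasses import dataclass
from math import ceil

BLOCK_TOKENS = 16  # tokens per KV-cache block (illustrative value)


@dataclass
class Request:
    rid: int
    prompt_tokens: int
    max_new_tokens: int
    generated: int = 0

    def blocks_needed(self) -> int:
        """Blocks that must be resident to decode this request's next token."""
        return ceil((self.prompt_tokens + self.generated + 1) / BLOCK_TOKENS)


def step(waiting: deque, running: list, total_blocks: int) -> list:
    """Run one scheduling iteration and return the request ids decoded in it."""
    # 1. Retire requests that have produced all of their tokens.
    running[:] = [r for r in running if r.generated < r.max_new_tokens]
    # 2. If growing sequences no longer fit, preempt the most recently admitted.
    while running and sum(r.blocks_needed() for r in running) > total_blocks:
        waiting.appendleft(running.pop())
    # 3. Greedily admit queued requests while free blocks remain.
    used = sum(r.blocks_needed() for r in running)
    while waiting and used + waiting[0].blocks_needed() <= total_blocks:
        req = waiting.popleft()
        used += req.blocks_needed()
        running.append(req)
    # 4. Decode one token for every request in the batch.
    for r in running:
        r.generated += 1
    return [r.rid for r in running]


requests = [Request(i, prompt_tokens=30, max_new_tokens=4) for i in range(5)]
waiting, running, schedule = deque(requests), [], []
while waiting or running:
    batch = step(waiting, running, total_blocks=6)
    if batch:
        schedule.append(batch)
print(schedule)
```

With six blocks available, three requests are admitted at first, but one is preempted on the third iteration when the sequences cross a block boundary. That stall is the queue backpressure that an aggressively packed batch can cause when memory headroom is not tuned.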
4. Quantitative Goodput Improvements
ScaleLLM (Yao et al., 2024) reports substantial empirical gains:
- Engine-Only vs. vLLM Baseline: At low concurrency, engine latency is reduced up to 2.5×, throughput improved by 1.8–2.2×.
- End-to-End, Full Pipeline vs. vLLM Endpoint:
  - At 64 QPS: Throughput = 360 tokens/sec (ScaleLLM) vs. 85 tokens/sec (vLLM) ⇒ 4.3× speed-up.
  - Average latency reduced from ~1,100 ms to 260 ms.
- Per-Token Streaming: TTFT improved 2.9× (285 ms → 99 ms), TBT improved 4.3× (70 ms → 16.5 ms).
- Real-World Endpoints: Outperforms Fireworks AI and Together AI at high concurrency, delivering 1.5× higher total throughput (Yao et al., 2024).
These gains are a direct result of coordinated engine-side and gateway-side optimizations informed by end-to-end profiling and bottleneck analysis.
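As a quick arithmetic check, the speed-up ratios can be recomputed from the figures quoted in this section. The helper name `speedup` is illustrative. Because the quoted inputs appear to be rounded, the recomputed ratios may differ slightly from the reported multipliers.

```python
def speedup(baseline: float, improved: float, higher_is_better: bool) -> float:
    """Ratio > 1 means the improved system is better on this metric."""
    return improved / baseline if higher_is_better else baseline / improved


# (baseline, improved) pairs taken from the figures quoted in this section.
figures = {
    "throughput_tok_s": speedup(85, 360, higher_is_better=True),
    "avg_latency_ms": speedup(1100, 260, higher_is_better=False),
    "ttft_ms": speedup(285, 99, higher_is_better=False),
    "tbt_ms": speedup(70, 16.5, higher_is_better=False),
}
for name, ratio in figures.items():
    print(f"{name}: {ratio:.2f}x")
```

The throughput, average-latency, and TBT ratios all come out near 4.2×, and TTFT near 2.9×, in line with the reported values once rounding of the quoted inputs is taken into account.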
5. Trade-Offs, Limitations, and Deployment Scenarios
- Aggressive Scheduling: The MaxUtil batching policy increases average throughput but can stall new requests when the KV cache is exhausted, so tuning batch limits and back-pressure thresholds is critical in production.
- Hybrid vs. Static Parallelism: Hybrid EP₂–TP₄ is superior at high QPS, but at QPS < 16, pure TP₈ may offer lower 99th-percentile latency. Dynamic routing between topologies is recommended.
- Quantization and Paging: While highly frugal with resources, extreme quantization can necessitate task-specific fine-tuning to preserve model quality.
- Gateway Complexity: Adopting a Rust/gRPC stack requires more engineering and operational expertise, so it is best suited to teams prepared to maintain such infrastructure.
- Real-World Scope: ScaleLLM’s systemic methodology is applicable to modern, large-scale, user-facing LLM deployments—especially those facing highly concurrent, bursty workloads where per-request user latency is correlated with revenue or retention (Yao et al., 2024).
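The recommendation to route dynamically between parallelism topologies can be sketched as a load-based selector. This is a minimal illustration, not a mechanism described by the source: the `TopologyRouter` class, the sliding-window QPS estimate, and the choice of the hybrid layout for every load at or above 16 QPS are assumptions. Only the 16 QPS threshold and the two topology names come from the text.

```python
from collections import deque

LOW_QPS_THRESHOLD = 16  # below this load the text favors pure TP8


class TopologyRouter:
    """Hypothetical selector that picks a topology from recently observed load."""

    def __init__(self, window_s: float = 1.0):
        self.window_s = window_s
        self._arrivals = deque()

    def observe(self, now: float) -> None:
        """Record one request arrival and drop arrivals older than the window."""
        self._arrivals.append(now)
        while now - self._arrivals[0] > self.window_s:
            self._arrivals.popleft()

    def qps(self) -> float:
        return len(self._arrivals) / self.window_s

    def choose(self) -> str:
        # Assumption: any load at or above the threshold goes to the hybrid pool.
        return "TP8" if self.qps() < LOW_QPS_THRESHOLD else "EP2-TP4"


quiet, busy = TopologyRouter(), TopologyRouter()
for i in range(10):
    quiet.observe(i * 0.1)    # about 10 requests per second
for i in range(100):
    busy.observe(i * 0.01)    # about 100 requests per second
print(quiet.choose(), busy.choose())
```

A production selector would likely add hysteresis so that load hovering around the threshold does not cause requests to flip between replica pools.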
6. Broader Context and Comparative Approaches
- Dynamic Disaggregation: Complementary to ScaleLLM is dynamic PD-disaggregation (e.g., DOPD), which analytically optimizes the ratio of specialized prefill (P) versus decode (D) instances as load and input/output length distributions shift, yielding up to 1.5× higher goodput and reducing high-percentile latency by 20–67% over static or aggregated systems (Liao et al., 26 Nov 2025).
- Advanced Scheduling and Memory Management: Approaches such as PagedAttention (as implemented in vLLM) further reduce KV-cache overhead, supporting throughput increases of 2–4× or more by enabling flexible sharing and avoiding fragmentation (Kwon et al., 2023).
- Simulation-Guided Planning: The BestServe framework allows rapid simulation-based enumeration of resource allocation strategies under joint SLO constraints, guiding practitioners to high-goodput architectures in minutes rather than hours of benchmark-driven tuning (Hu et al., 6 Jun 2025).
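To illustrate why the prefill-to-decode instance ratio should track the input/output length distribution, the sketch below sizes each pool from a simple token-rate capacity model. This is not the DOPD algorithm or the BestServe simulator: the function `pd_instances`, the per-instance token rates, and the utilization headroom are hypothetical values chosen for illustration.

```python
from math import ceil


def pd_instances(req_per_s, mean_in_tokens, mean_out_tokens,
                 prefill_tok_per_s, decode_tok_per_s, headroom=0.8):
    """Return (prefill, decode) instance counts for a given load.

    Each pool is sized so that its offered token rate stays below
    `headroom` times the assumed per-instance capacity.
    """
    prefill_load = req_per_s * mean_in_tokens     # prompt tokens per second
    decode_load = req_per_s * mean_out_tokens     # generated tokens per second
    p = max(1, ceil(prefill_load / (prefill_tok_per_s * headroom)))
    d = max(1, ceil(decode_load / (decode_tok_per_s * headroom)))
    return p, d


# Balanced workload: moderate prompts, moderate outputs.
balanced = pd_instances(20, 1000, 200, prefill_tok_per_s=10_000, decode_tok_per_s=2_000)
# Shifted workload: long prompts, short outputs.
prompt_heavy = pd_instances(20, 3000, 100, prefill_tok_per_s=10_000, decode_tok_per_s=2_000)
print(balanced, prompt_heavy)
```

At the same request rate, the balanced workload needs a 3:3 split while the prompt-heavy workload needs 8:2, so a static ratio would leave one pool overloaded and the other idle as the length distribution shifts.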
These advances collectively shape an ecosystem wherein maximizing end-to-end goodput is achieved by holistic, multi-stage optimization, aware of both hardware-level constraints and application-centric service level objectives.
References:
- ScaleLLM: (Yao et al., 2024)
- DOPD: (Liao et al., 26 Nov 2025)
- PagedAttention/vLLM: (Kwon et al., 2023)
- DistServe: (Zhong et al., 2024)
- BestServe: (Hu et al., 6 Jun 2025)
- Smooth Goodput Metrics: (Wang et al., 2024)