Vidur Framework: LLM Inference Simulation
- Vidur is a high-fidelity simulation framework designed to estimate and optimize production-scale LLM inference using a hybrid of experimental profiling and predictive modeling.
- It decomposes the LLM inference process into model onboarding, runtime performance estimation, and multi-tier scheduling, providing detailed metrics on latency, throughput, and resource usage.
- Vidur-Search automates configuration search over parallelism, batching, and scheduling parameters, reducing empirical tuning efforts and achieving significant cost savings.
Vidur is a high-fidelity, large-scale simulation framework designed to estimate and optimize the performance of LLM inference deployments. Leveraging a hybrid of experimental profiling and predictive modeling, Vidur predicts key metrics—including latency, throughput, and resource utilization—across a broad configuration space defined by system-level parameters such as parallelization, batching, and scheduling. The framework enables efficient configuration search and cost optimization through Vidur-Search, dramatically reducing the hardware and time resources required for empirical performance tuning in production-scale LLM deployments (Agrawal et al., 2024).
1. Architectural Overview and Core Components
Vidur decomposes the LLM inference stack into three hierarchical stages:
- Model Onboarding and Profiling: Accepts a declarative model specification encompassing attributes such as layer count, embedding size, and attention mechanisms. Vidur extracts a minimal set of LLM compute operators: token-level kernels (e.g., GEMMs, elementwise activations), sequence-level kernels (attention), and communication kernels (All-Reduce, All-Gather, Send/Recv), which are subject to empirical profiling.
- Runtime Performance Estimation: Utilizing profiling results, Vidur trains lightweight random-forest regressors to model operator latency and throughput as functions of operator parameters and parallelism settings. These regressors are distilled into per-operator lookup tables or cost models for fast runtime queries during simulation.
- Event-driven Simulation and Hierarchical Scheduling: Vidur's multi-tier scheduler consists of:
- A global scheduler that allocates incoming requests to LLM service replicas (with support for round-robin, least-loaded, and stateful routing).
- A replica scheduler responsible for batching (with plug-in policies including vLLM, Orca, Sarathi-Serve, FasterTransformer, and LightLLM), KV-Cache management, and microbatch formation.
- A pipeline-stage scheduler that schedules microbatches across devices per parallelism strategy (e.g., synchronous pipeline parallelism).
The simulator generates detailed metrics at operator, request, and cluster levels, including Time-to-First-Token (TTFT), Time-Between-Tokens (TBT), end-to-end latency, FLOP utilization, and memory usage.
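The event-driven simulation core described above can be sketched as follows. This is a minimal illustration with hypothetical names (`Simulator`, `Event`, the handler signatures), not Vidur's actual API: events are popped in timestamp order, and each handler may schedule further events, which is how the arrival → batching → execution chain forms.

```python
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class Event:
    time: float
    kind: str = field(compare=False)   # e.g. "arrival", "batch_done"
    payload: dict = field(compare=False, default_factory=dict)

class Simulator:
    def __init__(self):
        self.queue: list[Event] = []
        self.now = 0.0

    def schedule(self, delay: float, kind: str, **payload):
        heapq.heappush(self.queue, Event(self.now + delay, kind, payload))

    def run(self, handlers):
        # Pop events in time order; each handler can push new events,
        # chaining request arrival through batching to completion.
        while self.queue:
            ev = heapq.heappop(self.queue)
            self.now = ev.time
            handlers[ev.kind](self, ev)

sim = Simulator()
log = []
sim.schedule(0.0, "arrival", req=1)
sim.run({
    "arrival": lambda s, ev: (log.append((s.now, "arrival")),
                              s.schedule(0.5, "batch_done", **ev.payload)),
    "batch_done": lambda s, ev: log.append((s.now, "done")),
})
```

Because the clock only advances when an event fires, simulated hours of cluster time cost only as much wall-clock time as the number of events processed.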
2. System Knobs and Scheduling Policy Modeling
Vidur explicitly models key system parameters ("knobs") relevant to LLM inference optimization:
- Parallelization: Tensor Parallel (TP) splits, Pipeline Parallel (PP) stages, and replication. Device mapping per operator is inferred automatically from the model specification.
- Batching Techniques: Maximum batch size, batching time window, prefill vs. decode token interleaving.
- Scheduling Policies: Eager prefill (maximizing throughput), decode-prioritizing (minimizing latency), hybrid (Sarathi-Serve), and toggles for kernel fusion and CUDA-graphs.
These modeling choices allow accurate simulation of state-of-the-art LLM serving systems and benchmarking of deployment strategies.
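For illustration, the knobs above could be grouped into a single deployment configuration object; the field names and defaults below are assumptions for this sketch, not Vidur's actual schema.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DeploymentConfig:
    tensor_parallel: int = 1       # TP shards per replica
    pipeline_parallel: int = 1     # PP stages per replica
    num_replicas: int = 1
    max_batch_size: int = 128
    scheduler: str = "sarathi"     # e.g. "vllm", "orca", "sarathi"
    prefill_chunk_size: int = 512  # tokens per prefill chunk (hybrid scheduling)

    def gpus_required(self) -> int:
        # Each replica spans TP * PP devices.
        return self.tensor_parallel * self.pipeline_parallel * self.num_replicas

cfg = DeploymentConfig(tensor_parallel=4, pipeline_parallel=2, num_replicas=2)
```

Treating a configuration as one immutable value makes the search space in Section 6 a simple enumeration over such objects.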
3. Operator Performance Modeling
Vidur's operator modeling consists of an empirical-profiling and predictive-modeling pipeline:
- Profiling: Token-level operators (e.g., matrix multiplies) are profiled over a grid of (total_tokens, TP_shards) using CUPTI instrumentation on a single large GPU. Sequence-level operators (attention) are split into prefill and decode regimes:
- Prefill attention cost for a batch scales with the sum of squared prompt lengths, which Vidur collapses into a single equivalent prompt length for profiling.
- Decode attention is profiled as a function of the total KV-Cache size and TP shard count.
- Communication kernels (All-Reduce, All-Gather) are profiled with respect to data size and system topology.
- Predictive Modeling: Separate random-forest regressors are trained per operator class, mapping input features (e.g., prompt length, KV-Cache size, batch size, and TP shard count) to predicted operator latency.
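As a stand-in for the trained regressors distilled into per-operator lookup tables, the sketch below interpolates profiled operator timings over a one-dimensional grid. The class name and grid values are made up for illustration; a real table would also be keyed on shard count and operator class.

```python
import bisect

class OpLatencyModel:
    def __init__(self, grid):
        # grid: sorted list of (num_tokens, latency_ms) profiling samples
        self.xs = [x for x, _ in grid]
        self.ys = [y for _, y in grid]

    def predict(self, num_tokens: float) -> float:
        # Clamp outside the profiled range, interpolate linearly inside it.
        if num_tokens <= self.xs[0]:
            return self.ys[0]
        if num_tokens >= self.xs[-1]:
            return self.ys[-1]
        i = bisect.bisect_left(self.xs, num_tokens)
        x0, x1 = self.xs[i - 1], self.xs[i]
        y0, y1 = self.ys[i - 1], self.ys[i]
        return y0 + (y1 - y0) * (num_tokens - x0) / (x1 - x0)

# Illustrative grid for a token-level GEMM at a fixed TP degree.
mlp_gemm = OpLatencyModel([(128, 0.8), (512, 2.0), (2048, 7.4)])
```

The point of the distillation is speed: the simulator issues millions of latency queries per run, so each prediction must be a table lookup rather than a full model inference.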
4. End-to-End Inference Performance Estimation
Vidur integrates operator models into a discrete-event simulation to compute request- and system-level performance:
- Per-iteration Latency: Each simulated iteration's latency is the sum of the predicted latencies of its constituent operators.
- Prefill Phase: For request $r$, $T_{\text{prefill}}(r) = \sum_{op} \hat{T}_{op}$, evaluated at the request's prompt length and batch composition.
- Decode Phase: For each decode step $t$, $T_{\text{decode}}(r, t) = \sum_{op} \hat{T}_{op}$, evaluated at the current KV-Cache size and batch composition.
- End-to-End Latency: $T_{\text{e2e}}(r) = T_{\text{queue}}(r) + T_{\text{prefill}}(r) + \sum_{t} T_{\text{decode}}(r, t)$.
- Throughput and Capacity:
With traffic governed by static or Poisson arrival processes, Vidur identifies the maximum sustainable arrival rate $\lambda_{\max}$ at which the latency constraint is still met (e.g., tail queueing delay below a threshold), which defines system capacity.
- Cost Metrics:
Throughput per dollar is computed as the sustainable request rate $\lambda_{\max}$ divided by the cluster's hourly dollar cost.
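The latency and cost composition above can be checked with a small numerical sketch; all per-step numbers here are made up for illustration.

```python
def end_to_end_latency(queue_delay_ms: float, prefill_ms: float,
                       decode_ms_per_step: list[float]) -> float:
    # End-to-end latency = queueing delay + prefill iteration
    # + sum of all decode iterations.
    return queue_delay_ms + prefill_ms + sum(decode_ms_per_step)

def qps_per_dollar(capacity_qps: float, cluster_cost_per_hour: float) -> float:
    # Throughput per dollar: sustainable QPS divided by hourly cluster cost.
    return capacity_qps / cluster_cost_per_hour

# A request with 5 ms queueing, a 40 ms prefill, and 20 decode steps of 9 ms:
t = end_to_end_latency(5.0, 40.0, [9.0] * 20)   # 5 + 40 + 180 = 225 ms
```

Note that TTFT is the queueing delay plus prefill time, while TBT is the per-step decode latency, so both headline metrics fall out of the same decomposition.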
5. Fidelity Validation
Vidur has been validated on four LLMs (LLaMA2-7B, LLaMA2-70B, InternLM-20B, Qwen-72B) across three workload traces:
- Static Trace Fidelity: The median execution-time prediction error is below 2.5% and the tail error below 3.33% (excluding queueing delay).
- Dynamic Trace Fidelity: With Poisson arrivals at 85% of simulated capacity, the median normalized latency error is below 3% and the tail error is under 5%. At 95% of capacity, the larger models stay within 5% error, while the 7B model exhibits up to 12.7% error, attributed to cascading CPU overheads. Across all configurations, end-to-end latency prediction error remains under 9%.
6. Vidur-Search: Automated Optimization
Vidur-Search automates constrained configuration search over the LLM inference configuration space:
- Inputs: Model, workload trace, latency SLOs (e.g., TTFT and TBT targets), GPU types, and replica count per GPU.
- Search Space: Parallelism (TP degree, PP depth, and replication factor), maximum batch size, chunk size, scheduler, and hardware (A100/H100).
- Optimization Objective: Maximize QPS per dollar while satisfying the latency SLOs.
- Algorithm:
- Enumerate all valid configurations.
- For each configuration $c$, determine its capacity $\lambda_{\max}(c)$ by binary search: simulate arrival rates with Vidur until the latency metric (e.g., queueing delay) just meets its threshold.
- Discard configurations that violate the SLOs at $\lambda_{\max}(c)$.
- Return the configuration maximizing QPS per dollar and, if desired, the Pareto frontier between TTFT and TBT.
Because every evaluation runs as a CPU-only simulation, the total optimization cost is the number of simulated (configuration, arrival-rate) pairs multiplied by the per-simulation CPU cost.
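The enumerate-then-binary-search procedure can be sketched as follows. Here `simulate_p99_latency` is a hypothetical stand-in for a full Vidur simulation run at a given arrival rate, and latency is assumed to grow monotonically with the arrival rate.

```python
def find_capacity(simulate_p99_latency, slo_ms: float,
                  lo: float = 0.0, hi: float = 100.0, iters: int = 30) -> float:
    # Binary-search the highest arrival rate whose simulated tail
    # latency still meets the SLO.
    for _ in range(iters):
        mid = (lo + hi) / 2
        if simulate_p99_latency(mid) <= slo_ms:
            lo = mid          # SLO met: try a higher rate
        else:
            hi = mid          # SLO violated: back off
    return lo

def search(configs, slo_ms: float):
    # Enumerate configurations, score each by QPS per dollar at capacity,
    # and return the best one.
    best, best_score = None, 0.0
    for cfg in configs:
        cap = find_capacity(cfg["simulate"], slo_ms)
        score = cap / cfg["cost_per_hour"]   # QPS per dollar
        if score > best_score:
            best, best_score = cfg["name"], score
    return best, best_score
```

Retaining the per-configuration (TTFT, TBT) pairs instead of just the scalar score yields the Pareto frontier mentioned above.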
7. Case Study and Extensibility
LLaMA2-70B Deployment:
- Direct hardware-based configuration search would require approximately 42,000 GPU-hours (≈\$218,000).
- Vidur-Search, executed on a 96-core CPU node, completed the search within 1 hour (≈\$10 CPU cost).
- The optimal deployment identified (e.g., TP=4, PP=2, Sarathi-Serve, batch size=256 on H100) achieved QPS-per-dollar ≈ 0.13 QPS/\$. This corresponded to a roughly 2× cost reduction relative to a suboptimal configuration and revealed trade-offs between TTFT and TBT.
Extensibility:
- New LLMs are supported through the declarative model specification; new operators expand the profiling set.
- New hardware targets are enabled by re-profiling base kernels and retraining random-forest models.
- Custom batching and scheduling algorithms are implemented as Python callbacks (≈150 LOC per policy), supporting rapid policy prototyping.
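A plug-in batching policy might take the following shape; the callback signature and request fields below are illustrative assumptions, not Vidur's actual interface.

```python
def fcfs_max_batch(pending, max_batch_size: int, free_kv_blocks: int):
    """Admit requests first-come-first-served until the batch-size or
    KV-Cache budget is exhausted; return (batch, still_pending)."""
    batch = []
    for req in pending:
        if len(batch) >= max_batch_size or req["kv_blocks"] > free_kv_blocks:
            break
        free_kv_blocks -= req["kv_blocks"]
        batch.append(req)
    return batch, pending[len(batch):]

# Five queued requests, each needing 4 KV blocks, with room for only 3 in a batch:
pending = [{"id": i, "kv_blocks": 4} for i in range(5)]
batch, rest = fcfs_max_batch(pending, max_batch_size=3, free_kv_blocks=100)
```

Because the policy is just a pure function over the pending queue and resource budget, swapping in Orca- or Sarathi-style admission logic only changes this one callback.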
Vidur reduces months of empirical GPU experimentation to hours of CPU time while maintaining high fidelity to real-world deployment metrics. Coupled with Vidur-Search, the framework offers a systematic, cost-effective approach to configuring production-scale LLM inference (Agrawal et al., 2024).