
Vidur Framework: LLM Inference Simulation

Updated 20 January 2026
  • Vidur is a high-fidelity simulation framework designed to estimate and optimize production-scale LLM inference by combining experimental profiling with predictive modeling.
  • It decomposes the LLM inference process into model onboarding, runtime performance estimation, and multi-tier scheduling, providing detailed metrics on latency, throughput, and resource usage.
  • Vidur-Search automates configuration search over parallelism, batching, and scheduling parameters, reducing empirical tuning efforts and achieving significant cost savings.

Vidur is a high-fidelity, large-scale simulation framework designed to estimate and optimize the performance of LLM inference deployments. Leveraging a hybrid of experimental profiling and predictive modeling, Vidur predicts key metrics—including latency, throughput, and resource utilization—across a broad configuration space defined by system-level parameters such as parallelization, batching, and scheduling. The framework enables efficient configuration search and cost optimization through Vidur-Search, dramatically reducing the hardware and time resources required for empirical performance tuning in production-scale LLM deployments (Agrawal et al., 2024).

1. Architectural Overview and Core Components

Vidur decomposes the LLM inference stack into three hierarchical stages:

  1. Model Onboarding and Profiling: Accepts a declarative model specification encompassing attributes such as layer count, embedding size, and attention mechanisms. Vidur extracts a minimal set of LLM compute operators: token-level kernels (e.g., GEMMs, elementwise activations), sequence-level kernels (attention), and communication kernels (All-Reduce, All-Gather, Send/Recv), which are subject to empirical profiling.
  2. Runtime Performance Estimation: Utilizing profiling results, Vidur trains lightweight random-forest regressors to model operator latency and throughput as functions of operator parameters and parallelism settings. These regressors are distilled into per-operator lookup tables or cost models for fast runtime queries during simulation.
  3. Event-driven Simulation and Hierarchical Scheduling: Vidur's multi-tier scheduler consists of:
    • A global scheduler that allocates incoming requests to LLM service replicas (with support for round-robin, least-loaded, and stateful routing).
    • A replica scheduler responsible for batching (with plug-in policies including vLLM, Orca, Sarathi-Serve, FasterTransformer, and LightLLM), KV-Cache management, and microbatch formation.
    • A pipeline-stage scheduler that schedules microbatches across devices per parallelism strategy (e.g., synchronous pipeline parallelism).
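The tiered dispatch above can be pictured with a minimal sketch; the class and method names here are hypothetical illustrations, not Vidur's actual interfaces:

```python
import itertools
from collections import deque

class ReplicaScheduler:
    """Forms batches from queued requests under a max-batch-size knob (FCFS)."""
    def __init__(self, max_batch_size=8):
        self.queue = deque()
        self.max_batch_size = max_batch_size

    def enqueue(self, request):
        self.queue.append(request)

    def next_batch(self):
        # Drain up to max_batch_size requests into one batch.
        batch = []
        while self.queue and len(batch) < self.max_batch_size:
            batch.append(self.queue.popleft())
        return batch

class GlobalScheduler:
    """Routes incoming requests to replicas; round-robin shown, least-loaded
    would instead pick min(replicas, key=queue length)."""
    def __init__(self, replicas):
        self.replicas = replicas
        self._rr = itertools.cycle(replicas)

    def route(self, request):
        next(self._rr).enqueue(request)
```

A pipeline-stage scheduler would sit below the replica scheduler, splitting each batch into microbatches per the parallelism strategy.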

The simulator generates detailed metrics at operator, request, and cluster levels, including Time-to-First-Token (TTFT), Time-Between-Tokens (TBT), end-to-end latency, FLOP utilization, and memory usage.

2. System Knobs and Scheduling Policy Modeling

Vidur explicitly models key system parameters ("knobs") relevant to LLM inference optimization:

  • Parallelization: Tensor Parallel (TP) splits, Pipeline Parallel (PP) stages, and replication. Device mapping per operator is inferred automatically from the model specification.
  • Batching Techniques: Maximum batch size, batching time window, prefill vs. decode token interleaving.
  • Scheduling Policies: Eager prefill (maximizing throughput), decode-prioritizing (minimizing latency), hybrid (Sarathi-Serve), and toggles for kernel fusion and CUDA-graphs.
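These knobs can be collected into a single configuration record; the field names and defaults below are illustrative rather than Vidur's actual schema:

```python
from dataclasses import dataclass

@dataclass
class DeploymentConfig:
    """Hypothetical bundle of the system knobs a Vidur-style simulation sweeps."""
    tensor_parallel: int = 4       # TP splits
    pipeline_parallel: int = 2     # PP stages
    num_replicas: int = 1          # replication factor
    max_batch_size: int = 256
    scheduler: str = "sarathi"     # e.g. "vllm", "orca", "sarathi", "faster_transformer"
    chunk_size: int = 512          # prefill chunk for hybrid scheduling
    enable_cuda_graphs: bool = True

    @property
    def gpus_per_replica(self) -> int:
        # Each replica spans TP x PP devices.
        return self.tensor_parallel * self.pipeline_parallel
```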

These modeling choices allow accurate simulation of state-of-the-art LLM serving systems and benchmarking of deployment strategies.

3. Operator Performance Modeling

Vidur's operator modeling consists of an empirical-profiling and predictive-modeling pipeline:

  • Profiling: Token-level operators (e.g., matrix multiplies) are profiled over a grid of (total_tokens, TP_shards) using CUPTI instrumentation on a single large GPU. Sequence-level operators (attention) are split into prefill and decode regimes:
    • Prefill attention workloads scale with the sum of squared prompt lengths $\sum_{i=1}^{P} p_i^2$ for a batch of $P$ requests, collapsed to a single equivalent length $L_{eq} = \sqrt{\sum_i p_i^2}$.
    • Decode attention is profiled as a function of the total KV-Cache size $V$ and TP shard count.
    • Communication kernels (All-Reduce, All-Gather) are profiled with respect to data size and system topology.
  • Predictive Modeling: Separate random-forest regressors are trained per operator class, mapping input vectors $x_{op}$ (e.g., prompt length $p$, KV-Cache bytes $V$, batch size $b$, shard count $s$) to latency $\hat{L}_{op}$ and throughput metrics.
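The profiling-then-regression pipeline can be sketched with scikit-learn's `RandomForestRegressor`; the profiling grid, latency coefficients, and noise below are synthetic stand-ins for real CUPTI measurements:

```python
import math
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def prefill_equivalent_length(prompt_lengths):
    # Collapse a batch of prompts to L_eq = sqrt(sum p_i^2).
    return math.sqrt(sum(p * p for p in prompt_lengths))

rng = np.random.default_rng(0)
# Synthetic profiling grid: (total_tokens, tp_shards) -> latency in ms.
tokens = rng.integers(1, 4096, size=500)
shards = rng.choice([1, 2, 4, 8], size=500)
latency = 0.002 * tokens / shards + 0.05 + rng.normal(0, 0.01, size=500)

# One regressor per operator class; here a single token-level GEMM-like operator.
X = np.column_stack([tokens, shards])
model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, latency)
pred_ms = model.predict([[2048, 4]])[0]  # queried at simulation time
```

In the real system these fitted models are distilled into per-operator lookup tables so the event-driven simulator can query them cheaply.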

Representative learned models include:

$$\hat{L}_{\text{gemm}}(b, d_{\text{model}}, s) \approx \alpha_1 \frac{b \cdot d_{\text{model}}}{s} + \alpha_2 b + \alpha_3$$

$$\hat{L}_{\text{attn,prefill}}(L_{eq}, b, s) \approx \beta_1 \frac{L_{eq}^2}{s} + \beta_2 L_{eq} + \beta_3$$

4. End-to-End Inference Performance Estimation

Vidur integrates operator models into a discrete-event simulation to compute request- and system-level performance:

  • Per-iteration Latency:

    • Prefill Phase: For request $r$,

      $$L_{\text{prefill}}(r) = \hat{L}_{\text{attn,prefill}}(L_{eq}, b, s) + \sum_{op \in \text{token-level}} \hat{L}_{op}(b \cdot p_r, s) + \text{comm. costs}$$

    • Decode Phase: For each decode step $k$,

      $$L_{\text{decode},k}(r) = \hat{L}_{\text{attn,decode}}(V_k, b, s) + \sum_{op \in \text{token-level}} \hat{L}_{op}(1, s) + \text{comm. costs}$$

  • End-to-End Latency:

$$\text{Latency}(r) = \text{scheduling delay}(r) + L_{\text{prefill}}(r) + \sum_{k=1}^{N} L_{\text{decode},k}(r)$$

  • Throughput and Capacity:

With traffic governed by static or Poisson arrival processes, Vidur identifies the maximum sustainable arrival rate $\lambda^*$ satisfying $P_{99}(\text{scheduling delay}) \leq \Delta_{\text{thresh}}$ (e.g., $5~\text{s}$), which defines system capacity.

  • Cost Metrics:

$$\text{Cost}(C) = \sum_{i \in \text{replicas}} \#\text{GPUs}_i \cdot c_{GPU} \cdot T_{\text{sim}}$$

Throughput per dollar is computed as $\lambda^*/\text{Cost}(C)$.
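The latency and cost-efficiency formulas above reduce to straightforward arithmetic; the helper names below are illustrative, mirroring the formulas term by term:

```python
def request_latency(sched_delay, prefill_latency, decode_latencies):
    # Latency(r) = scheduling delay + L_prefill + sum of per-step decode latencies.
    return sched_delay + prefill_latency + sum(decode_latencies)

def qps_per_dollar(sustainable_qps, total_gpus, gpu_rate_per_hour, horizon_hours):
    # Cost(C) = #GPUs * hourly rate * horizon; efficiency = lambda* / Cost(C).
    cost = total_gpus * gpu_rate_per_hour * horizon_hours
    return sustainable_qps / cost
```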

5. Fidelity Validation

Vidur has been validated on four LLMs (LLaMA2-7B, LLaMA2-70B, InternLM-20B, Qwen-72B) across three workload traces:

  • Static Trace Fidelity: Median execution-time error ($P_{50}$) is below 2.5%, and the $P_{95}$ tail error is below 3.33% (excluding queueing delay).
  • Dynamic Trace Fidelity: With Poisson arrivals at 85% of simulated capacity, the median normalized latency error is below 3% and the $P_{95}$ error is under 5%. At 95% capacity, larger models keep error within 5%; the 7B model exhibits up to 12.7% error, attributed to CPU overhead cascades. Across all configurations, end-to-end latency prediction error is under 9%.

6. Vidur-Search: Automated Optimization

Vidur-Search automates constrained configuration search over the LLM inference configuration space:

  • Inputs: Model, workload trace, SLOs (e.g., $\text{TTFT}_{P90} \leq 2~\text{s}$, $\text{TBT}_{P99} \leq 200~\text{ms}$), GPU types, and replica count per GPU.
  • Search Space: Parallelism (TP $\in \{1,2,4\}$, PP $\in \{1,2,4\}$, replication), batch size $\in \{32,64,128,256,512\}$, chunk size, scheduler, and hardware (A100/H100).
  • Optimization Objective: Maximize QPS-per-dollar, $\lambda^*(C)/\text{Cost}(C)$, while satisfying SLOs.
  • Algorithm:
  1. Enumerate all valid configurations.
  2. For each configuration $C$, identify $\lambda^*(C)$ by binary search: simulate arrival rates with Vidur until the $P_{99}$ queueing delay meets the threshold.
  3. Discard configurations violating SLOs at $\lambda^*$.
  4. Return the configuration maximizing $\lambda^*(C)/\text{Cost}(C)$ and, optionally, the Pareto frontier between TTFT and TBT.
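The binary search in step 2 can be sketched as follows, with a stub standing in for a full Vidur simulation run; the sketch assumes the P99 queueing delay is monotone in the arrival rate:

```python
def find_capacity(p99_delay_at, threshold_s=5.0, lo=0.0, hi=1000.0, iters=30):
    """Binary-search the largest arrival rate (QPS) whose simulated P99
    queueing delay stays under the SLO threshold. `p99_delay_at` stands in
    for a simulation run at a given arrival rate."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        if p99_delay_at(mid) <= threshold_s:
            lo = mid   # rate is sustainable; probe higher
        else:
            hi = mid   # system saturates; probe lower
    return lo
```

With a real simulator each probe is one discrete-event run, so the search costs $O(\log(\text{rate range}))$ simulations per configuration.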

Optimization cost is calculated as:

$$\text{Cost}(C) = \sum_{r=1}^{R} \#\text{GPUs}_r \times \text{rate}_{\text{GPU SKU}} \times \text{simulation time}$$

7. Case Study and Extensibility

LLaMA2-70B Deployment:

  • Direct hardware-based configuration search would require approximately 42,000 GPU-hours (≈\$218,000).
  • Vidur-Search, executed on a 96-core CPU node, completed the search within 1 hour (≈\$10 CPU cost).
  • The optimal deployment identified (e.g., TP=4, PP=2, Sarathi-Serve, batch size 256 on H100) achieved ≈ 0.13 QPS/\$, roughly a 2× cost reduction relative to a suboptimal configuration, and revealed trade-offs between TTFT and TBT.
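The reported figures imply a GPU rate of roughly \$5.19 per GPU-hour and a search-cost reduction of more than four orders of magnitude, which the following arithmetic makes explicit:

```python
gpu_hours = 42_000
hw_search_cost = 218_000          # reported hardware-based search cost (USD)
sim_search_cost = 10              # reported Vidur-Search CPU cost (USD)

implied_gpu_rate = hw_search_cost / gpu_hours      # ~ $5.19 per GPU-hour
search_savings = hw_search_cost / sim_search_cost  # ~ 21,800x cheaper
```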

Extensibility:

  • New LLMs are supported through the declarative model specification; new operators expand the profiling set.
  • New hardware targets are enabled by re-profiling base kernels and retraining random-forest models.
  • Custom batching and scheduling algorithms are implemented as Python callbacks (≈150 LOC per policy), supporting rapid policy prototyping.
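A toy batching-policy callback in this spirit is shown below; the function signature is hypothetical, and Vidur's actual plug-in interface differs in detail:

```python
def fcfs_batching_policy(queue, max_batch_size, free_kv_blocks, blocks_needed):
    """Illustrative plug-in batching callback: admit queued requests FCFS
    while the batch-size knob and KV-Cache capacity both allow it."""
    batch, used = [], 0
    for req in queue:
        need = blocks_needed(req)
        if len(batch) >= max_batch_size or used + need > free_kv_blocks:
            break  # next request would exceed a limit; stop admitting
        batch.append(req)
        used += need
    return batch
```

A Sarathi-style hybrid policy would additionally split prefills into chunks and interleave them with decodes, but would plug into the same callback slot.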

Vidur enables reduction of months-long empirical GPU experimentation to hours on CPU resources, with high fidelity to real-world deployment metrics. Coupled with Vidur-Search, the framework offers a systematic, cost-effective approach to production-scale LLM inference configuration (Agrawal et al., 2024).
