Vidur Framework: LLM Inference Simulation
- Vidur is a high-fidelity simulation framework designed to estimate and optimize production-scale LLM inference using a hybrid of experimental profiling and predictive modeling.
- It decomposes the LLM inference process into model onboarding, runtime performance estimation, and multi-tier scheduling, providing detailed metrics on latency, throughput, and resource usage.
- Vidur-Search automates configuration search over parallelism, batching, and scheduling parameters, reducing empirical tuning efforts and achieving significant cost savings.
Vidur is a high-fidelity, large-scale simulation framework designed to estimate and optimize the performance of LLM inference deployments. Leveraging a hybrid of experimental profiling and predictive modeling, Vidur predicts key metrics—including latency, throughput, and resource utilization—across a broad configuration space defined by system-level parameters such as parallelization, batching, and scheduling. The framework enables efficient configuration search and cost optimization through Vidur-Search, dramatically reducing the hardware and time resources required for empirical performance tuning in production-scale LLM deployments (Agrawal et al., 2024).
1. Architectural Overview and Core Components
Vidur decomposes the LLM inference stack into three hierarchical stages:
- Model Onboarding and Profiling: Accepts a declarative model specification encompassing attributes such as layer count, embedding size, and attention mechanisms. Vidur extracts a minimal set of LLM compute operators: token-level kernels (e.g., GEMMs, elementwise activations), sequence-level kernels (attention), and communication kernels (All-Reduce, All-Gather, Send/Recv), which are subject to empirical profiling.
- Runtime Performance Estimation: Utilizing profiling results, Vidur trains lightweight random-forest regressors to model operator latency and throughput as functions of operator parameters and parallelism settings. These regressors are distilled into per-operator lookup tables or cost models for fast runtime queries during simulation.
- Event-driven Simulation and Hierarchical Scheduling: Vidur's multi-tier scheduler consists of:
- A global scheduler that allocates incoming requests to LLM service replicas (with support for round-robin, least-loaded, and stateful routing).
- A replica scheduler responsible for batching (with plug-in policies including vLLM, Orca, Sarathi-Serve, FasterTransformer, and LightLLM), KV-Cache management, and microbatch formation.
- A pipeline-stage scheduler that schedules microbatches across devices per parallelism strategy (e.g., synchronous pipeline parallelism).
The simulator generates detailed metrics at operator, request, and cluster levels, including Time-to-First-Token (TTFT), Time-Between-Tokens (TBT), end-to-end latency, FLOP utilization, and memory usage.
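The event-driven simulation core described above can be sketched as follows. This is a minimal illustration with hypothetical names (`Simulator`, `Event`, the handler signatures), not Vidur's actual API: events are popped in timestamp order, and each handler may schedule further events, which is how the arrival → batching → execution chain forms.

```python
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class Event:
    time: float
    kind: str = field(compare=False)   # e.g. "arrival", "batch_done"
    payload: dict = field(compare=False, default_factory=dict)

class Simulator:
    def __init__(self):
        self.queue: list[Event] = []
        self.now = 0.0

    def schedule(self, delay: float, kind: str, **payload):
        heapq.heappush(self.queue, Event(self.now + delay, kind, payload))

    def run(self, handlers):
        # Pop events in time order; each handler can push new events,
        # chaining request arrival through batching to completion.
        while self.queue:
            ev = heapq.heappop(self.queue)
            self.now = ev.time
            handlers[ev.kind](self, ev)

sim = Simulator()
log = []
sim.schedule(0.0, "arrival", req=1)
sim.run({
    "arrival": lambda s, ev: (log.append((s.now, "arrival")),
                              s.schedule(0.5, "batch_done", **ev.payload)),
    "batch_done": lambda s, ev: log.append((s.now, "done")),
})
```

Because the clock only advances when an event fires, simulated hours of cluster time cost only as much wall-clock time as the number of events processed.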
2. System Knobs and Scheduling Policy Modeling
Vidur explicitly models key system parameters ("knobs") relevant to LLM inference optimization:
- Parallelization: Tensor Parallel (TP) splits, Pipeline Parallel (PP) stages, and replication. Device mapping per operator is inferred automatically from the model specification.
- Batching Techniques: Maximum batch size, batching time window, prefill vs. decode token interleaving.
- Scheduling Policies: Eager prefill (maximizing throughput), decode-prioritizing (minimizing latency), hybrid (Sarathi-Serve), and toggles for kernel fusion and CUDA-graphs.
These modeling choices allow accurate simulation of state-of-the-art LLM serving systems and benchmarking of deployment strategies.
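For illustration, the knobs above could be grouped into a single deployment configuration object; the field names and defaults below are assumptions for this sketch, not Vidur's actual schema.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DeploymentConfig:
    tensor_parallel: int = 1       # TP shards per replica
    pipeline_parallel: int = 1     # PP stages per replica
    num_replicas: int = 1
    max_batch_size: int = 128
    scheduler: str = "sarathi"     # e.g. "vllm", "orca", "sarathi"
    prefill_chunk_size: int = 512  # tokens per prefill chunk (hybrid scheduling)

    def gpus_required(self) -> int:
        # Each replica spans TP * PP devices.
        return self.tensor_parallel * self.pipeline_parallel * self.num_replicas

cfg = DeploymentConfig(tensor_parallel=4, pipeline_parallel=2, num_replicas=2)
```

Treating a configuration as one immutable value makes the search space in Section 6 a simple enumeration over such objects.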
3. Operator Performance Modeling
Vidur's operator modeling consists of an empirical-profiling and predictive-modeling pipeline:
- Profiling: Token-level operators (e.g., matrix multiplies) are profiled over a grid of (total_tokens, TP_shards) using CUPTI instrumentation on a single large GPU. Sequence-level operators (attention) are split into prefill and decode regimes:
- Prefill attention cost for a batch scales with the sum of squared prompt lengths, which Vidur collapses into a single equivalent prompt length for profiling.
- Decode attention is profiled as a function of the total KV-Cache size and TP shard count.
- Communication kernels (All-Reduce, All-Gather) are profiled with respect to data size and system topology.
- Predictive Modeling: Separate random-forest regressors are trained per operator class, mapping input features (e.g., prompt length, KV-Cache size, batch size, and TP shard count) to predicted operator latency.
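As a stand-in for the trained regressors distilled into per-operator lookup tables, the sketch below interpolates profiled operator timings over a one-dimensional grid. The class name and grid values are made up for illustration; a real table would also be keyed on shard count and operator class.

```python
import bisect

class OpLatencyModel:
    def __init__(self, grid):
        # grid: sorted list of (num_tokens, latency_ms) profiling samples
        self.xs = [x for x, _ in grid]
        self.ys = [y for _, y in grid]

    def predict(self, num_tokens: float) -> float:
        # Clamp outside the profiled range, interpolate linearly inside it.
        if num_tokens <= self.xs[0]:
            return self.ys[0]
        if num_tokens >= self.xs[-1]:
            return self.ys[-1]
        i = bisect.bisect_left(self.xs, num_tokens)
        x0, x1 = self.xs[i - 1], self.xs[i]
        y0, y1 = self.ys[i - 1], self.ys[i]
        return y0 + (y1 - y0) * (num_tokens - x0) / (x1 - x0)

# Illustrative grid for a token-level GEMM at a fixed TP degree.
mlp_gemm = OpLatencyModel([(128, 0.8), (512, 2.0), (2048, 7.4)])
```

The point of the distillation is speed: the simulator issues millions of latency queries per run, so each prediction must be a table lookup rather than a full model inference.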
4. End-to-End Inference Performance Estimation
Vidur integrates operator models into a discrete-event simulation to compute request- and system-level performance:
- Per-iteration Latency: Each simulated iteration's latency is the sum of the predicted latencies of its constituent operators.
- Prefill Phase: For request $r$, $T_{\text{prefill}}(r) = \sum_{op} \hat{T}_{op}$, evaluated at the request's prompt length and batch composition.
- Decode Phase: For each decode step $t$, $T_{\text{decode}}(r, t) = \sum_{op} \hat{T}_{op}$, evaluated at the current KV-Cache size and batch composition.
- End-to-End Latency: $T_{\text{e2e}}(r) = T_{\text{queue}}(r) + T_{\text{prefill}}(r) + \sum_{t} T_{\text{decode}}(r, t)$.
- Throughput and Capacity:
With traffic governed by static or Poisson arrival processes, Vidur identifies the maximum sustainable arrival rate $\lambda_{\max}$ at which the latency constraint is still met (e.g., tail queueing delay below a threshold), which defines system capacity.
- Cost Metrics:
Throughput per dollar is computed as the sustainable request rate $\lambda_{\max}$ divided by the cluster's hourly dollar cost.
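The latency and cost composition above can be checked with a small numerical sketch; all per-step numbers here are made up for illustration.

```python
def end_to_end_latency(queue_delay_ms: float, prefill_ms: float,
                       decode_ms_per_step: list[float]) -> float:
    # End-to-end latency = queueing delay + prefill iteration
    # + sum of all decode iterations.
    return queue_delay_ms + prefill_ms + sum(decode_ms_per_step)

def qps_per_dollar(capacity_qps: float, cluster_cost_per_hour: float) -> float:
    # Throughput per dollar: sustainable QPS divided by hourly cluster cost.
    return capacity_qps / cluster_cost_per_hour

# A request with 5 ms queueing, a 40 ms prefill, and 20 decode steps of 9 ms:
t = end_to_end_latency(5.0, 40.0, [9.0] * 20)   # 5 + 40 + 180 = 225 ms
```

Note that TTFT is the queueing delay plus prefill time, while TBT is the per-step decode latency, so both headline metrics fall out of the same decomposition.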
5. Fidelity Validation
Vidur has been validated on four LLMs (LLaMA2-7B, LLaMA2-70B, InternLM-20B, Qwen-72B) across three workload traces:
- Static Trace Fidelity: The median execution-time prediction error is below 2.5% and the tail error below 3.33% (excluding queueing delay).
- Dynamic Trace Fidelity: With Poisson arrivals at 85% of simulated capacity, the median normalized latency error is below 3% and the tail error is under 5%. At 95% of capacity, the larger models stay within 5% error, while the 7B model exhibits up to 12.7% error, attributed to cascading CPU overheads. Across all configurations, end-to-end latency prediction error remains under 9%.
6. Vidur-Search: Automated Optimization
Vidur-Search automates constrained configuration search over the LLM inference configuration space:
- Inputs: Model, workload trace, latency SLOs (e.g., TTFT and TBT targets), GPU types, and replica count per GPU.
- Search Space: Parallelism (TP degree, PP depth, and replication factor), maximum batch size, chunk size, scheduler, and hardware (A100/H100).
- Optimization Objective: Maximize QPS per dollar while satisfying the latency SLOs.
- Algorithm:
- Enumerate all valid configurations.
- For each configuration $c$, determine its capacity $\lambda_{\max}(c)$ by binary search: simulate arrival rates with Vidur until the latency metric (e.g., queueing delay) just meets its threshold.
- Discard configurations that violate the SLOs at $\lambda_{\max}(c)$.
- Return the configuration maximizing QPS per dollar and, if desired, the Pareto frontier between TTFT and TBT.
Because every evaluation runs as a CPU-only simulation, the total optimization cost is the number of simulated (configuration, arrival-rate) pairs multiplied by the per-simulation CPU cost.
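The enumerate-then-binary-search procedure can be sketched as follows. Here `simulate_p99_latency` is a hypothetical stand-in for a full Vidur simulation run at a given arrival rate, and latency is assumed to grow monotonically with the arrival rate.

```python
def find_capacity(simulate_p99_latency, slo_ms: float,
                  lo: float = 0.0, hi: float = 100.0, iters: int = 30) -> float:
    # Binary-search the highest arrival rate whose simulated tail
    # latency still meets the SLO.
    for _ in range(iters):
        mid = (lo + hi) / 2
        if simulate_p99_latency(mid) <= slo_ms:
            lo = mid          # SLO met: try a higher rate
        else:
            hi = mid          # SLO violated: back off
    return lo

def search(configs, slo_ms: float):
    # Enumerate configurations, score each by QPS per dollar at capacity,
    # and return the best one.
    best, best_score = None, 0.0
    for cfg in configs:
        cap = find_capacity(cfg["simulate"], slo_ms)
        score = cap / cfg["cost_per_hour"]   # QPS per dollar
        if score > best_score:
            best, best_score = cfg["name"], score
    return best, best_score
```

Retaining the per-configuration (TTFT, TBT) pairs instead of just the scalar score yields the Pareto frontier mentioned above.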
7. Case Study and Extensibility
LLaMA2-70B Deployment:
- Direct hardware-based configuration search would require approximately 42,000 GPU-hours (≈\$218,000).
- Vidur-Search, executed on a 96-core CPU node, completed the search within 1 hour (≈\$10 CPU cost).
- The optimal deployment identified (e.g., TP=4, PP=2, Sarathi-Serve, batch size=256 on H100) achieved QPS-per-dollar ≈ 0.13 QPS/\$. This corresponded to a roughly 2× cost reduction relative to a suboptimal configuration and revealed trade-offs between TTFT and TBT.
Extensibility:
- New LLMs are supported through the declarative model specification; new operators expand the profiling set.
- New hardware targets are enabled by re-profiling base kernels and retraining random-forest models.
- Custom batching and scheduling algorithms are implemented as Python callbacks (≈150 LOC per policy), supporting rapid policy prototyping.
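A plug-in batching policy might take the following shape; the callback signature and request fields below are illustrative assumptions, not Vidur's actual interface.

```python
def fcfs_max_batch(pending, max_batch_size: int, free_kv_blocks: int):
    """Admit requests first-come-first-served until the batch-size or
    KV-Cache budget is exhausted; return (batch, still_pending)."""
    batch = []
    for req in pending:
        if len(batch) >= max_batch_size or req["kv_blocks"] > free_kv_blocks:
            break
        free_kv_blocks -= req["kv_blocks"]
        batch.append(req)
    return batch, pending[len(batch):]

# Five queued requests, each needing 4 KV blocks, with room for only 3 in a batch:
pending = [{"id": i, "kv_blocks": 4} for i in range(5)]
batch, rest = fcfs_max_batch(pending, max_batch_size=3, free_kv_blocks=100)
```

Because the policy is just a pure function over the pending queue and resource budget, swapping in Orca- or Sarathi-style admission logic only changes this one callback.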
Vidur reduces months of empirical GPU experimentation to hours of CPU time while maintaining high fidelity to real-world deployment metrics. Coupled with Vidur-Search, the framework offers a systematic, cost-effective approach to configuring production-scale LLM inference (Agrawal et al., 2024).