Vidur Framework: LLM Inference Simulation

Updated 20 January 2026
  • Vidur Framework is a high-fidelity simulation framework designed to estimate and optimize production-scale LLM inference using hybrid experimental profiling and predictive modeling.
  • It decomposes the LLM inference process into model onboarding, runtime performance estimation, and multi-tier scheduling, providing detailed metrics on latency, throughput, and resource usage.
  • Vidur-Search automates configuration search over parallelism, batching, and scheduling parameters, reducing empirical tuning efforts and achieving significant cost savings.

Vidur is a high-fidelity, large-scale simulation framework designed to estimate and optimize the performance of LLM inference deployments. Using a hybrid of experimental profiling and predictive modeling, Vidur predicts key metrics, including latency, throughput, and resource utilization, across a broad configuration space defined by system-level parameters such as parallelization, batching, and scheduling. Through Vidur-Search, the framework enables efficient configuration search and cost optimization, reducing the hardware and time required for empirical performance tuning in production-scale LLM deployments (Agrawal et al., 2024).

1. Architectural Overview and Core Components

Vidur decomposes the LLM inference stack into three hierarchical stages:

  1. Model Onboarding and Profiling: Accepts a declarative model specification encompassing attributes such as layer count, embedding size, and attention mechanisms. Vidur extracts a minimal set of LLM compute operators: token-level kernels (e.g., GEMMs, elementwise activations), sequence-level kernels (attention), and communication kernels (All-Reduce, All-Gather, Send/Recv), which are subject to empirical profiling.
  2. Runtime Performance Estimation: Utilizing profiling results, Vidur trains lightweight random-forest regressors to model operator latency and throughput as functions of operator parameters and parallelism settings. These regressors are distilled into per-operator lookup tables or cost models for fast runtime queries during simulation.
  3. Event-driven Simulation and Hierarchical Scheduling: Vidur's multi-tier scheduler consists of:
    • A global scheduler that allocates incoming requests to LLM service replicas (with support for round-robin, least-loaded, and stateful routing).
    • A replica scheduler responsible for batching (with plug-in policies including vLLM, Orca, Sarathi-Serve, FasterTransformer, and LightLLM), KV-Cache management, and microbatch formation.
    • A pipeline-stage scheduler that schedules microbatches across devices per parallelism strategy (e.g., synchronous pipeline parallelism).
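
The global tier of this hierarchy can be illustrated with a minimal sketch of two of the routing policies named above, round-robin and least-loaded. The `Replica` class, the scheduler class names, and the `route` method are hypothetical names chosen for illustration; they are not taken from Vidur's code base.

```python
from dataclasses import dataclass, field
from itertools import cycle


@dataclass
class Replica:
    """Hypothetical stand-in for one LLM service replica."""
    name: str
    pending: list = field(default_factory=list)  # queued request ids


class RoundRobinGlobalScheduler:
    """Assigns requests to replicas in a fixed rotating order."""

    def __init__(self, replicas):
        self._order = cycle(replicas)

    def route(self, request_id):
        replica = next(self._order)
        replica.pending.append(request_id)
        return replica


class LeastLoadedGlobalScheduler:
    """Assigns each request to the replica with the fewest queued requests."""

    def __init__(self, replicas):
        self._replicas = replicas

    def route(self, request_id):
        # min() returns the first replica among ties, so ties favor list order.
        replica = min(self._replicas, key=lambda r: len(r.pending))
        replica.pending.append(request_id)
        return replica


rr = RoundRobinGlobalScheduler([Replica("r0"), Replica("r1"), Replica("r2")])
rr_names = [rr.route(i).name for i in range(4)]

ll = LeastLoadedGlobalScheduler(
    [Replica("a", pending=[1, 2]), Replica("b"), Replica("c", pending=[3])]
)
ll_names = [ll.route(i).name for i in range(10, 13)]
```

Queue length is used here as the load signal only for simplicity; a real deployment might instead track outstanding tokens or KV-Cache occupancy.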

The simulator generates detailed metrics at operator, request, and cluster levels, including Time-to-First-Token (TTFT), Time-Between-Tokens (TBT), end-to-end latency, FLOP utilization, and memory usage.

2. System Knobs and Scheduling Policy Modeling

Vidur explicitly models key system parameters ("knobs") relevant to LLM inference optimization:

  • Parallelization: Tensor Parallel (TP) splits, Pipeline Parallel (PP) stages, and replication. Device mapping per operator is inferred automatically from the model specification.
  • Batching Techniques: Maximum batch size, batching time window, prefill vs. decode token interleaving.
  • Scheduling Policies: Eager prefill (maximizing throughput), decode-prioritizing (minimizing latency), hybrid (Sarathi-Serve), and toggles for kernel fusion and CUDA-graphs.

These modeling choices allow accurate simulation of state-of-the-art LLM serving systems and benchmarking of deployment strategies.
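
To make the hybrid scheduling idea concrete, the following is a simplified sketch of forming one iteration's batch under a token budget: running decodes are admitted first (one token each), and the remaining budget is filled with prefill chunks. The function name, argument layout, and the budget value are assumptions for illustration; this is not the actual Sarathi-Serve or Vidur implementation.

```python
def form_hybrid_batch(decode_ids, prefill_remaining, token_budget):
    """Build one iteration's batch: decodes first, then prefill chunks.

    decode_ids: ids of requests in the decode phase (one token per iteration).
    prefill_remaining: dict of request id -> prompt tokens still to prefill,
        in arrival order.
    token_budget: maximum number of tokens processed in this iteration.
    Returns a list of (request_id, num_tokens) pairs.
    """
    batch = []
    budget = token_budget
    # Decode-phase requests go first so that ongoing generations are not stalled.
    for rid in decode_ids:
        if budget == 0:
            break
        batch.append((rid, 1))
        budget -= 1
    # Leftover budget is spent on prefill work, splitting prompts into chunks.
    for rid, remaining in prefill_remaining.items():
        if budget == 0:
            break
        chunk = min(remaining, budget)
        batch.append((rid, chunk))
        budget -= chunk
    return batch


batch = form_hybrid_batch(["d1", "d2"], {"p1": 300, "p2": 500}, token_budget=512)
```

An eager-prefill policy would reverse the two loops, and a decode-prioritizing policy would skip the second loop while any decode is pending.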

3. Operator Performance Modeling

Vidur's operator modeling consists of an empirical-profiling and predictive-modeling pipeline:

  • Profiling: Token-level operators (e.g., matrix multiplies) are profiled over a grid of (total_tokens, TP_shards) using CUPTI instrumentation on a single large GPU. Sequence-level operators (attention) are split into prefill and decode regimes:
    • Prefill attention workloads scale with the sum of squared prompt lengths $\sum_{i=1}^{P} p_i^2$ for a batch of $P$ requests, collapsed to $L_{eq} = \sqrt{\sum_{i=1}^{P} p_i^2}$.
    • Decode attention is profiled as a function of the total KV-Cache size $V$ and TP shard count.
    • Communication kernels (All-Reduce, All-Gather) are profiled with respect to data size and system topology.
  • Predictive Modeling: Separate random-forest regressors are trained per operator class, mapping input vectors $x_{op}$ (e.g., prompt length $p$, KV-Cache bytes $V$, batch size $b$, number of shards $s$) to latency ($\hat{L}_{op}$) and throughput metrics.
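
The collapse of a prefill batch into a single equivalent length can be checked with a few lines of arithmetic. The function name below is a hypothetical one chosen for this sketch.

```python
import math


def equivalent_prefill_length(prompt_lengths):
    """Return the length whose square equals the sum of squared prompt lengths."""
    return math.sqrt(sum(p * p for p in prompt_lengths))


# A batch with prompts of 300 and 400 tokens carries the same quadratic
# attention work as one prompt of 500 tokens (300^2 + 400^2 = 500^2).
l_eq = equivalent_prefill_length([300, 400])
```

This lets the profiler sweep one scalar for prefill attention rather than every combination of prompt lengths in a batch.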

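Representative learned models include a prefill-attention latency model parameterized by $L_{eq}$ and a decode-attention latency model parameterized by $V$ and the TP shard count.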
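
Section 1 mentions that trained regressors are distilled into per-operator lookup tables for fast queries. A minimal sketch of such a table follows, assuming a one-dimensional key (for example, a token count) and linear interpolation between profiled points. The class name, the interpolation choice, and the sample latencies are all assumptions for illustration, not Vidur's actual data or API.

```python
from bisect import bisect_left


class LatencyLookupTable:
    """Maps a scalar operator parameter to a latency via linear interpolation."""

    def __init__(self, points):
        # points: iterable of (parameter_value, latency_ms) pairs
        pts = sorted(points)
        self._xs = [x for x, _ in pts]
        self._ys = [y for _, y in pts]

    def predict(self, x):
        # Clamp queries that fall outside the profiled range.
        if x <= self._xs[0]:
            return self._ys[0]
        if x >= self._xs[-1]:
            return self._ys[-1]
        hi = bisect_left(self._xs, x)
        lo = hi - 1
        x0, x1 = self._xs[lo], self._xs[hi]
        y0, y1 = self._ys[lo], self._ys[hi]
        return y0 + (x - x0) / (x1 - x0) * (y1 - y0)


# Made-up profiled points: (total tokens, latency in ms).
table = LatencyLookupTable([(128, 1.0), (256, 2.0), (512, 5.0)])
midpoint_latency = table.predict(384)
```

A table like this trades some accuracy between grid points for constant-time queries, which matters when the simulator evaluates an operator for every iteration.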

4. End-to-End Inference Performance Estimation

Vidur integrates operator models into a discrete-event simulation to compute request- and system-level performance:

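  • Per-iteration Latency:
    • Prefill Phase: For each request, the latency of the prefill iteration is computed from the predicted operator latencies.
    • Decode Phase: For each decode step, the iteration latency is likewise computed from the predicted operator latencies for that step.
  • End-to-End Latency: The prefill latency and the latencies of all subsequent decode steps are accumulated, together with any queueing delay, into the request's end-to-end latency.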
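
As a rough illustration of how per-iteration latencies roll up into request-level metrics, the sketch below derives TTFT, worst-case TBT, and end-to-end latency for one request. It assumes that the first token is emitted at the end of the prefill iteration and that TTFT includes queueing delay; the function name and the sample values (in milliseconds) are hypothetical.

```python
def request_metrics(queueing_delay_ms, prefill_latency_ms, decode_latencies_ms):
    """Aggregate iteration latencies of one request into request-level metrics."""
    ttft = queueing_delay_ms + prefill_latency_ms
    end_to_end = ttft + sum(decode_latencies_ms)
    max_tbt = max(decode_latencies_ms) if decode_latencies_ms else 0
    return {"ttft": ttft, "max_tbt": max_tbt, "end_to_end": end_to_end}


metrics = request_metrics(
    queueing_delay_ms=50,
    prefill_latency_ms=200,
    decode_latencies_ms=[30, 30, 40],
)
```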

  • Throughput and Capacity:

With traffic governed by static or Poisson processes, Vidur identifies the maximum sustainable arrival rate at which the queueing-delay constraint is still met; this rate defines system capacity.
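
The link between arrival rate and queueing delay can be seen in a toy model: Poisson arrivals feeding a single FIFO server with a fixed service time. This is a deliberately simplified stand-in for a full simulation, and the function names, the fixed service time, and the rates used are assumptions for illustration.

```python
import random


def poisson_arrivals(rate_qps, num_requests, seed=0):
    """Generate arrival timestamps with exponential inter-arrival gaps."""
    rng = random.Random(seed)
    now = 0.0
    times = []
    for _ in range(num_requests):
        now += rng.expovariate(rate_qps)
        times.append(now)
    return times


def queueing_delays(arrival_times, service_time):
    """Delay each request waits before a single FIFO server starts on it."""
    delays = []
    server_free_at = 0.0
    for arrival in arrival_times:
        start = max(arrival, server_free_at)
        delays.append(start - arrival)
        server_free_at = start + service_time
    return delays


def percentile(values, q):
    """Nearest-rank style percentile for q in [0, 1]."""
    ordered = sorted(values)
    return ordered[min(len(ordered) - 1, int(q * len(ordered)))]


# Server handles 2 requests/s; compare light load with near-saturation load.
p99_light = percentile(queueing_delays(poisson_arrivals(0.5, 2000), 0.5), 0.99)
p99_heavy = percentile(queueing_delays(poisson_arrivals(1.9, 2000), 0.5), 0.99)
```

Tail queueing delay grows sharply as the arrival rate approaches the service rate, which is why capacity is defined by a delay threshold rather than by raw throughput.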

  • Cost Metrics:

Throughput per dollar is computed as the ratio of sustained throughput (QPS) to deployment cost.

5. Fidelity Validation

Vidur has been validated on four LLMs (LLaMA2-7B, LLaMA2-70B, InternLM-20B, Qwen-72B) across three workload traces:

  • Static Trace Fidelity: Median execution-time error (P50) is less than 2.5%, and the P95 tail error is below 3.33% (excluding queueing delay).
  • Dynamic Trace Fidelity: With Poisson arrivals at 85% of simulated capacity, the median normalized latency error is below 3% and the P95 error is under 5%. At 95% capacity, larger models maintain error within 5%; the 7B model exhibits up to 12.7% error, attributed to CPU overhead cascades. Across all configurations, end-to-end latency prediction error is under 9%.
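
One plausible way to compute fidelity numbers of this kind is to compare the same percentile of the simulated and measured latency distributions. The sketch below assumes that definition; the helper names, the chosen quantiles, and the synthetic data are illustrative and do not reproduce the paper's measurements.

```python
def percentile(values, q):
    """Nearest-rank style percentile for q in [0, 1]."""
    ordered = sorted(values)
    return ordered[min(len(ordered) - 1, int(q * len(ordered)))]


def percent_error(simulated, measured):
    """Absolute relative error of a simulated value, in percent."""
    return abs(simulated - measured) / measured * 100.0


def percentile_errors(simulated, measured, quantiles=(0.5, 0.95)):
    """Percent error between matching percentiles of two latency samples."""
    return {
        q: percent_error(percentile(simulated, q), percentile(measured, q))
        for q in quantiles
    }


# Synthetic example: the simulator overestimates every latency by 2%.
measured = [float(x) for x in range(1, 101)]
simulated = [x * 1.02 for x in measured]
errors = percentile_errors(simulated, measured)
```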

6. Vidur-Search: Automated Optimization

Vidur-Search automates constrained configuration search over the LLM inference configuration space:

  • Inputs: Model, workload trace, SLOs (e.g., latency bounds on TTFT and TBT), GPU types, and replica count per GPU.
  • Search Space: Parallelism (TP degree, PP degree, replication), batch size, chunk size, scheduler, and hardware (A100/H100).
  • Optimization Objective: Maximize QPS-per-dollar while satisfying SLOs.
  • Algorithm:
    1. Enumerate all valid configurations.
    2. For each configuration, identify its capacity by binary search: simulate arrival rates using Vidur until the tail queueing delay meets the threshold.
    3. Discard configurations violating SLOs at that capacity.
    4. Return the configuration maximizing QPS-per-dollar and, if desired, the Pareto frontier between TTFT and TBT.
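
The four steps can be sketched end to end with a toy stand-in for the simulator. Everything specific here is an assumption made for illustration: the GPU prices and speeds are invented, `toy_tail_queueing_delay` and `toy_meets_slos` are hypothetical placeholders for real Vidur runs, and the search space is reduced to (TP, PP, GPU type).

```python
from itertools import product

# Made-up hourly prices and relative speeds, for illustration only.
GPU_PRICE = {"A100": 4.0, "H100": 7.0}
GPU_SPEED = {"A100": 1.0, "H100": 2.0}


def toy_tail_queueing_delay(config, qps):
    """Hypothetical stand-in for one simulation run at a given arrival rate."""
    tp, pp, gpu = config
    # Toy model: more GPUs raise peak throughput, with diminishing returns.
    peak_qps = tp * pp * GPU_SPEED[gpu] * 0.9 ** (tp + pp - 2)
    if qps >= peak_qps:
        return float("inf")
    return qps / (peak_qps - qps)


def toy_meets_slos(config, qps):
    """Hypothetical SLO check: pretend TP=1 misses a latency target."""
    tp, _, _ = config
    return tp >= 2


def find_capacity(config, simulate, delay_threshold, hi=64.0, iters=30):
    """Binary-search the highest arrival rate whose delay meets the threshold."""
    lo = 0.0
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if simulate(config, mid) <= delay_threshold:
            lo = mid
        else:
            hi = mid
    return lo


def search(simulate, meets_slos, delay_threshold):
    """Return (qps_per_dollar, config, capacity) of the best valid config."""
    best = None
    for config in product((1, 2, 4), (1, 2), ("A100", "H100")):  # step 1
        tp, pp, gpu = config
        capacity = find_capacity(config, simulate, delay_threshold)  # step 2
        if not meets_slos(config, capacity):  # step 3
            continue
        qps_per_dollar = capacity / (tp * pp * GPU_PRICE[gpu])
        if best is None or qps_per_dollar > best[0]:  # step 4
            best = (qps_per_dollar, config, capacity)
    return best


best_score, best_config, best_capacity = search(
    toy_tail_queueing_delay, toy_meets_slos, delay_threshold=1.0
)
```

Binary search is valid here because tail delay is monotone in the arrival rate; each configuration is independent, so the outer loop parallelizes trivially across CPU cores.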

Optimization cost is the compute time spent on the search, priced at the hourly rate of the hardware used.

7. Case Study and Extensibility

LLaMA2-70B Deployment:

  • Direct hardware-based configuration search would require approximately 42,000 GPU-hours (≈\$218,000).
  • Vidur-Search, executed on a 96-core CPU node, completed the search within 1 hour (≈\$10 CPU cost).
  • The optimal deployment identified (e.g., TP=4, PP=2, Sarathi-Serve, batch size=256 on H100) achieved QPS-per-dollar ≈ 0.13 QPS/\$. This corresponded to a roughly 2× cost reduction relative to a suboptimal configuration and revealed trade-offs between TTFT and TBT.
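
Taking the approximate figures quoted above at face value, the implied GPU price and the ratio between the two search costs work out as follows; both results inherit the rounding of the inputs.

$$
\frac{\$218{,}000}{42{,}000\ \text{GPU-hours}} \approx \$5.19\ \text{per GPU-hour},
\qquad
\frac{\$218{,}000}{\$10} \approx 2.2 \times 10^{4}
$$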

Extensibility:

  • New LLMs are supported through the declarative model specification; new operators expand the profiling set.
  • New hardware targets are enabled by re-profiling base kernels and retraining random-forest models.
  • Custom batching and scheduling algorithms are implemented as Python callbacks (≈150 LOC per policy), supporting rapid policy prototyping.
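
A callback-style plug-in mechanism of this kind might look like the sketch below. The registry, the decorator, and the callback signature are hypothetical and are not Vidur's actual extension API; requests are represented as `(request_id, prompt_length)` tuples purely for illustration.

```python
from typing import Callable, Dict, List, Tuple

Request = Tuple[int, int]  # (request_id, prompt_length), illustrative only
BatchPolicy = Callable[[List[Request], int], List[Request]]

POLICIES: Dict[str, BatchPolicy] = {}


def register_policy(name: str):
    """Decorator that stores a batching callback under a policy name."""
    def decorator(fn: BatchPolicy) -> BatchPolicy:
        POLICIES[name] = fn
        return fn
    return decorator


@register_policy("fcfs")
def fcfs(waiting: List[Request], max_batch_size: int) -> List[Request]:
    """Admit requests strictly in arrival order."""
    return waiting[:max_batch_size]


@register_policy("shortest_first")
def shortest_first(waiting: List[Request], max_batch_size: int) -> List[Request]:
    """Admit the requests with the shortest prompts first."""
    return sorted(waiting, key=lambda r: r[1])[:max_batch_size]


waiting = [(1, 900), (2, 100), (3, 400)]
fcfs_batch = POLICIES["fcfs"](waiting, 2)
sjf_batch = POLICIES["shortest_first"](waiting, 2)
```

Keeping policies as plain functions behind a name lookup means a new policy can be compared against existing ones by changing a single configuration string.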

Vidur enables reduction of months-long empirical GPU experimentation to hours on CPU resources, with high fidelity to real-world deployment metrics. Coupled with Vidur-Search, the framework offers a systematic, cost-effective approach to production-scale LLM inference configuration (Agrawal et al., 2024).
