
MLPerf LoadGen Overview

Updated 28 January 2026
  • MLPerf LoadGen is a core library that generates realistic query traffic and measures ML inference performance with strict, reproducible benchmarking standards.
  • It decouples workload generation from ML inference code, enabling easy integration with various engines while enforcing rules like tail-latency and query drop constraints.
  • The tool supports multiple deployment scenarios (SingleStream, MultiStream, Server, Offline) using minimal thread models for accurate, statistically meaningful performance metrics.

MLPerf LoadGen is the core traffic-generation and measurement library of the MLPerf Inference benchmark, designed for architecture-neutral, reproducible, and fair evaluation of diverse ML inference systems. Implemented as a standalone C++ module, LoadGen decouples workload generation from ML inference code, supporting multiple realistic arrival scenarios, rule enforcement, and computation of statistically meaningful performance metrics. LoadGen's APIs and internal constructs enable seamless integration of new inference engines and datasets while ensuring strict adherence to MLPerf benchmarking standards, including tail-latency confidence bounds, untimed preprocessing, and scenario-specific constraints (Reddi et al., 2019).

1. Architectural Principles and Design Objectives

LoadGen operates as a black-box module that centrally manages query arrival patterns, timing, and performance measurement, while participant code implements ML inference and sample pre-processing/post-processing. Key design objectives include:

  • Fairness and Repeatability: Ensures fair, architecture-agnostic measurement across disparate hardware and software stacks.
  • Separation of Concerns: LoadGen handles timing and queuing; submitter code manages inference and all aspects of post-processing.
  • Flexible Scenarios: Supports plug-in workload patterns (SingleStream, MultiStream, Server, Offline) representing common ML deployment modes.
  • Automatic Rule Enforcement: Enforces MLPerf rules—such as untimed preprocessing, strict query counts, drop-rate and tail-latency checks—at the LoadGen boundary.
  • Support for Closed and Open Divisions: Accommodates both fixed-reference and arbitrary models for broad applicability.

Internally, LoadGen employs a minimal thread model tailored to each scenario. It manages all timing and query issuance, invokes user-implemented interfaces for system-under-test (SUT) and query sample library (QSL), captures timestamps, and computes aggregate metrics and statistical validity checks upon test completion (Reddi et al., 2019).

2. Core Abstractions and API Constructs

LoadGen exposes a small set of key C++ classes and utilities that encapsulate the interaction between benchmark infrastructure and user code:

  • TestSettings: Plain-old-data structure specifying scenario, scenario-specific parameters (e.g., latency, QPS targets), minimum query and duration constraints, accuracy/performance mode flags, and PRNG seed. It can be populated programmatically or from INI-style configuration files (e.g., mlperf.conf).
  • QuerySample: Structure containing a logical sample index and globally-unique per-query identifier, dispatched to SUT implementations for processing.
  • QuerySampleResponse: Structure correlating a QuerySample with result buffers, used by SUT code to report inference completions.
  • QuerySampleLatency: Structure mapping a QuerySample to measured wall-clock latency in nanoseconds.
  • QuerySampleLibrary (QSL): Abstract interface representing the dataset or input sample space; requires methods for bulk sample loading/unloading (outside timed measurement), sample enumeration, and performance sample selection.
  • SystemUnderTest (SUT): Abstract interface encapsulating the inference system. Requires implementations for naming, non-blocking query issue (asynchronous), flushing outstanding inference, and optionally, direct reporting of latency measurements.
  • LogSettings and TestScenario: Helpers for logging output configuration and scenario flagging.

Typical integration involves subclassing QSL and SUT, configuring TestSettings, optionally adjusting LogSettings, and launching the benchmark via the StartTest function call, which orchestrates the measurement run and logging (Reddi et al., 2019).
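As a sketch, scenario overrides in the mlperf.conf-style key syntax look as follows; the specific model names and values below are illustrative, not normative:

```
# Illustrative configuration fragment: "<model>.<scenario>.<key> = <value>",
# where "*" matches any model.
*.Offline.min_duration = 60000
*.Offline.min_query_count = 1
resnet50.Server.target_qps = 2000
resnet50.Server.target_latency = 15
```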

3. Scenario Models and Their Realization

LoadGen operationalizes four canonical inference deployment modes, each with distinct query-arrival semantics, batching, and metrics:

Scenario | Query Arrival/Batching | Primary Metric
SingleStream | 1 sample/query; strictly serialized, each issued after the prior completes | 90th-percentile latency (p90)
MultiStream | N contiguous samples/query; fixed inter-query interval Δt; ≤1% drops allowed | Max number of streams subject to the drop and tail-latency bounds
Server | 1 sample/query; Poisson arrivals at rate λ = targetQps | Maximum sustainable λ with ≤1% of queries over the tail-latency bound
Offline | All performance samples in a single batch query | Throughput (samples/sec)

  • SingleStream: Implements a tight loop in a single thread, ensuring sequential query dispatch and precise latency measurement.
  • MultiStream: Uses either one thread per stream or a multiplexing driver, enforcing deadlines and managing drop fractions for “missed” query slots.
  • Server: Deploys a scheduler thread issuing single-sample queries at exponentially distributed inter-arrival intervals using Δ = −ln(U)/λ, where U ~ Uniform(0, 1).
  • Offline: Batches all samples in a single query for maximum throughput. Untimed sample loading occurs before the timed window.

Each scenario maintains counters for issued, completed, and dropped queries, with rigorous scenario-specific enforcement and rejection of non-compliant runs (Reddi et al., 2019).

4. Threading Model, Query Generation, and Workflows

LoadGen's internal threading and scheduling are strictly scenario-driven:

  • SingleStream: Single-threaded loop, with IssueQuery and blocking for callback completion per sample.
  • MultiStream: Either multiple threads (one per stream) or a controller managing deadlines and batch drops, where the per-stream “next issue time” is enforced as last issue + Δt.
  • Server: Scheduler thread samples inter-arrival times via the Poisson process, issues queries accordingly, and observes an outstanding-queries cap.
  • Offline: Single bulk IssueQuery of all performance samples.

Integration workflow proceeds as:

  1. LoadGen calls QSL::LoadSamplesToRam for performance sample indices (untimed).
  2. Timed window begins; queries are issued to the SUT according to scenario.
  3. SUT enqueues for asynchronous inference, calls QuerySamplesComplete on completion.
  4. LoadGen logs timestamps, aggregates metrics.
  5. Post-run, SUT::FlushQueries is invoked, and results logged in JSON.

This modular thread structure ensures both reproducibility and compliance with MLPerf's rigorous measurement standards (Reddi et al., 2019).

5. Metrics, Statistical Validity, and Rule Enforcement

LoadGen computes and enforces a suite of metrics and best-practice requirements:

  • Latency Metrics: Percentiles (e.g., p90 for SingleStream, p99 for Server), computed from IssueQuery-to-callback timings per sample.
  • Throughput Metrics: For Offline scenario, calculated as total performance samples divided by wall-clock duration.
  • QPS and Stream Count: Maximum sustainable query rate or largest number of concurrent streams, subject to the scenario's drop/tail-latency constraints.
  • Statistical Validity: Minimum query count N_min computed via the normal approximation to the binomial confidence interval. For example, for tail percentile T = 0.90, confidence C = 0.99, and error margin M = (1 − T)/20, with z = Φ⁻¹(1 − (1 − C)/2), N_min = z²·T(1 − T)/M², typically leading to minimums such as 24,576 queries per run.

Additional enforcement includes:

  • Untimed Preprocessing: QSL sample (un)loading is intentionally outside measurement intervals to prevent skew.
  • Randomization and Cheating Prevention: Sample selection randomized via public seed; alternate-seed testing resists hardcoded optimizations.
  • Drop/Tail-Latency Boundaries: Strict ≤1% query drop (MultiStream), 1% tail-bound misses (Server), with relaxed 3% for specific models (GNMT).
  • Caching Checks: Duplicate sample IDs in validation runs detect improper caching.

Runs that violate statistical or rule thresholds are invalidated by LoadGen and omitted from official results reporting (Reddi et al., 2019).

6. Minimal Integration and Example Usage

A minimal C++ example using the Offline scenario illustrates the required integration points:

#include <string>
#include <vector>

#include "loadgen.h"
#include "query_sample_library.h"
#include "system_under_test.h"
#include "test_settings.h"

using namespace mlperf;

// Trivial QSL: 1024 logical samples, all held in memory simultaneously.
class TrivialQSL : public QuerySampleLibrary {
 public:
  const std::string& Name() override { return name_; }
  size_t TotalSampleCount() override { return 1024; }
  size_t PerformanceSampleCount() override { return 1024; }
  void LoadSamplesToRam(const std::vector<QuerySampleIndex>& ids) override {
    // Decode/preprocess samples here (untimed).
  }
  void UnloadSamplesFromRam(const std::vector<QuerySampleIndex>& ids) override {
    // Free sample memory here.
  }
 private:
  std::string name_{"TrivialQSL"};
};

// Trivial SUT: completes every sample immediately with an empty response.
class TrivialSUT : public SystemUnderTest {
 public:
  const std::string& Name() override { return name_; }
  void IssueQuery(const std::vector<QuerySample>& samples) override {
    std::vector<QuerySampleResponse> responses;
    responses.reserve(samples.size());
    for (const auto& s : samples) {
      responses.push_back({s.id, /*data=*/0, /*size=*/0});
    }
    QuerySamplesComplete(responses.data(), responses.size());
  }
  void FlushQueries() override {}
 private:
  std::string name_{"TrivialSUT"};
};

int main() {
  TestSettings settings;
  settings.scenario = TestScenario::Offline;
  settings.mode = TestMode::PerformanceOnly;
  settings.min_duration_ms = 60000;
  settings.min_query_count = 1;

  TrivialQSL qsl;
  TrivialSUT sut;
  LogSettings log_settings;  // defaults write logs to the working directory
  StartTest(&sut, &qsl, settings, log_settings);
  return 0;
}

Compiling and linking this integration against LoadGen produces summary and detail logs (plus JSON accuracy output in accuracy mode) reporting samples/sec, scenario, and run validity. This minimal workflow demonstrates LoadGen's requirement for explicit QSL/SUT subclassing, event-driven query-complete notification, and configuration via TestSettings (Reddi et al., 2019).
