
TuneQn: Open-Source ONNX Quantization Suite

Updated 2 July 2026
  • TuneQn is an open-source suite for multi-objective, sensitivity-driven selective quantization of ONNX models.
  • It integrates layer sensitivity analysis, candidate generation, hardware deployment, and Pareto optimization to balance accuracy, model size, and performance.
  • Evaluated on models like MobileNetV2 and ShuffleNetV2, TuneQn demonstrates significant accuracy preservation and model size reduction on varied CPU and GPU platforms.

TuneQn is an open-source suite for systematic, multi-objective selective quantization of ONNX models across CPU and GPU targets. It addresses the problem that full-model quantization frequently induces unacceptable accuracy loss or fails to run efficiently on resource-constrained hardware, while purely manual or uninformed selective quantization is non-scalable. By integrating layer-level sensitivity analysis, stepwise exclusion, deployment on target accelerators, and Pareto front optimization, TuneQn enables principled selection of quantization assignments that best trade off accuracy, model size, and performance metrics for a given backend (Louloudakis et al., 16 Jul 2025).

1. Architectural Overview and Workflow

TuneQn implements a five-stage linear pipeline: model orchestration, activation-driven layer sensitivity analysis, candidate generation via selective quantization, deployment and profiling on selected devices, and Pareto-optimal multi-objective analysis. The workflow is fully automated from ingestion of an ONNX model and a calibration/validation dataset, through layerwise analysis and candidate generation, to per-candidate empirical evaluation and selection of top-Pareto models.

Stage Modules

Module | Function | Key Outputs
--- | --- | ---
Model Orchestrator | Loads and parses ONNX graphs, extracts layer metadata | Layer list, shapes, opset, configuration
Layer Activation Analysis | Quantifies sensitivity of each layer to quantization via calibration | Layer ranking by normalized sensitivity score
Selective Quantization | Excludes the k most sensitive layers, generates candidate assignments | Partially quantized ONNX models
Runner | Deploys models on CPU (ONNX Runtime) / GPU (TVM), benchmarks | Measured accuracy, latency, storage
Metrics Benchmarking/Visualizer | Computes Pareto front and visualizes summary statistics | Plots, metrics, candidate selection

This pipeline ensures scalability, checkpointing, and direct support for hardware-specific profiling.

2. Sensitivity-Driven Selective Quantization Methodology

TuneQn formalizes selective quantization as a layer-ranking and exclusion process. The set of quantizable layers $\mathcal{L} = \{L_1, \ldots, L_N\}$ is evaluated using two activation-based error metrics under full quantization:

  • QDQ-Error: Per-layer discrepancy in "quantize-dequantize" output vs. original FP32 output across a calibration set:

$$e^{\mathrm{qdq}}_i = \frac{1}{|\mathcal{D}|} \sum_{x \in \mathcal{D}} \|\,\mathrm{QDQ}_i(x) - F_i(x)\,\|$$

  • XModel-Error: Relative error in fully quantized kernel output:

$$e^{\mathrm{x}}_i = \frac{1}{|\mathcal{D}|} \sum_{x \in \mathcal{D}} \frac{\|X_i(x) - F_i(x)\|}{\|F_i(x)\|}$$

These errors are normalized (denoted $\tilde e^{\mathrm{qdq}}_i$ and $\tilde e^{\mathrm{x}}_i$) and averaged to a combined sensitivity score $\hat{e}_i$, providing a quantitative basis for ranking and exclusion:

$$\hat{e}_i = 0.5\,\tilde e^{\mathrm{qdq}}_i + 0.5\,\tilde e^{\mathrm{x}}_i$$
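The scoring above can be sketched in a few lines of Python. This is a minimal illustration, not TuneQn's code: the source does not state which norm or which normalization is used, so the L2 norm and min-max normalization below are assumptions, and all function names are hypothetical.

```python
import math


def l2_norm(vec):
    """Euclidean norm of a flat list of floats (assumed norm; the source does not specify one)."""
    return math.sqrt(sum(v * v for v in vec))


def qdq_error(qdq_outputs, fp32_outputs):
    """Mean of ||QDQ_i(x) - F_i(x)|| over the calibration samples of one layer."""
    norms = [
        l2_norm([q - f for q, f in zip(q_out, f_out)])
        for q_out, f_out in zip(qdq_outputs, fp32_outputs)
    ]
    return sum(norms) / len(norms)


def xmodel_error(x_outputs, fp32_outputs, eps=1e-12):
    """Mean of ||X_i(x) - F_i(x)|| / ||F_i(x)|| over the calibration samples of one layer."""
    ratios = [
        l2_norm([x - f for x, f in zip(x_out, f_out)]) / max(l2_norm(f_out), eps)
        for x_out, f_out in zip(x_outputs, fp32_outputs)
    ]
    return sum(ratios) / len(ratios)


def min_max_normalize(values):
    """Scale values to [0, 1] across layers (assumed normalization scheme)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def combined_sensitivity(qdq_errors, x_errors):
    """Equal-weight average of the two normalized per-layer errors."""
    nq = min_max_normalize(qdq_errors)
    nx = min_max_normalize(x_errors)
    return [0.5 * a + 0.5 * b for a, b in zip(nq, nx)]
```

Each error list holds one value per layer, so `combined_sensitivity` returns one score per layer in the same order.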

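Candidates are generated by successively excluding the $k$ layers with the highest $\hat{e}_i$ from quantization. With layers indexed in order of decreasing $\hat{e}_i$, this yields exclusion sets $S_k = \{L_1, \ldots, L_k\}$ for $k = 0, \ldots, N$, each defining one candidate model.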
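The stepwise exclusion can be illustrated with a short Python sketch. The function name and the example layer names are hypothetical; the only logic taken from the text is "rank by descending score, then exclude the top k for each k".

```python
def exclusion_sets(scores):
    """Return the exclusion sets S_0, ..., S_N for a dict of layer name -> sensitivity score.

    S_k contains the k most sensitive layers, which stay in FP32;
    every other layer is quantized in that candidate.
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    return [ranked[:k] for k in range(len(ranked) + 1)]


# Hypothetical scores for a three-layer model.
sets = exclusion_sets({"conv1": 0.9, "conv2": 0.1, "fc": 0.5})
```

Here `sets[0]` is empty (the fully quantized model) and `sets[-1]` lists every layer (the unquantized baseline), so a model with N quantizable layers produces N + 1 candidates.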

3. Quantization Schemes and ONNX Graph Construction

TuneQn supports static INT8 quantization (weights, biases, activations with calibration), dynamic INT8 quantization (weights/biases quantized, activations quantize-dequantized at runtime), and uniform FP16 conversion. Per-candidate models are constructed by keeping the ONNX nodes that correspond to layers in $S_k$ as standard (FP32) operators and quantizing all others, using the appropriate ONNX subgraph nodes (QLinearConv, QLinearMatMul, DequantizeLinear, etc.).
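One plausible way to build such a candidate is ONNX Runtime's quantization API, which accepts a list of node names to leave unquantized. The sketch below is an assumption about how this could be done, not TuneQn's actual construction code; the helper names and file paths are hypothetical, while `quantize_dynamic`, `QuantType`, and the `nodes_to_exclude` and `per_channel` parameters are part of `onnxruntime.quantization`.

```python
def dynamic_quant_kwargs(excluded_nodes, per_channel=False):
    """Keyword arguments selecting which nodes stay in FP32 (hypothetical helper)."""
    return {"nodes_to_exclude": list(excluded_nodes), "per_channel": per_channel}


def quantize_candidate(fp32_path, out_path, excluded_nodes, per_channel=False):
    """Write a dynamically quantized INT8 model that skips the excluded nodes."""
    # Imported lazily so the helper above can be used without onnxruntime installed.
    from onnxruntime.quantization import QuantType, quantize_dynamic

    quantize_dynamic(
        fp32_path,
        out_path,
        weight_type=QuantType.QInt8,
        **dynamic_quant_kwargs(excluded_nodes, per_channel),
    )


# Example call (paths and node names are placeholders):
# quantize_candidate("model_fp32.onnx", "model_s2.onnx", ["Conv_0", "Gemm_12"])
```

Static quantization would follow the same pattern with `quantize_static`, which additionally requires a calibration data reader.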

Calibration (for static INT8) uses up to 50 images to derive min/max ranges per channel or per tensor.
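To make the role of the min/max ranges concrete, the following sketch derives a scale and zero point from an observed range using the standard affine quantization formulas. It is a generic illustration under the assumption of 8-bit affine quantization, not a description of TuneQn's calibrator; all names are hypothetical.

```python
def observe_range(samples):
    """Track the min and max activation value over calibration samples (lists of floats)."""
    lo = min(min(s) for s in samples)
    hi = max(max(s) for s in samples)
    return lo, hi


def affine_qparams(rmin, rmax, qmin=0, qmax=255):
    """Scale and zero point mapping [rmin, rmax] onto the integer range [qmin, qmax]."""
    # Widen the range to include zero so that 0.0 is exactly representable.
    rmin, rmax = min(rmin, 0.0), max(rmax, 0.0)
    if rmax == rmin:
        return 1.0, qmin
    scale = (rmax - rmin) / (qmax - qmin)
    zero_point = int(round(qmin - rmin / scale))
    return scale, max(qmin, min(qmax, zero_point))


def quantize(x, scale, zero_point, qmin=0, qmax=255):
    """Map a real value to its clamped integer code."""
    return max(qmin, min(qmax, int(round(x / scale)) + zero_point))


def dequantize(q, scale, zero_point):
    """Map an integer code back to an approximate real value."""
    return (q - zero_point) * scale
```

Per-tensor quantization computes one such pair per tensor, whereas per-channel quantization repeats the computation for each output channel.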

4. Multi-Objective Pareto Optimization Strategy

For each candidate $S_k$, empirical measurements on the validation set yield:

  • $\Delta A(S_k)$: Accuracy degradation compared to baseline FP32
  • $M(S_k)$: Model storage footprint (MB)
  • Additional metrics: Inference latency, throughput, peak device memory

TuneQn formulates the model selection as a bi-objective minimization:

$$\min_{k \in \{0, \ldots, N\}} \bigl(\Delta A(S_k),\; M(S_k)\bigr)$$

The Pareto front of non-dominated candidates is computed using nondominated sorting (Pymoo's implementation). Top-3 front solutions are presented to the user for deployment, providing explicit trade-offs.
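The non-dominated filtering step can be sketched in plain Python. This stands in for Pymoo's nondominated sorting with a simple pairwise dominance check, which is adequate for the small candidate counts involved; the candidate names and objective values are hypothetical.

```python
def dominates(a, b):
    """True if a is no worse than b in every objective and strictly better in at least one.

    Both objectives are minimized.
    """
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(candidates):
    """Return the names of candidates that no other candidate dominates.

    candidates maps a name to a tuple (accuracy_drop, size_mb).
    """
    return [
        name
        for name, objs in candidates.items()
        if not any(
            dominates(other, objs)
            for other_name, other in candidates.items()
            if other_name != name
        )
    ]


# Hypothetical measurements: (accuracy drop in points, size in MB).
front = pareto_front({
    "S0": (5.0, 3.5),
    "S1": (2.0, 4.0),
    "S2": (2.5, 4.5),
    "S3": (0.5, 6.0),
})
```

In this example `S2` is dropped because `S1` is both more accurate and smaller, while the remaining three candidates each represent a distinct accuracy/size trade-off.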

5. Deployment, Profiling, and Visualization

Candidate models are profiled on multiple hardware backends:

  • CPU: ONNX Runtime with optimized BLAS backends (MKL-DNN, OpenBLAS)
  • GPU: TVM-compiled kernels deployed directly or via remote RPC (e.g., ARM Mali-G68)

Latency, throughput, CPU memory, GPU memory, and model size are measured. Visualization includes:

  • Per-layer activation error curves $\hat{e}_i$ for sensitivity diagnosis
  • Pareto frontier plots showing accuracy/size trade-offs across all candidates
  • Exportable PNG/HTML dashboards and JSON reports of metrics, candidate sets, and front solutions for reproducible deployment
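Latency and throughput measurements of the kind listed in this section are commonly taken with a warm-up phase followed by timed iterations. The sketch below shows that generic pattern; it is not TuneQn's Runner, and the function name, defaults, and report keys are assumptions. `run_once` is any zero-argument callable that performs a single inference.

```python
import statistics
import time


def benchmark(run_once, warmup=5, iterations=50):
    """Time repeated calls to run_once and summarize latency and throughput."""
    # Warm-up calls are excluded so one-off setup costs do not skew the numbers.
    for _ in range(warmup):
        run_once()

    samples = []
    for _ in range(iterations):
        start = time.perf_counter()
        run_once()
        samples.append(time.perf_counter() - start)

    mean_s = statistics.mean(samples)
    return {
        "mean_ms": mean_s * 1e3,
        "median_ms": statistics.median(samples) * 1e3,
        "throughput_per_s": 1.0 / mean_s if mean_s > 0 else float("inf"),
    }
```

With ONNX Runtime, for example, `run_once` could wrap a call to `InferenceSession.run` on a fixed input batch, and the returned dictionary could be serialized into a JSON report.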

6. Empirical Evaluation and Results

TuneQn was benchmarked on MobileNetV2, ShuffleNetV2, EfficientNet-Lite4, and ResNet50 models with both static and dynamic INT8 quantization, deployed on Intel i5 CPUs and Mali-G68 MC4 GPUs. Validation used the ILSVRC2017 validation set (5,500 images, 300 for per-iteration accuracy).

Key results:

  • Accuracy Preservation: On ShuffleNetV2 (GPU dynamic), selective quantization reduced post-quantization accuracy loss by 54.14% versus the fully quantized model.
  • Model Size Reduction: MobileNetV2 (GPU dynamic) achieved a 72.9% reduction versus original FP32, with best trade-off candidates identified in the Pareto front.
  • Task Scalability: At most $N+1$ candidate models are generated per quantization mode, which remains efficient even for models with hundreds of layers.
  • Visualization: Diagnostics and Pareto-optimal points provide model developers with actionable insight and facilitate optimal candidate selection.

All empirical claims, equations, and tabular data are directly drawn from (Louloudakis et al., 16 Jul 2025).

7. Practical Implementation and Extensibility

TuneQn is invoked via a Python command line interface, supporting YAML/JSON configuration for model selection, quantization mode (static/dynamic), batch and chunk sizes, evaluation/calibration dataset sizes, and objectives for optimization. Results (ONNX models for each $S_k$, full per-candidate metrics, Pareto-ranked reports, and visualization plots) are stored for reproducible deployment.

Support for both per-channel and per-tensor quantization, automatic (sensitivity-based) or manual (user-specified) exclusion lists, and hardware-agnostic deployment (x86, GPU, ARM accelerators) is integral.

The tool bridges the ONNX quantization API, hardware-aware deployment (ONNX Runtime and TVM), and operationalizes multi-objective analysis within a reproducible, fully open-source stack, enabling systematic quantization tuning for edge, mobile, and server deployment scenarios (Louloudakis et al., 16 Jul 2025).

References (1)
