TuneQn: Open-Source ONNX Quantization Suite
- TuneQn is an open-source suite for multi-objective, sensitivity-driven selective quantization of ONNX models.
- It integrates layer sensitivity analysis, candidate generation, hardware deployment, and Pareto optimization to balance accuracy, model size, and performance.
- Evaluated on models like MobileNetV2 and ShuffleNetV2, TuneQn demonstrates significant accuracy preservation and model size reduction on varied CPU and GPU platforms.
TuneQn is an open-source suite for systematic, multi-objective selective quantization of ONNX models across CPU and GPU targets. It addresses the problem that full-model quantization frequently induces unacceptable accuracy loss or fails to run efficiently on resource-constrained hardware, while purely manual or uninformed selective quantization is non-scalable. By integrating layer-level sensitivity analysis, stepwise exclusion, deployment on target accelerators, and Pareto front optimization, TuneQn enables principled selection of quantization assignments that best trade off accuracy, model size, and performance metrics for a given backend (Louloudakis et al., 16 Jul 2025).
1. Architectural Overview and Workflow
TuneQn implements a five-stage linear pipeline: model orchestration, activation-driven layer sensitivity analysis, candidate generation via selective quantization, deployment and profiling on selected devices, and Pareto-optimal multi-objective analysis. The workflow is fully automated from ingestion of an ONNX model and a calibration/validation dataset, through layerwise analysis and candidate generation, to per-candidate empirical evaluation and selection of top-Pareto models.
Stage Modules
| Module | Function | Key Outputs |
|---|---|---|
| Model Orchestrator | Loads and parses ONNX graphs, extracts layer meta-data | Layer list, shapes, opset, configuration |
| Layer Activation Analysis | Quantifies sensitivity of each layer to quantization via calibration | Layer ranking by normalized sensitivity score |
| Selective Quantization | Excludes k most sensitive layers, generates candidate assignments | Partially quantized ONNX models |
| Runner | Deploys models on CPU (ONNX Runtime) / GPU (TVM), benchmarks | Measured accuracy, latency, storage |
| Metrics Benchmarking/Visualizer | Computes Pareto front and visualizes summary statistics | Plots, metrics, candidate selection |
This pipeline ensures scalability, checkpointing, and direct support for hardware-specific profiling.
2. Sensitivity-Driven Selective Quantization Methodology
TuneQn formalizes selective quantization as a layer-ranking and exclusion process. The set of quantizable layers is evaluated using two activation-based error metrics under full quantization:
- QDQ-Error: Per-layer discrepancy between the "quantize-dequantize" output and the original FP32 output, measured across a calibration set.
- XModel-Error: Relative error in the fully quantized kernel output.
These errors are normalized and averaged into a combined per-layer sensitivity score, which provides a quantitative basis for ranking and exclusion.
Candidates are generated by successively excluding the k layers with the highest sensitivity scores from quantization, yielding one partially quantized model for each value of k.
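The ranking and exclusion steps can be sketched in a few lines of Python. This is an illustrative sketch rather than TuneQn's actual code: the function names are hypothetical, min-max scaling is assumed as the normalization (the source only says the errors are "normalized"), and including the empty exclusion set (the fully quantized model) as the first candidate is also an assumption.

```python
from typing import Dict, List


def min_max_normalize(errors: Dict[str, float]) -> Dict[str, float]:
    """Scale raw per-layer errors to [0, 1]; a constant input maps to all zeros."""
    lo, hi = min(errors.values()), max(errors.values())
    span = hi - lo
    return {name: (value - lo) / span if span > 0 else 0.0
            for name, value in errors.items()}


def rank_layers(qdq_error: Dict[str, float],
                xmodel_error: Dict[str, float]) -> List[str]:
    """Return layer names ordered from most to least sensitive.

    Assumption: the combined score is the plain average of the two
    min-max-normalized error metrics.
    """
    qdq_n = min_max_normalize(qdq_error)
    xm_n = min_max_normalize(xmodel_error)
    score = {name: 0.5 * (qdq_n[name] + xm_n[name]) for name in qdq_error}
    return sorted(score, key=score.get, reverse=True)


def exclusion_sets(ranking: List[str], max_excluded: int) -> List[List[str]]:
    """Build one exclusion list per k = 0..max_excluded (k = 0 excludes nothing)."""
    return [ranking[:k] for k in range(max_excluded + 1)]


# Made-up error values for three layers, purely for illustration.
ranking = rank_layers(
    {"conv1": 0.40, "conv2": 0.10, "fc": 0.20},
    {"conv1": 0.30, "conv2": 0.05, "fc": 0.25},
)
candidates = exclusion_sets(ranking, max_excluded=2)
```

Each entry of `candidates` names the layers that would stay in FP32 for one candidate model; all remaining layers would be quantized.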
3. Quantization Schemes and ONNX Graph Construction
TuneQn supports static INT8 quantization (weights, biases, and activations quantized using calibration), dynamic INT8 quantization (weights/biases quantized, activations quantize-dequantized at runtime), and uniform FP16 conversion. Per-candidate models are constructed by keeping the ONNX nodes that correspond to the excluded layers as standard (FP32) operators and quantizing all others, using the appropriate ONNX subgraph nodes (QLinearConv, QLinearMatMul, DequantizeLinear, etc.).
Calibration (for static INT8) uses up to 50 images to derive min/max ranges per channel or per tensor.
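To make the role of the calibrated min/max ranges concrete, the sketch below derives per-tensor INT8 parameters from calibration batches and applies a quantize-dequantize round trip. It follows the standard affine (scale and zero-point) scheme and is not taken from TuneQn's source; all function names are hypothetical, and the signed range [-128, 127] is an assumption. A per-channel variant would apply the same computation to each channel separately.

```python
from typing import List, Sequence, Tuple


def calibrate_range(batches: Sequence[Sequence[float]]) -> Tuple[float, float]:
    """Return the overall (min, max) of a tensor's values across calibration batches."""
    lo = min(min(batch) for batch in batches)
    hi = max(max(batch) for batch in batches)
    return lo, hi


def affine_params(lo: float, hi: float,
                  qmin: int = -128, qmax: int = 127) -> Tuple[float, int]:
    """Map the real range [lo, hi] onto the integer range [qmin, qmax]."""
    # Widen the range to include zero so that 0.0 is exactly representable.
    lo, hi = min(lo, 0.0), max(hi, 0.0)
    scale = (hi - lo) / (qmax - qmin) if hi > lo else 1.0
    zero_point = int(round(qmin - lo / scale))
    return scale, max(qmin, min(qmax, zero_point))


def quantize_dequantize(values: Sequence[float], scale: float, zero_point: int,
                        qmin: int = -128, qmax: int = 127) -> List[float]:
    """Quantize to INT8 (with saturation) and map back to floating point."""
    out = []
    for v in values:
        q = max(qmin, min(qmax, round(v / scale) + zero_point))
        out.append((q - zero_point) * scale)
    return out


# Two toy calibration batches standing in for activations of one tensor.
lo, hi = calibrate_range([[0.0, 1.0], [0.5, 2.55]])
scale, zero_point = affine_params(lo, hi)
restored = quantize_dequantize([0.0, 1.0, 2.55], scale, zero_point)
```

For values inside the calibrated range, the round-trip error is bounded by half the scale, which is why a tight and representative calibration range matters.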
4. Multi-Objective Pareto Optimization Strategy
For each candidate model $M_k$, empirical measurements on the validation set yield:
- $\Delta A(M_k)$: Accuracy degradation compared to baseline FP32
- $S(M_k)$: Model storage footprint (MB)
- Additional metrics: Inference latency, throughput, peak device memory
TuneQn formulates the model selection as a bi-objective minimization:
$$\min_{k}\ \big(\Delta A(M_k),\ S(M_k)\big)$$
The Pareto front of non-dominated candidates is computed using nondominated sorting (Pymoo's implementation). Top-3 front solutions are presented to the user for deployment, providing explicit trade-offs.
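The idea of a non-dominated front over accuracy degradation and model size can be illustrated without any dependency. The source states that TuneQn relies on Pymoo's nondominated sorting; the plain-Python filter below is a stand-in for that step, and the `Candidate` type, the candidate names, and all numbers are made up for illustration.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    name: str
    accuracy_drop: float  # accuracy lost vs. the FP32 baseline (lower is better)
    size_mb: float        # model storage footprint in MB (lower is better)


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if `a` is no worse than `b` on both objectives and better on at least one."""
    no_worse = a.accuracy_drop <= b.accuracy_drop and a.size_mb <= b.size_mb
    strictly_better = a.accuracy_drop < b.accuracy_drop or a.size_mb < b.size_mb
    return no_worse and strictly_better


def pareto_front(candidates: List[Candidate]) -> List[Candidate]:
    """Keep every candidate that no other candidate dominates (O(n^2) scan)."""
    return [c for c in candidates
            if not any(dominates(other, c) for other in candidates)]


# Illustrative values only; they are not results from the paper.
candidates = [
    Candidate("k0", accuracy_drop=6.0, size_mb=3.6),
    Candidate("k1", accuracy_drop=4.0, size_mb=4.1),
    Candidate("k2", accuracy_drop=4.5, size_mb=4.8),   # dominated by k1
    Candidate("k3", accuracy_drop=1.5, size_mb=6.0),
    Candidate("fp32", accuracy_drop=0.0, size_mb=14.0),
]
front = pareto_front(candidates)
```

With only a handful of candidates per quantization mode, the quadratic scan is negligible; a dedicated sorting routine becomes relevant mainly when more objectives or many more candidates are involved.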
5. Deployment, Profiling, and Visualization
Candidate models are profiled on multiple hardware backends:
- CPU: ONNX Runtime with optimized BLAS backends (MKL-DNN, OpenBLAS)
- GPU: TVM-compiled kernels deployed directly or via remote RPC (e.g., ARM Mali-G68)
Latency, throughput, CPU memory, GPU memory, and model size are measured. Visualization includes:
- Per-layer activation error curves (sensitivity scores) for sensitivity diagnosis
- Pareto frontier plots showing accuracy/size trade-offs across all candidates
- Exportable PNG/HTML dashboards and JSON reports of metrics, candidate sets, and front solutions for reproducible deployment
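Latency and throughput measurements of the kind listed above are typically collected with a small timing harness. The following is a generic sketch, not TuneQn's Runner: `run_inference` is a hypothetical zero-argument callable that would wrap one forward pass on the target backend, and the warm-up and repetition counts are arbitrary defaults.

```python
import statistics
import time
from typing import Callable, Dict


def benchmark(run_inference: Callable[[], object],
              warmup: int = 3, runs: int = 20,
              batch_size: int = 1) -> Dict[str, float]:
    """Time repeated calls to `run_inference` and summarize latency and throughput."""
    # Warm-up iterations are excluded so one-time setup costs do not skew the result.
    for _ in range(warmup):
        run_inference()

    samples_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        run_inference()
        samples_ms.append((time.perf_counter() - start) * 1000.0)

    mean_ms = statistics.fmean(samples_ms)
    return {
        "mean_latency_ms": mean_ms,
        "median_latency_ms": statistics.median(samples_ms),
        "throughput_per_s": batch_size * 1000.0 / mean_ms if mean_ms > 0 else float("inf"),
    }


# Stand-in workload; a real run would invoke the deployed model instead.
stats = benchmark(lambda: sum(range(1000)), warmup=2, runs=5)
```

Reporting the median alongside the mean helps when occasional slow iterations (for example, from scheduling noise on shared devices) would otherwise distort the summary.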
6. Empirical Evaluation and Results
TuneQn was benchmarked on MobileNetV2, ShuffleNetV2, EfficientNet-Lite4, and ResNet50 models with both static and dynamic INT8 quantization, deployed on Intel i5 CPUs and Mali G68 MC4 GPUs. Validation used the ILSVRC2017 val set (5,500 images, 300 for per-iteration accuracy).
Key results:
- Accuracy Preservation: On ShuffleNetV2 (GPU dynamic), selective quantization reduced post-quantization accuracy loss by 54.14% versus the fully quantized model.
- Model Size Reduction: MobileNetV2 (GPU dynamic) achieved a 72.9% reduction versus original FP32, with best trade-off candidates identified in the Pareto front.
- Task Scalability: At most 5 candidate models are generated per quantization mode, which keeps evaluation efficient even for models with hundreds of layers.
- Visualization: Diagnostics and Pareto-optimal points provide model developers with actionable insight and facilitate optimal candidate selection.
All empirical claims, equations, and tabular data are directly drawn from (Louloudakis et al., 16 Jul 2025).
7. Practical Implementation and Extensibility
TuneQn is invoked via a Python command line interface, supporting YAML/JSON configuration for model selection, quantization mode (static/dynamic), batch and chunk sizes, evaluation/calibration dataset sizes, and objectives for optimization. Results (including the ONNX model for each candidate, full per-candidate metrics, Pareto-ranked reports, and visualization plots) are stored for reproducible deployment.
Support for both per-channel and per-tensor quantization, automatic (sensitivity-based) or manual (user-specified) exclusion lists, and hardware-agnostic deployment (x86, GPU, ARM accelerators) is integral.
The tool combines the ONNX quantization API with hardware-aware deployment (ONNX Runtime and TVM) and multi-objective analysis in a reproducible, fully open-source stack, enabling systematic quantization tuning for edge, mobile, and server deployment scenarios (Louloudakis et al., 16 Jul 2025).