
Quantization with Unified Adaptive Distillation to enable multi-LoRA based one-for-all Generative Vision Models on edge

Published 31 Mar 2026 in cs.CV and cs.AI | (2603.29535v1)

Abstract: Generative Artificial Intelligence (GenAI) features such as image editing, object removal, and prompt-guided image transformation are increasingly integrated into mobile applications. However, deploying Large Vision Models (LVMs) for such tasks on resource-constrained devices remains challenging due to their high memory and compute requirements. While Low-Rank Adapters (LoRAs) enable parameter-efficient task adaptation, existing mobile deployment pipelines typically compile a separate model binary for each LoRA together with a copy of the foundation model, resulting in redundant storage and increased runtime overhead. In this work, we present a unified framework for enabling multi-task GenAI inference on edge devices using a single shared model. Our key idea is to treat LoRA weights as runtime inputs rather than embedding them into the compiled model graph, allowing dynamic task switching at runtime without recompilation. Then, to support efficient on-device execution, we introduce QUAD (Quantization with Unified Adaptive Distillation), a quantization-aware training strategy that aligns multiple LoRA adapters under a shared quantization profile. We implement the proposed system with a lightweight runtime stack compatible with mobile NPUs and evaluate it across multiple chipsets. Experimental results demonstrate up to 6x reduction in memory footprint and up to 4x improvement in latency, while maintaining high visual quality across multiple GenAI tasks.

Summary

  • The paper introduces a novel QUAD framework that enables a single compiled quantized model to support multiple LoRA adapters through runtime injection.
  • It demonstrates up to 6× reduction in memory footprint and 4× faster inference latency while preserving image quality.
  • The approach refines quantization with adaptive distillation to harmonize disparate LoRA weight distributions for dynamic, efficient edge deployments.

Quantization with Unified Adaptive Distillation: Multi-LoRA One-for-All Generative Vision Models on Edge

Introduction and Problem Statement

This work addresses the deployment of multi-task Large Vision Models (LVMs) on edge devices, focusing on practical techniques for efficient inference under tight memory and compute budgets. The growth of GenAI-based features such as image inpainting and prompt-guided stylization on consumer hardware motivates research into model adaptation methods compatible with resource-limited platforms. While LoRA-based parameter-efficient fine-tuning is standard for task specialization, current edge deployment pipelines compile a separate binary for each LoRA, duplicating the foundation model and inflating both storage and runtime costs.

The primary technical barrier is quantization incompatibility across individually trained LoRA adapters: each adapter typically yields its own quantization scales and offsets, which precludes switching adapters at runtime within a single compiled model (see Figure 1).

Figure 1: Schematic for a simplified version of compilation and deployment, illustrating per-LoRA binary duplication and quantization incompatibility.

Methodology: LoRA as Runtime Input and QUAD

The paper introduces a unified framework that enables a single compiled quantized LVM to serve multiple tasks dynamically by treating LoRA weights as runtime input tensors, rather than statically embedding them into the computation graph. This architectural shift (Figure 2) requires revising the inference logic to expose LoRA weights as explicit model inputs, with graph modifications so that every adapted linear layer accepts A and B (the LoRA low-rank matrices) at runtime.

Figure 2: Unified deployment architecture—modification for LoRA as runtime input, single frozen graph construction, and on-device runtime logic.
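
As a rough illustration of what treating LoRA weights as runtime inputs means for one adapted linear layer, the NumPy sketch below passes A and B as function arguments alongside the frozen weight W. The function name `lora_linear`, the `alpha / r` scaling (the common LoRA convention), the tensor shapes, and the task names are assumptions made for illustration; the paper applies the change to the compiled model graph, not to Python code.

```python
import numpy as np

def lora_linear(x, W, A, B, alpha=1.0):
    """Frozen base projection plus a low-rank update supplied at call time.

    Shapes: x (batch, d_in), W (d_out, d_in), A (r, d_in), B (d_out, r).
    """
    r = A.shape[0]
    base = x @ W.T                 # same frozen weights for every task
    update = (x @ A.T) @ B.T       # low-rank path, never materializes B @ A
    return base + (alpha / r) * update

rng = np.random.default_rng(0)
d_in, d_out, r = 16, 8, 4
W = rng.standard_normal((d_out, d_in))
x = rng.standard_normal((2, d_in))

# Hypothetical task adapters; only these tensors change between tasks.
adapters = {
    "object_removal": (rng.standard_normal((r, d_in)), rng.standard_normal((d_out, r))),
    "stylization": (rng.standard_normal((r, d_in)), rng.standard_normal((d_out, r))),
}
outputs = {task: lora_linear(x, W, A, B) for task, (A, B) in adapters.items()}

# Reference: the same adapter folded into W, as a per-LoRA binary would do.
A, B = adapters["stylization"]
merged = x @ (W + (1.0 / r) * (B @ A)).T
```

Switching tasks here changes only the tensors passed to the call, and the result for a given adapter matches the layer with that adapter merged into W.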

The result is a single deployment artifact for multiple LoRAs, supporting dynamic task switching, over-the-air LoRA updates, and significant reductions in memory and storage footprint. However, injecting different LoRA adapters at runtime makes uniform quantization harder: independently trained LoRAs may have disparate weight distributions, which lead to distinct quantization encodings.

To solve this, the authors propose the QUAD (Quantization with Unified Adaptive Distillation) framework. The method consists of the following steps:

  • Quantization sensitivity analysis selects the most sensitive LoRA adapter (highest divergence in quantized-vs-FP32 output) as the anchor for fixed quantization parameters.
  • All LoRA adapters are then finetuned using knowledge distillation under this shared quantization configuration, enforcing compatibility at inference.
  • If no single LoRA is clearly more sensitive than the others, a global quantization profile is established by merging the adapters' weight distributions.
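
The selection logic in the steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: symmetric per-tensor INT8 quantization, relative L2 error as the divergence measure, and the `margin` threshold for deciding whether one adapter "stands out" are all hypothetical choices, as are the function and task names.

```python
import numpy as np

def int8_scale(t):
    """Symmetric per-tensor INT8 scale taken from the largest magnitude."""
    return float(np.max(np.abs(t))) / 127.0

def fake_quant(t, scale):
    """Round to the INT8 grid defined by `scale`, then map back to float."""
    return np.clip(np.round(t / scale), -127, 127) * scale

def sensitivity(x, A, B):
    """Relative error of the LoRA branch when A and B use their own scales."""
    ref = (x @ A.T) @ B.T
    quant = (x @ fake_quant(A, int8_scale(A)).T) @ fake_quant(B, int8_scale(B)).T
    return float(np.linalg.norm(ref - quant) / np.linalg.norm(ref))

def choose_anchor(scores, margin=1.5):
    """Return the most sensitive adapter, or None if it does not stand out.

    `margin` is a hypothetical threshold: the top score must exceed the
    runner-up by this factor to count as clearly more sensitive.
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    if len(ranked) > 1 and scores[ranked[0]] < margin * scores[ranked[1]]:
        return None
    return ranked[0]

def shared_scales(adapters, anchor):
    """Scales for A and B from the anchor, or from all adapters merged."""
    chosen = [adapters[anchor]] if anchor is not None else list(adapters.values())
    scale_A = max(int8_scale(A) for A, _ in chosen)
    scale_B = max(int8_scale(B) for _, B in chosen)
    return scale_A, scale_B

rng = np.random.default_rng(0)
x = rng.standard_normal((32, 16))
adapters = {
    "object_removal": (rng.standard_normal((4, 16)), rng.standard_normal((8, 4))),
    # Student-t weights give this hypothetical adapter heavier tails.
    "stylization": (rng.standard_t(3, (4, 16)), rng.standard_t(3, (8, 4))),
}
scores = {name: sensitivity(x, A, B) for name, (A, B) in adapters.items()}
anchor = choose_anchor(scores)
scale_A, scale_B = shared_scales(adapters, anchor)
```

Taking the maximum of the per-adapter scales in the fallback branch is equivalent to computing a max-abs scale over the concatenated weights, which is one simple reading of "merging weight distributions".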

This process ensures all LoRAs and the base model share quantization parameters, rendering the runtime graph invariant to adapter choice and compatible with fixed-parameter hardware accelerators. The QUAD workflow is shown in Figure 3.

Figure 3: QUAD framework—adapter selection, quantization parameter sharing, and distillation for deployment.
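
The distillation step can be pictured with a small NumPy sketch: a student copy of an adapter is trained so that its output, computed through fake quantization with fixed shared scales, tracks the full-precision teacher output. Everything here is an assumption for illustration: the mean-squared-error loss, the straight-through estimator for the rounding step, plain gradient descent, the scale values, and the best-checkpoint bookkeeping are not taken from the paper.

```python
import numpy as np

def fake_quant(t, scale):
    """Round to the INT8 grid defined by `scale`, then map back to float."""
    return np.clip(np.round(t / scale), -127, 127) * scale

def distill(x, A_t, B_t, scale_A, scale_B, steps=200, lr=0.01):
    """Fit a student (A, B) whose fake-quantized output tracks the teacher."""
    y_t = (x @ A_t.T) @ B_t.T            # teacher output in full precision
    A, B = A_t.copy(), B_t.copy()        # student starts from the teacher
    best = (np.inf, A.copy(), B.copy())
    history = []
    for _ in range(steps + 1):
        Aq, Bq = fake_quant(A, scale_A), fake_quant(B, scale_B)
        h = x @ Aq.T
        err = h @ Bq.T - y_t
        loss = float(np.mean(err ** 2))
        history.append(loss)
        # Training through rounding is not monotonic, so keep the best weights.
        if loss < best[0]:
            best = (loss, A.copy(), B.copy())
        # Straight-through estimator: treat rounding as identity in the
        # backward pass and zero the gradient of weights that were clipped.
        g = 2.0 * err / err.size
        grad_B = (g.T @ h) * (np.abs(B) <= 127 * scale_B)
        grad_A = ((g @ Bq).T @ x) * (np.abs(A) <= 127 * scale_A)
        A -= lr * grad_A
        B -= lr * grad_B
    return best, history

rng = np.random.default_rng(0)
x = rng.standard_normal((32, 16))
A_teacher = rng.standard_normal((4, 16))
B_teacher = rng.standard_normal((8, 4))

# Hypothetical shared scales, as if inherited from an anchor with range [-4, 4].
shared_scale_A = shared_scale_B = 4.0 / 127.0
best, history = distill(x, A_teacher, B_teacher, shared_scale_A, shared_scale_B)
best_loss, A_student, B_student = best
```

The scales stay fixed throughout, so only the adapter weights move; that is the property that lets every adapter run on the same compiled graph.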

System Architecture and Deployment

Deploying the unified quantized LVM+LoRA stack requires a specialized software infrastructure. Model export via ONNX and further vendor toolchains enables graph-level optimizations, including operator fusion, dead code elimination, parallelization, and static allocation of buffer placeholders for LoRA weights. The complete stack for on-device execution is depicted in Figure 4 and comprises the compiled IR, a lightweight runtime, LoRA loaders, and a task scheduler.

Figure 4: Software stack for "one-for-all" LVMs on mobile; a single foundation model with dynamic LoRA loading serves multiple use-cases.

This system is validated on multiple SoC platforms (Qualcomm, MediaTek, LSI), with dynamic task switching, LoRA caching, and compatibility with quantized operator sets, indicating generality across commercial hardware.
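
To make the roles of the LoRA loader, cache, and task scheduler concrete, here is a small Python sketch. It is a hypothetical stand-in rather than the paper's runtime: the class names, the least-recently-used eviction policy, the cache capacity, and the placeholder model and loader functions are all assumptions.

```python
from collections import OrderedDict

class LoRACache:
    """Keeps the most recently used adapters in memory (LRU eviction)."""

    def __init__(self, loader, capacity=2):
        self.loader = loader          # callable: task name -> adapter tensors
        self.capacity = capacity
        self.loads = 0                # number of times storage was hit
        self._items = OrderedDict()

    def get(self, task):
        if task in self._items:
            self._items.move_to_end(task)      # mark as most recently used
            return self._items[task]
        weights = self.loader(task)
        self.loads += 1
        self._items[task] = weights
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)    # evict least recently used
        return weights

    def cached_tasks(self):
        return list(self._items)

class TaskScheduler:
    """Routes every request to the one shared model with the right adapter."""

    def __init__(self, model_fn, cache):
        self.model_fn = model_fn
        self.cache = cache

    def run(self, task, inputs):
        return self.model_fn(inputs, self.cache.get(task))

# Placeholders for the compiled model and for reading adapters from storage.
def load_from_storage(task):
    return {"task": task, "A": [0.0], "B": [0.0]}

def shared_model(inputs, lora):
    return f"{lora['task']}({inputs})"

cache = LoRACache(load_from_storage, capacity=2)
scheduler = TaskScheduler(shared_model, cache)
requests = ["removal", "style", "removal", "edit", "style"]
results = [scheduler.run(task, "img") for task in requests]
```

The model function is constructed once and reused for every request; only the small adapter payload is fetched or evicted as tasks change.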

Experimental Results

The unified approach offers empirically validated improvements:

  • Memory footprint is reduced by up to 6× compared to redundant per-LoRA binaries.
  • End-to-end latency improves by up to 4× due to the elimination of runtime model reloading.
  • Visual fidelity metrics (FID, LPIPS, SSIM, PSNR) are preserved, confirming the approach maintains generative quality post-INT8 quantization and distillation.
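
The storage argument behind the memory figure can be checked with simple arithmetic. The sizes below are hypothetical placeholders, not measurements from the paper; the sketch only shows how the ratio between per-LoRA binaries and a single shared model behaves as the number of tasks grows.

```python
def per_lora_binaries_mb(base_mb, lora_mb, n_tasks):
    """Every task ships its own copy of the base model plus its adapter."""
    return n_tasks * (base_mb + lora_mb)

def unified_mb(base_mb, lora_mb, n_tasks):
    """One shared base model; each task adds only its adapter."""
    return base_mb + n_tasks * lora_mb

# Hypothetical sizes chosen only to illustrate the trend.
base_mb, lora_mb, n_tasks = 1000, 20, 6
separate = per_lora_binaries_mb(base_mb, lora_mb, n_tasks)
shared = unified_mb(base_mb, lora_mb, n_tasks)
ratio = separate / shared
```

When the adapter is much smaller than the base model, the ratio stays close to the number of tasks, which is why the saving grows as more use-cases are added.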

For prompt-guided image transformation, quantitative parity is observed between server-side FP32 and on-device INT8 execution (see Figure 5).

Figure 5: Prompt-guided image transformation—server FP32 vs. device INT8 output fidelity.

Object removal use-cases show negligible quality loss after quantization, as reflected in the FID and SSIM metrics (Figure 6).

Figure 6: Object removal performance; on-device INT8 model preserves image quality.

Comprehensive profiling across multiple chipsets demonstrates consistent improvements in peak RAM, ROM requirements, and total inference time for both high- and medium-capacity LVM backbones.

Ablation Studies: Quantization Strategies

The paper further investigates the impact of mixed-precision quantization. Limiting quantization to W8A16 (weights INT8, activations INT16) gives the best balance between accuracy and memory reduction among the configurations studied, whereas more aggressive INT8 quantization (W8A8) of LoRA weights and activations saves more memory at a measurable cost in output quality. Mixed-precision configurations therefore offer tunable trade-offs, giving system designers fine-grained control over deployment accuracy versus resource utilization.
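
The direction of this trade-off can be reproduced on a toy linear layer. The sketch below is an assumption-laden illustration, not the paper's ablation: it uses symmetric per-tensor fake quantization, random Gaussian weights and activations, and relative L2 error against the full-precision output as the quality proxy.

```python
import numpy as np

def fake_quant(t, bits):
    """Symmetric per-tensor quantization to `bits`, mapped back to float."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(t)) / qmax
    return np.clip(np.round(t / scale), -qmax, qmax) * scale

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 16))     # toy layer weights
x = rng.standard_normal((64, 16))    # toy activations
ref = x @ W.T                        # full-precision reference output

def relative_error(act_bits, weight_bits=8):
    """Output error when activations and weights are quantized separately."""
    y = fake_quant(x, act_bits) @ fake_quant(W, weight_bits).T
    return float(np.linalg.norm(y - ref) / np.linalg.norm(ref))

err_w8a8 = relative_error(act_bits=8)
err_w8a16 = relative_error(act_bits=16)

# Activation storage per element: 1 byte for INT8, 2 bytes for INT16.
act_bytes = {"W8A8": x.size * 1, "W8A16": x.size * 2}
```

In this toy setting W8A16 leaves essentially only the weight quantization error, while W8A8 adds activation error on top in exchange for halving activation storage.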

Implications and Future Directions

This work advances practical deployment of modular, parameter-efficient generative vision models on edge hardware. By fully decoupling LoRA adapters from model binaries and resolving quantization compatibility at the deployment stage, this approach enables scalable, updatable, and memory-efficient on-device GenAI, significantly lowering total cost of ownership as the number of supported use-cases grows.

The approach generalizes to other foundation models and to any scenario that requires runtime injection of trained adapters. Future developments may extend to real-time LoRA personalization, federated LoRA learning, or distributed edge-to-cloud multi-adapter orchestration. Research into more adaptive quantization and fine-tuning strategies could further improve the efficiency and fidelity achievable for on-device LVMs.

Conclusion

By reframing LoRA integration as a runtime operation and proposing the QUAD knowledge-distillation-anchored quantization strategy, this work enables robust, efficient, multi-task GenAI model deployment on resource-constrained edge devices. The unified stack circumvents the constraints of conventional per-task binaries, achieving substantial gains in adaptability, memory, and inference latency—all while sustaining competitive generative performance across diverse applications.
