Multi-Tenant DNN Inference Overview
- Multi-Tenant DNN Inference is a paradigm for concurrently executing deep neural network tasks from multiple users on shared hardware, enhancing resource utilization and meeting diverse SLOs.
- Architectures decouple API routing, global control, scheduling, and execution engines across cloud, edge, and accelerator platforms using spatial and temporal sharing.
- Optimization techniques like batching, ML-guided scheduling, and secure isolation improve throughput, latency, and robustness against resource fragmentation and side-channel attacks.
Multi-Tenant DNN Inference
Multi-tenant DNN inference refers to simultaneous execution of deep neural network inference tasks originating from multiple, logically distinct users or applications (“tenants”) on shared hardware resources. This paradigm, foundational for both cloud and edge AI infrastructure, aims to maximize utilization, support diverse service-level objectives (SLOs), and isolate or prioritize workloads under contention, while mitigating security, performance, and resource fragmentation challenges.
1. Architectural Principles and System Models
Multi-tenant inference is deployed across cloud servers, edge platforms, and dedicated accelerators. The canonical architecture decouples request routing (API front-end), global control (model repository, admission, and mapping), scheduling (per-request and per-batch), and execution engines (workers/accelerators). In the cloud, stateless Predict APIs hide heterogeneous backends; on edge or IoT, tighter integration with resource schedulers (CPU/GPU/DLA) is typical (Samanta et al., 2019, Tayal et al., 12 Mar 2025).
Key resource dimensions include compute (CPU, GPU SMs, FPGA LUTs/DSPs, PIM tiles), memory (on-chip caches, DRAM, VRAM), and interconnect/bandwidth. Modern systems also target power and carbon constraints (Paramanayakam et al., 6 Mar 2025). Formally, inference scheduling is cast as a mixed-integer programming problem where decision variables encode model placement, concurrency level (tenancy), batch size, operator partitioning, and possible model variant selection. Constraints enforce per-tenant SLOs and system resource caps; some formulations instead minimize a carbon-delay product subject to latency, power, and accuracy bounds (Paramanayakam et al., 6 Mar 2025).
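The optimization problem described above can be made concrete with a small exhaustive search. The sketch below is only illustrative: the profile table, the linear latency model, the 20% contention factor, and the carbon-intensity constant are all assumptions made for the example, not values from the cited work, and a real system would use an MIP solver or guided search instead of enumeration.

```python
from dataclasses import dataclass
from itertools import product

# Hypothetical per-variant profile; real systems would measure these offline.
VARIANTS = {
    "full":  {"overhead_ms": 3.0, "ms_per_item": 4.0, "accuracy": 0.76, "watts": 30.0},
    "small": {"overhead_ms": 2.0, "ms_per_item": 1.5, "accuracy": 0.71, "watts": 12.0},
}

@dataclass(frozen=True)
class Metrics:
    latency_ms: float
    power_w: float
    accuracy: float
    throughput_rps: float
    energy_j_per_req: float

def estimate(variant, batch, tenancy):
    """Toy cost model (assumed): linear in batch, 20% slowdown per extra co-located instance."""
    v = VARIANTS[variant]
    latency_ms = (v["overhead_ms"] + v["ms_per_item"] * batch) * (1 + 0.2 * (tenancy - 1))
    power_w = v["watts"] * tenancy
    requests = batch * tenancy
    latency_s = latency_ms / 1000.0
    return Metrics(latency_ms, power_w, v["accuracy"],
                   requests / latency_s, power_w * latency_s / requests)

def solve(latency_slo_ms, power_cap_w, min_accuracy, min_throughput_rps,
          carbon_g_per_j=1e-4):
    """Enumerate (variant, batch, tenancy) and keep the feasible point with the
    lowest carbon-delay product. Returns None if nothing is feasible."""
    best = None
    for variant, batch, tenancy in product(VARIANTS, (1, 2, 4, 8, 16), (1, 2, 3, 4)):
        m = estimate(variant, batch, tenancy)
        feasible = (m.latency_ms <= latency_slo_ms
                    and m.power_w <= power_cap_w
                    and m.accuracy >= min_accuracy
                    and m.throughput_rps >= min_throughput_rps)
        if not feasible:
            continue
        # Carbon-delay product: grams of CO2 per request times latency.
        cdp = carbon_g_per_j * m.energy_j_per_req * m.latency_ms
        if best is None or cdp < best[0]:
            best = (cdp, (variant, batch, tenancy), m)
    return best
```

Calling `solve(50.0, 60.0, 0.70, 500.0)` returns the lowest-cost feasible configuration under this toy model; tightening the latency bound far enough makes the function return `None`, which corresponds to rejecting the request at admission.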
2. Resource Partitioning, Scheduling, and Sharing Mechanisms
Two broad classes of sharing exist:
- Spatial Partitioning: Physical resources are divided among tenants. Examples include GPU SM blocks via MIG/MPS (Lee et al., 2024), combined temporal and spatial workspaces of PIM/ISAAC tiles (Li et al., 2024), and vertical splits of a systolic array (Reshadi et al., 2023). Dynamic partitioners track active tenants and reallocate partitions as layers complete or workloads shift.
- Temporal/Multiplexed Sharing: Operators or entire models interleave at fine (operator) or coarse (stage, model) granularity, with runtime or ML-guided scheduling aligning launches to resource availability, often using CUDA stream primitives on GPUs (Yu et al., 2021, Yu et al., 2023).
Frameworks such as ParvaGPU combine hardware primitives (e.g., NVIDIA MIG, MPS) with offline profiling and online bin-packing algorithms to minimize both internal slack and fragmentation subject to SLOs (Lee et al., 2024).
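The packing step can be illustrated with a generic first-fit-decreasing heuristic. This is a minimal sketch, not ParvaGPU's actual algorithm: it assumes each GPU exposes 7 interchangeable slices and ignores the placement rules that real MIG profiles impose, and the function name and data layout are hypothetical.

```python
def pack_first_fit_decreasing(demands, capacity=7):
    """Assign tenants to GPUs using first-fit decreasing.

    demands:  dict mapping tenant name -> number of slices required
              (assumed to come from offline profiling against the SLO).
    capacity: slices per GPU (7 is an assumption for this sketch).
    Returns a list of GPUs, each {"free": int, "tenants": {name: size}}.
    """
    gpus = []
    # Largest demands first, so big instances are placed before small ones.
    for name, size in sorted(demands.items(), key=lambda kv: -kv[1]):
        if size > capacity:
            raise ValueError(f"{name} needs {size} slices, more than one GPU offers")
        for gpu in gpus:
            if gpu["free"] >= size:
                gpu["tenants"][name] = size
                gpu["free"] -= size
                break
        else:
            # No existing GPU has room: open a new one.
            gpus.append({"free": capacity - size, "tenants": {name: size}})
    return gpus
```

The sum of the `free` fields across the returned GPUs is the leftover slack, which is the quantity a fragmentation-aware packer tries to keep small.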
Software-oriented techniques (e.g., GACER, MoCA) introduce spatial decomposition (splitting operators into micro-batches), temporal segmentation (DFG synchronization points), and runtime regulation (e.g., dynamic memory bandwidth throttling in MoCA) (Yu et al., 2023, Kim et al., 2023).
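Runtime regulation of memory bandwidth can be pictured as a per-tenant windowed counter: a tenant may issue only a fixed number of memory requests per window and is stalled once it exceeds that number. The class below is a simplified software analogy written for illustration; the class name, parameters, and cycle-based interface are assumptions and do not describe MoCA's hardware design.

```python
class WindowedThrottle:
    """Allow at most `max_requests` memory requests per `window_cycles` cycles."""

    def __init__(self, window_cycles, max_requests):
        self.window_cycles = window_cycles
        self.max_requests = max_requests
        self.window_start = 0
        self.issued = 0

    def try_issue(self, cycle):
        """Return True if a request may be issued at `cycle`, False if it must stall."""
        elapsed = cycle - self.window_start
        if elapsed >= self.window_cycles:
            # Align the new window to a multiple of the window length and reset the count.
            self.window_start = cycle - elapsed % self.window_cycles
            self.issued = 0
        if self.issued < self.max_requests:
            self.issued += 1
            return True
        return False
```

A runtime could lower `max_requests` for a tenant that is comfortably ahead of its deadline and raise it for one that is at risk, which is the kind of dynamic adjustment the paragraph above refers to.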
Table: Representative Resource Partitioning Approaches
| Approach | Resource Mode | Target Hardware | Key Optimization |
|---|---|---|---|
| ParvaGPU | Spatial + MPS | GPU (MIG+MPS) | Fragmentation minimization |
| GACER | Spatial + Temporal | GPU | Minimize spatial/temporal residue |
| Collaborative PIM | Spatial + Op. | ReRAM PIM | 2-level area and pipeline balance |
| Dynamic Partition | Spatial | Systolic Array Accel | On-the-fly vertical sub-arrays |
| MoCA | Temporal/Memory | Tile-based DNN Accel | Memory bandwidth regulation |
3. Throughput, Latency, and QoS Optimization
Throughput maximization hinges on the interplay of:
- Batching: Batching maximizes data-parallel throughput but may harm bursty or latency-sensitive requests.
- Tenancy Level: Increasing the number of concurrent in-flight model instances improves utilization, especially for lightweight or “skinny” DNNs that underutilize GPU SMs individually. DNNScaler detects which regime dominates for a given workload and adapts batch size (b) or tenancy level (n) accordingly, using profiling and SVD-based latency estimation (Nabavinejad et al., 2023).
Regimes identified:
- Memory-bound, large (e.g., ResNet-152): batching is superior.
- Compute-bound, small (e.g., MobileNet): multi-tenancy/parallel streams are superior.
Heuristic or ML-guided search (coordinate descent, random, or LA-MCTS) tunes these parameters under user SLO constraints (Yu et al., 2021, Paramanayakam et al., 6 Mar 2025).
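A coordinate-descent search over batch size and tenancy level can be sketched as follows. The latency model here is a made-up stand-in for profiled measurements, and the candidate grids and function names are assumptions for the example; the point is only to show how the two knobs are tuned alternately under a latency SLO.

```python
def latency_ms(b, n, overhead=2.0, per_item=0.5, contention=0.3):
    """Toy latency model (assumed): linear in batch, slowed by co-located instances."""
    return (overhead + per_item * b) * (1 + contention * (n - 1))

def throughput_rps(b, n):
    """Requests per second when n instances each process a batch of b."""
    return 1000.0 * b * n / latency_ms(b, n)

def coordinate_search(slo_ms, batches=(1, 2, 4, 8, 16, 32), tenancies=(1, 2, 3, 4, 6, 8)):
    """Alternately tune batch size and tenancy to raise throughput within the SLO.

    Returns (batch, tenancy), or None if even the smallest setting violates the SLO.
    """
    b, n = batches[0], tenancies[0]
    if latency_ms(b, n) > slo_ms:
        return None
    improved = True
    while improved:
        improved = False
        # Sweep batch size with tenancy held fixed.
        for cand in batches:
            if latency_ms(cand, n) <= slo_ms and throughput_rps(cand, n) > throughput_rps(b, n):
                b, improved = cand, True
        # Sweep tenancy with batch size held fixed.
        for cand in tenancies:
            if latency_ms(b, cand) <= slo_ms and throughput_rps(b, cand) > throughput_rps(b, n):
                n, improved = cand, True
    return b, n
```

The loop terminates because throughput strictly increases on every accepted move and the grid is finite. Like any coordinate method it can stop at a local optimum, which is one reason the cited systems also consider random or tree-based search.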
Empirical results:
- DNNScaler: up to 14× throughput improvements, 218% average uplift over prior approaches, matching latency SLOs (Nabavinejad et al., 2023).
- GACER: 1.37–1.66× speed-up and ≈40% occupancy gains over single-tenant baselines (Yu et al., 2023).
- ParvaGPU: 46.5% GPU savings vs. gpulet, no SLO violations, and 3–5% residual slack (Lee et al., 2024).
- MoCA: up to 3.9× better SLA satisfaction, 2.3× higher throughput vs. static/temporal partitioning (Kim et al., 2023).
4. Isolation, Security, and Attack Surfaces
Multi-tenancy introduces complex security considerations: co-resident malicious tenants can exploit shared resource timing or data channels.
- Side-Channel Defenses: SESAME provides compiler- and hardware-managed software-defined enclaves: per-tenant queues, scratchpad partitioning, traffic shaping to erase timing channel correlations, and base/bound checks (Banerjee et al., 2020). Runtime and code-size overheads vary with the threat model (≈4–34%).
- Adversarial Attacks: Deep-Dup demonstrates that FPGA-based multi-tenant accelerators are vulnerable to fault attacks induced through the shared power distribution network: adversaries can inject transient glitches to duplicate (adversarial weight duplication, “AWD”) or corrupt critical DNN weight packages, causing catastrophic accuracy loss with as few as 1 injection in small models (Rakin et al., 2020). Defenses include on-chip storage for critical weights, randomization of DMA bursts, and architectural isolation, but with notable resource overheads.
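The idea behind traffic shaping as a timing-channel defense can be shown with a generic constant-rate shaper: transfers leave at fixed intervals whether or not the tenant has real work, with idle slots filled by dummy transfers. This is a textbook-style sketch under assumed names and units, not SESAME's mechanism.

```python
from collections import deque

def shape(arrivals, slot, horizon):
    """Emit exactly one transfer every `slot` time units until `horizon`.

    arrivals: sorted list of request arrival times for one tenant.
    Returns a list of (emit_time, kind), where kind is "real" or "dummy".
    """
    queue = deque()
    out = []
    i = 0
    t = 0
    while t < horizon:
        # Admit everything that has arrived by the start of this slot.
        while i < len(arrivals) and arrivals[i] <= t:
            queue.append(arrivals[i])
            i += 1
        if queue:
            queue.popleft()
            out.append((t, "real"))
        else:
            # Pad the idle slot so the observable schedule does not change.
            out.append((t, "dummy"))
        t += slot
    return out
```

An observer who sees only emission times gets the same schedule for a bursty tenant and a sparse one. The price is added queuing delay and wasted bandwidth on dummy transfers, consistent with the non-trivial overheads reported for such defenses.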
5. Heterogeneity and Edge Deployments
Edge devices often bundle multiple distinct accelerators (CUDA cores, Tensor Cores, DLA), yielding additional scheduling challenges:
- Multi-Accelerator Mapping: On the Jetson AGX Orin, combining CUDA and Tensor Cores delivers peak throughput at large batch sizes; the DLA benefits small batches but can trigger contention and interference (Tayal et al., 12 Mar 2025). Optimal scheduling allocates requests to resources by jointly considering batch size, precision constraints, and fallback paths.
- Model Adaptation and Sustainability: Ecomap dynamically selects “mixed-quality” model variants in response to power constraints and real-time carbon intensity, guided by a transformer-based estimator within a LA-MCTS scheduling loop. This strategy reduces operational carbon emissions by 30–40% and improves carbon-delay-product by up to 60% (Paramanayakam et al., 6 Mar 2025).
- On-Device Multi-App Adaptation: NestDNN supports per-application accuracy–resource trade-offs using “multi-capacity” nested models; the scheduler dynamically selects the most efficient sub-model under available resource budgets, delivering up to 2× frame rates and 1.7× energy reduction (Fang et al., 2018).
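The variant-selection step common to the adaptation schemes above reduces, in its simplest form, to picking the most accurate model variant that fits the current budget. The sketch below assumes a hypothetical list of profiled variants with `memory_mb`, `latency_ms`, and `accuracy` fields; it is not the scheduler of NestDNN or Ecomap, both of which use richer cost models.

```python
def pick_variant(variants, memory_budget_mb, latency_budget_ms):
    """Return the most accurate variant that fits both budgets, or None.

    variants: list of dicts with keys "name", "memory_mb", "latency_ms", "accuracy"
              (hypothetical profile format for this example).
    """
    feasible = [
        v for v in variants
        if v["memory_mb"] <= memory_budget_mb and v["latency_ms"] <= latency_budget_ms
    ]
    # max() with default=None covers the case where nothing fits.
    return max(feasible, key=lambda v: v["accuracy"], default=None)

# Illustrative numbers only.
PROFILE = [
    {"name": "large",  "memory_mb": 400, "latency_ms": 40, "accuracy": 0.78},
    {"name": "medium", "memory_mb": 200, "latency_ms": 22, "accuracy": 0.74},
    {"name": "tiny",   "memory_mb": 60,  "latency_ms": 8,  "accuracy": 0.66},
]
```

Re-running `pick_variant` whenever the budget changes (another app starts, power is capped, or carbon intensity rises) gives the dynamic behaviour described above; with the sample profile, a 250 MB and 30 ms budget selects the `medium` variant.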
6. Application-Specific Strategies and Predictability
Application requirements motivate further specialization:
- Autonomous Vehicles: PP-DNN achieves predictability by selecting only critical frames and ROIs for multi-tenant DNN processing, supported by SSIM-based frame difference and bounding box trackers, plus lightweight scheduling to bound fusion delay variance and completeness (Liu et al., 11 Feb 2026). The result is a 7.3× increase in fusion frames and 75% gain in detection completeness over naive approaches.
- Consumer Edge Cascades: MultiTASC orchestrates heterogeneous device-server cascades with confidence-based local thresholding, adaptive forwarding, and global queuing control to maximize SLO satisfaction (>90% at scale) and sustain accuracy (Nikolaidis et al., 2023).
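Confidence-based cascading with an adaptive threshold can be sketched in a few lines. The functions below are a simplified illustration under assumed names, step size, and target; they are not MultiTASC's control algorithm.

```python
def route(confidence, threshold):
    """Keep the on-device prediction if it is confident enough, else forward to the server."""
    return "local" if confidence >= threshold else "server"

def adapt_threshold(threshold, slo_satisfaction, target=0.9, step=0.05):
    """Adjust a device's forwarding threshold from observed server-side SLO satisfaction.

    If the server is missing its SLO target, lower the threshold so fewer samples
    are forwarded; otherwise raise it so more samples get the server's more accurate
    model. The result is clamped to [0, 1].
    """
    if slo_satisfaction < target:
        threshold -= step
    else:
        threshold += step
    return min(1.0, max(0.0, threshold))
```

For example, a device at threshold 0.8 that observes 70% SLO satisfaction would drop to about 0.75 and forward fewer samples on the next round, trading some accuracy for lower server load.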
7. Open Problems and Future Directions
Research continues on:
- Dynamic, reconfigurable hardware partitioning for arbitrarily-varying, online multi-tenant workloads (Li et al., 2024).
- Fully generalized resource management—beyond SMs to L2, VRAM, and interconnects—with compositional modeling and runtime feedback (Yu et al., 2023, Kim et al., 2023).
- Secure multi-tenancy under realistic threat models, minimizing overhead (Banerjee et al., 2020).
- Extension of joint batching–tenancy adaptation to transformer and diffusion models at scale (Nabavinejad et al., 2023).
- Integrating carbon-awareness and model quality tiering into cluster-level scheduling (Paramanayakam et al., 6 Mar 2025).
- Cross-platform multi-tenancy (e.g., across GPU, TPU, ASIC, PIM) and hybrid edge-cloud deployments.
Multi-tenant DNN inference is thus both a unifying abstraction for large-scale AI serving and a focal point for ongoing advances in resource optimization, predictability, hardware/software co-design, and trustworthy AI infrastructure.