
Multi-Instance GPU Partitioning

Updated 30 January 2026
  • Multi-Instance GPU Partitioning is a hardware mechanism that subdivides GPUs into fully isolated slices, each allocated dedicated SMs, caches, and memory banks.
  • It delivers robust multi-tenancy and performance guarantees by ensuring strict resource isolation and near-linear throughput scaling in deep learning, HPC, and AI inference workloads.
  • Dynamic scheduling algorithms, including greedy and reinforcement learning approaches, are employed to mitigate fragmentation and optimize GPU resource utilization.

NVIDIA's Multi-Instance GPU (MIG) partitioning technique is a hardware-enforced spatial partitioning mechanism that enables a physical GPU (most notably Ampere-class devices such as the A100, as well as newer Hopper and Blackwell parts) to be subdivided into multiple, fully isolated GPU instances, or "slices". Each instance receives a dedicated, non-overlapping subset of Streaming Multiprocessors (SMs), on-chip caches, and DRAM banks, and is exposed as a separate logical GPU to software frameworks. This hard partitioning supports true multi-tenancy, strong resource isolation, and fine-grained cluster resource management for deep learning, HPC, and AI inference workloads (Zhang et al., 2023, Kim et al., 12 Nov 2025, Siavashi et al., 4 Feb 2025, Turkkan et al., 2024).

1. Hardware Architecture and Partitioning Model

Modern NVIDIA GPUs that support MIG have a modular microarchitecture comprising SMs (grouped into GPCs), L1/L2 caches, and multiple HBM memory controllers. Under MIG partitioning, the chip is pre-divided into a fixed number of SM and memory "slices" (e.g., 7 compute slices and 8 memory slices on the A100-80GB), from which legal "MIG instance profiles" are carved (Martín et al., 12 Jan 2026, Kim et al., 12 Nov 2025).

Each supported profile specifies the number of contiguous SM and memory slices assigned:

| Profile | SM slices | Memory slices | Maximum per GPU | Example capacity |
| --- | --- | --- | --- | --- |
| 1g.5gb | 1 | 1 | 7 | 1/7 of SMs, 1/8 of DRAM (5 GB) |
| 1g.10gb | 1 | 2 | 4 | 1/7 of SMs, 1/4 of DRAM (10 GB) |
| 2g.10gb | 2 | 2 | 3 | 2/7 of SMs, 1/4 of DRAM |
| 3g.20gb | 3 | 4 | 2 | 3/7 of SMs, 1/2 of DRAM (20 GB) |
| 4g.20gb | 4 | 4 | 1 | 4/7 of SMs, 1/2 of DRAM |
| 7g.40/80gb | 7 | 8 | 1 | all SMs, full DRAM |

These partitioning rules are hardcoded in firmware and strictly enforced at slice boundaries. Each MIG instance is mapped to a distinct set of SMs and memory banks, exposing a unique device UUID for scheduling/container orchestration (Zhang et al., 2023, Robroek et al., 2022).
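
To make the profile menu and placement rules concrete, the following sketch (illustrative, not drawn from any of the cited papers) encodes the A100 profiles from the table above as plain Python data and checks whether a requested profile fits at a given compute-slice index. The per-profile start indices approximate NVIDIA's documented placement constraints and should be verified against the placements reported by the driver for a specific device; memory-slice alignment is omitted for brevity.

```python
# Illustrative model of MIG profiles on an A100 (7 compute slices, 8 memory slices).
# Profile sizes follow the table above; legal start indices approximate the
# documented placement rules and should be checked against the driver's
# reported placements on real hardware. Only compute slices are modeled here.

from dataclasses import dataclass

@dataclass(frozen=True)
class MigProfile:
    name: str
    sm_slices: int         # compute slices out of 7
    mem_slices: int        # memory slices out of 8
    legal_starts: tuple    # allowed compute-slice start indices (illustrative)

PROFILES = {
    "1g.5gb":  MigProfile("1g.5gb",  1, 1, (0, 1, 2, 3, 4, 5, 6)),
    "1g.10gb": MigProfile("1g.10gb", 1, 2, (0, 1, 2, 3, 4, 5, 6)),
    "2g.10gb": MigProfile("2g.10gb", 2, 2, (0, 2, 4)),
    "3g.20gb": MigProfile("3g.20gb", 3, 4, (0, 4)),
    "4g.20gb": MigProfile("4g.20gb", 4, 4, (0,)),
    "7g.40gb": MigProfile("7g.40gb", 7, 8, (0,)),
}

def fits(free_slices: set, profile: MigProfile, start: int) -> bool:
    """True if `profile` can be placed at `start` given the free compute slices."""
    if start not in profile.legal_starts:
        return False
    needed = set(range(start, start + profile.sm_slices))
    return needed <= free_slices

# Example: with all 7 compute slices free, a 3g profile fits only at index 0 or 4.
free = set(range(7))
p = PROFILES["3g.20gb"]
print([s for s in p.legal_starts if fits(free, p, s)])   # -> [0, 4]
```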

2. Resource Isolation and Performance Guarantees

Each MIG instance is strictly isolated at the hardware level—compute, L2/L1 caches, HBM memory, and memory controller resources are statically partitioned:

  • Compute and L2 Isolation: Each instance has exclusive access to its SMs and private L2 cache banks. Contexts in one instance cannot evict cache lines or preempt SMs in another (Zhang et al., 2023).
  • Memory Isolation: Each instance is guaranteed exclusive access to a fixed set of HBM memory controllers and address ranges.
  • DRAM Bandwidth: Contention-free across instances, as each gets a statically defined bandwidth proportional to its assigned memory slices.
  • PCIe and System Bus: Shared across instances—introducing residual contention for DMA, especially in cluster/NUMA setups (Darzi et al., 27 Aug 2025).

Empirically, throughput scales near-linearly with allocated SMs/memory up to medium batch sizes; low tail-latency is preserved even under oversubscription, and SLO guarantees (e.g., p99 latency) are maintained in multi-tenant inference scenarios (Zhang et al., 2023, Martín et al., 12 Jan 2026, Lee et al., 2024).
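
As a back-of-the-envelope illustration of this static apportioning, the sketch below derives each profile's nominal compute and DRAM-bandwidth share from its slice counts. The peak-bandwidth figure is an assumed placeholder, and measured scaling deviates at small batch sizes or under host-side contention, so this is an upper-bound estimate rather than a reported result.

```python
# Nominal per-instance shares under static MIG partitioning (illustrative arithmetic).
# Assumes 7 compute slices and 8 memory slices per GPU, and a placeholder peak
# DRAM bandwidth of ~2.0 TB/s; measured scaling will differ in practice.

TOTAL_SM_SLICES = 7
TOTAL_MEM_SLICES = 8
PEAK_DRAM_BW_TBPS = 2.0   # assumed placeholder, not a measured value

profiles = {   # name: (SM slices, memory slices), following the table in Section 1
    "1g.5gb":  (1, 1),
    "2g.10gb": (2, 2),
    "3g.20gb": (3, 4),
    "7g.40gb": (7, 8),
}

for name, (sm, mem) in profiles.items():
    compute_share = sm / TOTAL_SM_SLICES
    bw_share = mem / TOTAL_MEM_SLICES
    print(f"{name}: ~{compute_share:.0%} of SMs, ~{bw_share:.0%} of DRAM bandwidth "
          f"(~{bw_share * PEAK_DRAM_BW_TBPS:.2f} TB/s)")
```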

3. Partitioning Constraints, Fragmentation, and Scheduling Challenges

Partitioning flexibility is strongly constrained:

  • Fixed Profiles and Contiguity: Only certain (SM, memory) combinations are valid; each partition must be composed of contiguous slices and start at a legal index (e.g., 2g.10gb partitions must align with available index sets) (Kim et al., 12 Nov 2025, Ting et al., 18 Dec 2025).
  • Fragmentation: Rigid placement constraints lead to both internal (unused resources within allocated slices) and external (unschedulable free slices due to misalignment) fragmentation. Quantitative metrics such as the fragmentation score F count unschedulable regions (Zambianco et al., 24 Nov 2025, Ting et al., 18 Dec 2025).
  • Dynamic Scheduling: Workload arrivals and departures in a multi-tenant cloud can rapidly fragment resources, degrading acceptance rates and overall utilization if not mitigated (Siavashi et al., 4 Feb 2025, Kim et al., 12 Nov 2025).
  • Reconfiguration Overhead: Changing the partitioning on hardware incurs substantial downtime, as active contexts must be quiesced and the device re-initialized (on the A100, O(10)–O(100) s per drain/repartition event) (Kim et al., 12 Nov 2025).

Effective scheduling requires fragmentation-aware resource allocation, often via sophisticated heuristics or search algorithms, to maximize workload acceptance and cluster utilization.
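
The toy example below illustrates external fragmentation on a single GPU's seven compute slices: after departures leave two non-adjacent free slices, a two-slice profile cannot be placed even though enough aggregate capacity remains. The even-index alignment rule is a simplified stand-in for the real placement constraints.

```python
# Toy illustration of external fragmentation on one GPU's 7 compute slices.
# None marks a free slice; strings mark slices held by tenants. Contiguity plus
# an even start index for 2-slice profiles is a simplified placement rule.

slices = ["jobA", None, "jobB", None, "jobC", "jobC", "jobC"]   # 2 free slices total

def can_place_2slice(slices):
    """A 2-slice profile needs two adjacent free slices starting at an even index."""
    for start in (0, 2, 4):
        if slices[start] is None and slices[start + 1] is None:
            return True
    return False

print(sum(s is None for s in slices))   # 2 free slices in aggregate...
print(can_place_2slice(slices))         # ...but no 2-slice profile fits -> False
```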

4. Scheduling Algorithms and Partition Optimization

A range of algorithms have been developed to address MIG partitioning and scheduling:

  • Greedy Fragmentation-Minimizing Allocators (e.g., Minimum Fragmentation Increment (MFI)): On each arrival, choose the (GPU, index) pair that minimally increases a formal fragmentation metric, reserving larger contiguous slices for future large profiles (Zambianco et al., 24 Nov 2025, Ting et al., 18 Dec 2025).
  • Multi-Objective Placement: Integer linear programming (ILP) and heuristics balancing request acceptance, active hardware, and migration cost, e.g., the GRMU framework introduces quota-based double baskets and intra-GPU defragmentation (Siavashi et al., 4 Feb 2025).
  • Moldable and Dynamic Scheduling: The FAR algorithm integrates moldable task allocation with MIG’s binary tree reconfiguration, combining list scheduling and repartitioning-tree search for makespan minimization (Villarrubia et al., 18 Jul 2025).
  • AutoML and RL Techniques: Hierarchical resource partitioning with deep reinforcement learning (e.g., DQN) for co-optimizing MIG and MPS allocations over job windows, achieving significant throughput gains (Saroliya et al., 2024).
  • Dynamic Power and Energy-Aware Partitioning: Combining MIG with per-partition power modeling, using ML-based models for fair and accurate attribution, and jointly optimizing MIG configuration and power capping under throughput and fairness constraints (Vamja et al., 29 Jan 2025, Arima et al., 2024).

A representative formalism for fragmentation cost per GPU, capturing external fragmentation across all profiles, is

FragCost(G_i) = 1 - \frac{1}{|\mathcal M|} \sum_{j=1}^{|\mathcal M|} \frac{feasible\_mig\_num(G_i, M_j)}{ideal\_mig\_num(G_i, M_j)},

where feasible_mig_num denotes the number of valid, index-aligned placements and ideal_mig_num is the unconstrained slice count for the profile (Ting et al., 18 Dec 2025).
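
A minimal sketch of this cost, together with a greedy allocator in the spirit of the MFI heuristic listed above, is shown below. The profile sizes and legal start indices are illustrative, ideal_mig_num is taken to be the free slice count divided by the profile size, and the allocator simply picks the feasible placement that minimizes total FragCost across GPUs after allocation; the cited papers' exact formulations may differ in detail.

```python
# Sketch of the fragmentation cost above plus an MFI-style greedy placement.
# Profile sizes and legal start indices are illustrative; on real hardware they
# come from the device's reported GPU-instance placements.

PROFILES = {          # name: (compute slices, legal start indices)
    "1g": (1, (0, 1, 2, 3, 4, 5, 6)),
    "2g": (2, (0, 2, 4)),
    "3g": (3, (0, 4)),
    "4g": (4, (0,)),
    "7g": (7, (0,)),
}

def feasible_mig_num(free, profile):
    """Number of valid, index-aligned placements of `profile` on the free slice set."""
    size, starts = PROFILES[profile]
    return sum(all(i in free for i in range(s, s + size)) for s in starts)

def ideal_mig_num(free, profile):
    """Unconstrained count: how many instances the free capacity alone could hold."""
    size, _ = PROFILES[profile]
    return len(free) // size

def frag_cost(free):
    """FragCost = 1 - mean over profiles of feasible/ideal; profiles whose ideal
    count is zero contribute a ratio of 1 (no possible demand, so no penalty)."""
    ratios = []
    for p in PROFILES:
        ideal = ideal_mig_num(free, p)
        ratios.append(feasible_mig_num(free, p) / ideal if ideal else 1.0)
    return 1.0 - sum(ratios) / len(ratios)

def mfi_place(gpus, profile):
    """Greedy MFI-style choice: among all feasible (gpu, start) placements,
    pick the one minimizing total FragCost across GPUs after allocation."""
    size, starts = PROFILES[profile]
    best = None
    for g, free in enumerate(gpus):
        for s in starts:
            needed = set(range(s, s + size))
            if needed <= free:
                cost = sum(frag_cost(f - needed if i == g else f)
                           for i, f in enumerate(gpus))
                if best is None or cost < best[0]:
                    best = (cost, g, s)
    return best   # None if the request cannot be placed on any GPU

# Example: two empty 7-slice GPUs; place a 3g instance, then a 2g instance.
gpus = [set(range(7)), set(range(7))]
_, g, s = mfi_place(gpus, "3g")
gpus[g] -= set(range(s, s + 3))
print(mfi_place(gpus, "2g"))   # packs onto the already-used GPU
```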

5. Integration with Software Ecosystems and Cluster Management

MIG partitions are surfaced as separate CUDA devices, each with UUIDs and physical resource guarantees:

  • Orchestration: Kubernetes plugins and custom controllers treat each MIG slice as a resource, supporting per-instance scheduling and job isolation at scale (Tan et al., 2021, Kim et al., 12 Nov 2025).
  • Multi-Tenancy and Billing: Isolation supports per-tenant quota enforcement, billing, and carbon reporting at the slice level (Vamja et al., 29 Jan 2025).
  • Mixed Mode Usage: MIG can be combined with finer-grained logical partitioning (MPS) within each instance, enabling hierarchical scheduling for further utilization improvement (Saroliya et al., 2024, Lee et al., 2024).
  • Flexible Job Allocation: Software frameworks such as Flex-MIG treat MIG leaves as fixed atomic units, performing job aggregation and collective communication at the software layer, thus eliminating expensive hardware reconfiguration (Kim et al., 12 Nov 2025).

Notably, in cloud production, practical best practices involve quota-based baskets for fairness, controlled migration to avoid sequential disruptions, and periodic rebalancing or maintenance-window compaction (Siavashi et al., 4 Feb 2025, Turkkan et al., 2024).
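
Because each slice carries its own UUID, cluster tooling can enumerate MIG devices directly through the NVML bindings. The sketch below uses the nvidia-ml-py (pynvml) package and assumes a MIG-enabled device and reasonably recent driver/bindings; it illustrates the enumeration path rather than providing production code.

```python
# Enumerate MIG slices and their UUIDs via NVML (nvidia-ml-py / pynvml).
# Illustrative only: assumes a MIG-enabled GPU and recent driver and bindings;
# error handling is kept minimal.

import pynvml

pynvml.nvmlInit()
try:
    for i in range(pynvml.nvmlDeviceGetCount()):
        gpu = pynvml.nvmlDeviceGetHandleByIndex(i)
        try:
            current_mode, _pending = pynvml.nvmlDeviceGetMigMode(gpu)
        except pynvml.NVMLError:
            continue   # device does not support MIG
        if current_mode != pynvml.NVML_DEVICE_MIG_ENABLE:
            continue   # MIG not enabled on this GPU
        for j in range(pynvml.nvmlDeviceGetMaxMigDeviceCount(gpu)):
            try:
                mig = pynvml.nvmlDeviceGetMigDeviceHandleByIndex(gpu, j)
            except pynvml.NVMLError:
                continue   # no MIG device at this index
            # The MIG-prefixed UUID can be passed to CUDA_VISIBLE_DEVICES or
            # advertised as a schedulable resource by a cluster device plugin.
            print(f"GPU {i}, MIG slot {j}: {pynvml.nvmlDeviceGetUUID(mig)}")
finally:
    pynvml.nvmlShutdown()
```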

6. Empirical Impact and Limitations

MIG partitioning yields strong hardware-level isolation between co-located tenants, near-linear throughput scaling with the number of allocated slices, and predictable tail latency that preserves SLOs in multi-tenant inference and training deployments (see Section 2).

Limitations include:

  • Rigid profile menu: Only a small, fixed number of slice sizes and legal placements; no support for arbitrary or fine-grained slicing, nor dynamic resizing of live instances (Martín et al., 12 Jan 2026, Saraha et al., 25 Aug 2025).
  • Fragmentation sensitivity: Prolonged workloads, dynamic arrivals/departures, or ill-tuned partition strategies can result in stranded resources and degraded acceptance rates (Zambianco et al., 24 Nov 2025, Ting et al., 18 Dec 2025).
  • Driver and reconfiguration overheads: Partition changes are heavyweight (requiring drains and device resets), putting practical constraints on real-time agility (Kim et al., 12 Nov 2025).
  • Limited availability on embedded or edge architectures: Currently restricted to select datacenter-class devices (Martín et al., 12 Jan 2026).

7. Future Directions in Research and Deployment

Notable future directions include:

  • Support for finer-granularity slicing: SM- or warp-level partitioning for sub-GPC resource isolation to better fit small AI models (Martín et al., 12 Jan 2026).
  • Online fragmentation prediction and live defragmentation: Incorporate predictive workload duration modeling, machine learning–driven placement, and potentially live-migration or dynamic rebalancing to mitigate fragmentation under adversarial workloads (Ting et al., 18 Dec 2025, Zambianco et al., 24 Nov 2025).
  • Co-optimization with MPS and software aggregation: Realizing hybrid systems that exploit both hardware and software-level partitioning for low-overhead, high-utilization operation (Saroliya et al., 2024, Lee et al., 2024).
  • Dynamic power/energy policy: Integrating adaptive partition selection and per-instance power capping for green computing and fair billing (Vamja et al., 29 Jan 2025, Arima et al., 2024).
  • Cluster-level scheduling with SLO guarantees: Cross-node cluster managers that allocate, migrate, and consolidate MIG resources efficiently under mixed SLOs, possibly incorporating hierarchical and federated resource control (Siavashi et al., 4 Feb 2025, Turkkan et al., 2024).

Empirical research continues to demonstrate that, when combined with intelligent scheduling frameworks and real-time monitoring, MIG partitioning underpins robust, high-utilization multi-tenant GPU clusters for a diverse range of AI and HPC workloads.

