Multi-Instance GPU (MIG) Technology
- Multi-Instance GPU (MIG) is a hardware-based technology that partitions a physical GPU into isolated instances with dedicated compute and memory resources.
- It supports key applications like inference consolidation, cloud VM placement, and serverless LLM serving through fixed, discrete profiles on systems like A100 and H100.
- Challenges include rigid legal profiles, high reconfiguration overhead, and shared resource bottlenecks such as L3 TLB and power management.
Multi-Instance GPU (MIG) is NVIDIA’s hardware-based GPU spatial-sharing technology for partitioning one physical GPU into multiple isolated GPU instances with dedicated compute and memory resources. In the literature, MIG is presented as a mechanism for multi-tenancy, inference consolidation, continuous learning, cloud VM placement, and serverless LLM serving, particularly on A100-, H100-, A30-, and Grace Hopper-class systems. The same literature also shows that MIG is not simply a smaller-GPU abstraction: its fixed profiles, legal-placement constraints, dynamic reconfiguration costs, and residual shared resources turn GPU sharing into a constrained scheduling and systems problem rather than a straightforward partitioning exercise (Li et al., 2022, Villarrubia et al., 24 Apr 2026, Tan et al., 2021).
1. Partitioning model and device abstractions
MIG partitions a physical GPU into hardware-isolated slices and groups those slices into GPU instances. Several papers use closely related terminology. One line of work describes the smallest unit as a slice, bundling a block of compute resources, part of L2 cache, and part of DRAM; several consecutive slices form an instance, and a set of disjoint instances forms a partition (Villarrubia et al., 18 Jul 2025). Another characterizes MIG at the GPC (Graphics Processing Cluster) level, where slices are grouped into independently usable instances and the set of instances forms a partition (Villarrubia et al., 24 Apr 2026). System papers on MIG-enabled clouds and inference servers also distinguish a GPU instance (GI) from a compute instance (CI): a GI combines a portion of SMs, HBM capacity, L2 cache, copy engines, and access paths to memory controllers, while a CI is created on top of a GI and is the object on which users actually run workloads (Schieffer et al., 9 Apr 2026, Ting et al., 18 Dec 2025).
The profile notation encodes the coupled compute-memory allocation. On the A100 40GB, the smallest instance reported in multiple studies is 1g.5gb, corresponding to 1 compute slice + 1 memory slice, 14 SMs, and 5 GB memory (Robroek et al., 2022). For the A100-80GB, the profile set includes 1g.10gb, 1g.20gb, 2g.20gb, 3g.40gb, 4g.40gb, and 7g.80gb, with memory fractions from 1/8 to 8/8 and SM fractions from 1/7 to 7/7 (Vamja et al., 29 Jan 2025). On Grace Hopper H100 systems with 96 GB HBM3 and 132 SMs, reported profiles include 1g.12gb, 1g.24gb, 2g.24gb, 3g.48gb, 4g.48gb, and 7g.96gb; the 1g.12gb profile is reported as 16 SMs and 11 GiB usable memory (Schieffer et al., 9 Apr 2026).
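The profile notation can be unpacked mechanically. The following is a minimal sketch, assuming the `<N>g.<M>gb` naming shown above and an A100-80GB with 7 compute slices and 80 GB of memory; the helper names `parse_profile` and `resource_shares` are illustrative and not part of any NVIDIA tool.

```python
import re
from fractions import Fraction

# Profile names in the text follow the pattern "<compute slices>g.<memory in GB>gb".
PROFILE_RE = re.compile(r"^(\d+)g\.(\d+)gb$")


def parse_profile(name):
    """Split a profile name such as '3g.40gb' into (compute_slices, memory_gb)."""
    match = PROFILE_RE.match(name)
    if match is None:
        raise ValueError(f"not a MIG profile name: {name!r}")
    return int(match.group(1)), int(match.group(2))


def resource_shares(name, total_compute_slices=7, total_memory_gb=80):
    """Return (compute_share, memory_share) of the whole GPU as exact fractions.

    The defaults assume an A100-80GB; pass other totals for other devices.
    """
    compute_slices, memory_gb = parse_profile(name)
    return (
        Fraction(compute_slices, total_compute_slices),
        Fraction(memory_gb, total_memory_gb),
    )


if __name__ == "__main__":
    for profile in ["1g.10gb", "1g.20gb", "2g.20gb", "3g.40gb", "4g.40gb", "7g.80gb"]:
        compute_share, memory_share = resource_shares(profile)
        print(f"{profile}: {compute_share} of compute, {memory_share} of memory")
```

Using exact fractions makes the coupling visible: compute is allocated in sevenths while memory is allocated in eighths, so the two shares of a profile rarely match.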
This resource model is the basis for the frequent description of each MIG slice as a small exclusive GPU. PREBA states that each MIG slice can be handed to a separate VM or inference server with performance isolation via SR-IOV, while MISO emphasizes that MIG partitions not only SMs but also GPU memory, cache, memory bandwidth, and error domains (Yeo et al., 2024, Li et al., 2022). The common implication is that MIG exposes a hardware-enforced spatial resource boundary rather than a software-only admission policy.
2. Isolation semantics and remaining shared bottlenecks
The principal technical appeal of MIG is strong isolation. Comparative studies state that MIG physically isolates compute and memory resources, including SMs, L2 cache, DRAM slices, bandwidth via dedicated interconnect paths, memory controllers, DRAM address buses, and other on-chip memory-system resources (Villarrubia et al., 24 Apr 2026, Li et al., 2024). This is the central distinction from MPS, which is consistently described as software-mediated sharing: multiple processes connect to an MPS daemon, share a single CUDA context, and mainly receive SM provisioning, while the memory hierarchy remains shared (Villarrubia et al., 24 Apr 2026). MISO makes the same point in stronger systems terms, arguing that MIG is more powerful than MPS for multi-tenant isolation because it partitions not only SMs but also GPU memory, cache, memory bandwidth, and error isolation between applications (Li et al., 2022).
The literature is equally explicit that this isolation is incomplete. The most detailed architectural critique concerns address translation: while the L1 and L2 TLBs are partitioned along instance boundaries, the L3 TLB is shared across all MIG instances, which creates interference, page-table-walk amplification, and underutilization of the 16 sub-entries contained in each NVIDIA L3 TLB entry (Li et al., 2024). Other works identify PCIe bandwidth and the last-level TLB as shared contention points in online schedulers (Ting et al., 18 Dec 2025). On Grace Hopper systems, MIG also leaves the GPU power budget and the NVLink-C2C CPU-GPU interconnect shared, so performance isolation can still be perturbed by power throttling or C2C contention (Schieffer et al., 9 Apr 2026, Luo et al., 19 May 2026).
These residual bottlenecks have measurable effects. The STAR TLB design reports 28.8% average L3 TLB hit-rate improvement, 31.4% average sub-entry utilization improvement, and 30.2% average performance improvement across multi-tenant MIG workloads by dynamically sharing TLB entries across base addresses (Li et al., 2024). A comprehensive MPS–MIG comparison finds that MPS can improve performance by up to 30% and reduce energy by about 20% in favorable cases, but can worsen performance by around 30% under memory contention; MIG, by contrast, provides more consistent improvement because full hardware isolation resolves memory contention, although its gains are tempered by higher overhead and rigid partitioning (Villarrubia et al., 24 Apr 2026). A plausible implication is that MIG should be understood as strong-but-not-total isolation: it suppresses major cache and DRAM interference modes, but leaves a smaller set of shared microarchitectural and platform resources that can still dominate specific workload pairs.
3. Legal profiles, reconfiguration, and fragmentation
MIG’s rigidity is as central to the literature as its isolation. The legal resource sizes are discrete rather than continuous. On the A100 40GB, MISO states that the smallest slice is 1g.5gb, equal to 1/7 of the SMs and 5GB memory, and that the device has only 18 valid MIG configurations (Li et al., 2022). Other scheduling work on A100/H100 enumerates 19 valid partitions or 19 possible partitions, while A30 is reported as having 4 slices and only 5 partitions (Villarrubia et al., 18 Jul 2025, Villarrubia et al., 24 Apr 2026). This suggests different counting conventions for the legal configuration space rather than a disagreement about its rigidity.
Placement legality is more restrictive than simple capacity arithmetic. A100 profile placement is tied to allowed starting indices: 7g.80gb starts at [0], 4g.40gb at [0], 3g.40gb at [4,0], 2g.20gb at [4,0,2], 1g.20gb at [6,4,0,2], and 1g.10gb at [6,4,5,0,1,2,3] (Turkkan et al., 2024). Related fragmentation studies restate the same structural property on A100 by listing valid starting locations for 7g.40gb, 4g.20gb, 3g.20gb, 2g.10gb, 1g.10gb, and 1g.5gb, and explicitly note that a GPU may contain enough contiguous free slices for a request and still fail to create the requested instance because the free block is at an invalid starting index (Ting et al., 18 Dec 2025, Zambianco et al., 24 Nov 2025).
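The start-index rule can be illustrated with a small placement check. This is a sketch under stated assumptions: the start indices are the ones listed above for the A100-80GB, but the slot widths in `WIDTH` and the 8-slot layout in `NUM_SLOTS` are assumptions made for illustration and are not given in the text. The function name `free_starts` is hypothetical.

```python
# Start indices as listed in the text for the A100-80GB (Turkkan et al., 2024).
VALID_STARTS = {
    "7g.80gb": [0],
    "4g.40gb": [0],
    "3g.40gb": [4, 0],
    "2g.20gb": [4, 0, 2],
    "1g.20gb": [6, 4, 0, 2],
    "1g.10gb": [6, 4, 5, 0, 1, 2, 3],
}

# ASSUMPTION: number of slots each profile occupies on an 8-slot layout.
# These widths are illustrative and are not stated in the text.
WIDTH = {
    "7g.80gb": 8,
    "4g.40gb": 4,
    "3g.40gb": 4,
    "2g.20gb": 2,
    "1g.20gb": 2,
    "1g.10gb": 1,
}
NUM_SLOTS = 8


def free_starts(profile, occupied):
    """Return the legal start indices at which `profile` still fits.

    `occupied` is a set of slot indices already in use. Results keep the
    order in which the start indices are listed in VALID_STARTS.
    """
    width = WIDTH[profile]
    result = []
    for start in VALID_STARTS[profile]:
        slots = range(start, start + width)
        if start + width <= NUM_SLOTS and not any(s in occupied for s in slots):
            result.append(start)
    return result


if __name__ == "__main__":
    occupied = {0, 3, 4}  # e.g. three 1g.10gb instances at starts 0, 3 and 4
    # Slots 1-2 and 5-6 are free and contiguous, yet no legal start fits 2g.20gb.
    print("2g.20gb:", free_starts("2g.20gb", occupied))
    print("1g.10gb:", free_starts("1g.10gb", occupied))
```

In the example, two free slots sit side by side in two places, but both pairs begin at an index that is not a legal start for `2g.20gb`, so the request fails even though capacity arithmetic says it should fit.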
Reconfiguration is therefore useful but operationally costly. Comparative work states that idle instances can be destroyed and recreated without disrupting other running instances, and that reconfiguration is transparent to unmodified instances (Villarrubia et al., 24 Apr 2026). However, MISO argues that direct exploration of candidate partitions is too expensive because reconfiguring MIG may require slices to be idle, which means stopping jobs, resetting the GPU, and checkpoint/restarting workloads; in that study, MIG-based profiling can incur up to 8× more overhead than MPS-mode profiling (Li et al., 2022). Flex-MIG reports an even stronger operational cost in Kubernetes environments, stating that drain-required reconfiguration can take roughly 100–120 seconds end-to-end (Kim et al., 12 Nov 2025).
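One way to reason about this cost is a break-even check. The sketch below assumes a simple model that does not come from any of the cited papers: a job with `remaining_s` seconds of work left would run `speedup` times faster on the new partition, and reconfiguration pauses it for `reconfig_cost_s` seconds. The default of 120 seconds is the upper end of the range reported for Kubernetes environments; the function name is hypothetical.

```python
def reconfiguration_pays_off(remaining_s, speedup, reconfig_cost_s=120.0):
    """Return True if repartitioning finishes the job sooner than staying put.

    ASSUMPTION: the job is paused for the whole reconfiguration and then runs
    `speedup` times faster for the rest of its work. Real systems also pay
    checkpoint/restart costs, which this sketch folds into reconfig_cost_s.
    """
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    time_if_reconfigured = remaining_s / speedup + reconfig_cost_s
    return time_if_reconfigured < remaining_s


if __name__ == "__main__":
    # With a 1.5x speedup and a 120 s pause, the break-even point is 360 s of
    # remaining work: 360 / 1.5 + 120 == 360.
    for remaining in (300.0, 600.0):
        print(remaining, reconfiguration_pays_off(remaining, speedup=1.5))
```

Under this model, short jobs never justify a drain-required reconfiguration, which is consistent with the emphasis the schedulers discussed here place on avoiding or amortizing it.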
This legality model produces a distinct notion of fragmentation. ParvaGPU separates GPU internal slack, where an allocated partition is larger than the workload needs, from GPU external fragmentation, where remaining free MIG slots are scattered and cannot accommodate future larger requests (Lee et al., 2024). Fragmentation-aware schedulers refine the definition further for MIG: external fragmentation can occur even when free resources are contiguous, because the requested profile cannot be placed at a valid starting index (Ting et al., 18 Dec 2025, Zambianco et al., 24 Nov 2025). One online scheduler formalizes this with a per-GPU fragmentation metric, interpreting the result as the average unavailability of MIG instances on a GPU (Ting et al., 18 Dec 2025). Another cloud scheduler introduces Minimum Fragmentation Increment (MFI) and reports about a 10% average improvement in the number of scheduled workloads in heavy load conditions, while using approximately the same number of GPUs as benchmark methods (Zambianco et al., 24 Nov 2025). GRMU, an ILP-derived VM placement framework with intra-GPU defragmentation and inter-GPU consolidation, reports 22% higher acceptance, 17% lower active hardware usage, and migration for only 1% of MIG-enabled VMs on an Alibaba trace (Siavashi et al., 4 Feb 2025).
4. Scheduling, optimization, and dynamic control
Because the profile space is discrete and the performance of workloads across slices is not monotone in the usual multicore sense, MIG scheduling is typically formulated as a joint partition-selection and placement problem. MISO expresses this directly: for a set of co-located jobs, it chooses a valid MIG partition together with an assignment of jobs to its slices, and solves the resulting selection problem over the valid configurations. Its distinguishing systems idea is to avoid brute-force exploration of MIG partitions by using MPS as a proxy: the predictor is an autoencoder-style U-Net model that consumes a 3 × 7 matrix of MPS profiling measurements at the 100%, 50%, and 14% active-SM levels and predicts a 3 × 7 matrix for the main MIG slices 7g, 4g, and 3g, with 2g and 1g inferred afterward by linear regression. The optimizer then enumerates valid configurations in at most 0.5 ms. On the reported workloads, MISO achieves 49% lower average job completion time than an unpartitioned GPU scheme, 16% lower average JCT than the optimal static partition scheme, 23% lower makespan, and 35% higher system throughput, while staying within 10% of Oracle on key metrics (Li et al., 2022).
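The enumeration step can be sketched as an exhaustive search over a small configuration list. This is not MISO's implementation: the objective (sum of predicted relative throughputs), the partition list `CANDIDATE_PARTITIONS`, and the function name `best_assignment` are all assumptions made for illustration, and the predicted values would come from a model such as the one described above.

```python
from itertools import permutations

# HYPOTHETICAL subset of partitions, written as slice sizes in compute slices.
# A real optimizer would enumerate the device's full list of valid configurations.
CANDIDATE_PARTITIONS = [(7,), (4, 3), (4, 2, 1), (3, 2, 2), (2, 2, 2, 1)]


def best_assignment(predicted, partitions=CANDIDATE_PARTITIONS):
    """Pick the partition and job-to-slice mapping with the highest total score.

    predicted[job][size] is the predicted relative throughput of `job` on a
    slice with `size` compute slices. ASSUMPTION: the objective is the plain
    sum of these values. Slices left without a job stay idle.
    Returns (score, partition, {job: slice_size}) or None if nothing fits.
    """
    jobs = list(predicted)
    best = None
    for partition in partitions:
        if len(partition) < len(jobs):
            continue  # not enough slices for one slice per job
        # sorted(set(...)) removes duplicate orderings and keeps ties deterministic.
        for slices in sorted(set(permutations(partition, len(jobs)))):
            score = sum(predicted[job][size] for job, size in zip(jobs, slices))
            if best is None or score > best[0]:
                best = (score, partition, dict(zip(jobs, slices)))
    return best


if __name__ == "__main__":
    # Made-up predictions: job A scales with slice size, job B barely does.
    predicted = {
        "A": {7: 1.0, 4: 0.9, 3: 0.8, 2: 0.5, 1: 0.3},
        "B": {7: 1.0, 4: 0.6, 3: 0.55, 2: 0.5, 1: 0.45},
    }
    print(best_assignment(predicted))
```

Because the configuration space is small, brute force is cheap; the expensive part in practice is obtaining the per-slice predictions, which is the problem the MPS-based proxy addresses.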
Other work explores different objective functions. MIGRator targets multi-tenant continuous learning and formulates dynamic instance selection as an Integer Linear Programming (ILP) problem that maximizes Goodput, combining inference SLO attainment and model accuracy. In the reported evaluation, it outperforms Ekya, Astraea, and PARIS by 17%, 21%, and 20%, respectively (Wang et al., 2024). SMART-MIG uses Mean-Field Multi-Agent Reinforcement Learning (MF-MARL) for large-scale repartitioning and heuristic scheduling based on EDF, MET, and CEDF; it claims constant repartitioning complexity as the number of jobs and GPUs grows, improves energy–tardiness efficiency by 18% relative to its static-partitioning counterpart, and is only 27% above the theoretical lower bound on energy consumption (Yu et al., 29 Jun 2026). A related single-MIG framework based on Deep Q-Learning (DQN) combines dynamic repartitioning with a restricted EDF-SS scheduler and reports improvements of 26% over twice-daily repartitioning, 31% over static partitioning, and 68% over no partitioning according to a combined energy–tardiness objective (Lipe et al., 23 Jun 2026).
Offline batch scheduling poses a different problem. FAR treats MIG as a moldable-task platform with dynamic reconfigurations. It rejects the standard work-monotonicity assumption because kernels on MIG can superscale owing to isolated memory bandwidth, then derives an approximation factor of 7/4 on A30 and 2 on A100/H100. In real experiments including reconfiguration cost, it reports a makespan with respect to the optimum no worse than 1.22× for a benchmark suite and 1.10× for synthetic inputs, outperforming both static partitioning and prior MIG scheduling proposals (Villarrubia et al., 18 Jul 2025). Collectively, these schedulers show that MIG’s practical value depends less on the mere existence of hardware partitions than on whether the controller can select profiles, placements, and reconfiguration times under discrete legality constraints.
5. Workload-specific deployments
Inference serving is the most extensively optimized MIG application domain. PREBA studies MIG-based AI inference servers and reports that once an A100 is partitioned into fine-grained slices, CPU-side preprocessing becomes the dominant bottleneck: enabling preprocessing causes a 75.6% throughput drop in a 1g.5gb(7x) configuration, and for CitriNet the preprocessing path alone would require 393 CPU cores to sustain the model-execution throughput of a single 1g.5gb(7x) A100 server. PREBA addresses this with an FPGA-based Data Processing Unit (DPU) attached over PCIe and a dynamic batching system keyed by the throughput–tail-latency knee of each model and MIG configuration. The end-to-end result is 3.7× higher throughput, 3.4× lower tail latency, 3.5× better energy efficiency, and 3.0× better cost efficiency (Yeo et al., 2024). ParvaGPU pursues a different composition, using MIG for cross-workload isolation and MPS inside each MIG slice for homogeneous intra-workload sharing; it reports no SLO violations, 46.5% less GPU usage than gpulet, 34.6% less than iGniter, and 41.0% less than MIG-serving, while achieving 3–5% internal slack and completely eliminating external fragmentation in the evaluated scenarios (Lee et al., 2024). Earlier work on MIG-serving formulates DNN serving as a Reconfigurable Machine Scheduling Problem and reports that an optimizer pipeline combining greedy search, GA, and MCTS can save up to 40% of GPUs relative to using A100 as-is while providing the same throughput (Tan et al., 2021).
Training workloads present a more qualified picture. A study of deep learning collocation on A100 reports that collocating multiple model-training runs can yield up to four times training throughput despite increased epoch time, but concludes that MPS consistently outperforms both naive collocation and MIG for a single user, with up to 80% higher throughput than naive collocation and 40% higher throughput than MIG (Robroek et al., 2022). MIGPerf reaches a similar systems conclusion from a benchmarking angle: MIG is useful for both training and inference characterization and for hybrid deployment, but framework compatibility is immature, with tested training and serving frameworks able to detect and use MIG 0 while reporting No device or Device not found for the second GI in a two-GI setup (Zhang et al., 2023). These results counter a common simplification that MIG is uniformly superior to software sharing: for small and contentious inference services, isolation and tail-latency stability can dominate, whereas for single-user training MPS often retains the flexibility advantage.
Recent LLM-serving work on Grace Hopper changes the memory premise of MIG rather than its isolation model. C2CServe treats MIG as the execution and accounting unit but uses NVLink-C2C to keep model weights in CPU memory and stream them directly into MIG instances through zero-copy unified addressing. This allows a MIG slice to switch models across requests without reloading weights into HBM. The system combines bandwidth-aware model placement, chunking, and a heterogeneous-memory HybridGEMM kernel that interpolates between output-stationary and weight-stationary dataflows through a single tuning knob. On GH200, C2CServe reports cold-start latency reductions of up to 7.1× for dense models and 4.6× for MoE models, while maintaining over 95% TTFT and TPOT attainment under C2C contention (Luo et al., 19 May 2026). This suggests that some current limitations of small MIG slices are increasingly memory-hierarchy limitations rather than purely compute-allocation limitations.
6. Power attribution, efficiency, and evolving operational models
MIG complicates observability as much as scheduling. Power is the clearest example. A partition-level power-accounting study notes that when a GPU is in MIG mode, utilization metrics are reported per partition, but power is only available for the full GPU. On an A100-80GB, idle power is reported as approximately 85 W at frequencies above 1200 MHz, versus a peak of around 400–450 W. The paper finds that a single generic offline model does not transfer well across diverse workloads, especially with concurrent MIG usage, and that online models using MIG-level features from concurrently running partitions are more reliable for fair attribution (Vamja et al., 29 Jan 2025). This is directly relevant to billing and carbon reporting, because per-partition isolation at the execution level does not imply per-partition measurability at the power level.
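The attribution problem can be made concrete with a naive baseline. The sketch below is not the model from the cited study, which finds that simple generic models transfer poorly; it only shows what has to be decided when one whole-GPU power reading is split across partitions. The splitting rule and the function name `attribute_power` are assumptions, and the 85 W idle figure is the approximate A100-80GB value quoted above.

```python
def attribute_power(total_power_w, idle_power_w, partitions):
    """Split one whole-GPU power reading across MIG partitions.

    partitions maps a partition name to (compute_slices, utilization), with
    utilization in [0, 1].

    ASSUMED rule, for illustration only:
      * idle power is divided in proportion to compute slices held;
      * the remaining (dynamic) power is divided in proportion to
        compute_slices * utilization.
    """
    total_slices = sum(slices for slices, _ in partitions.values())
    weights = {name: slices * util for name, (slices, util) in partitions.items()}
    total_weight = sum(weights.values())
    dynamic_w = max(total_power_w - idle_power_w, 0.0)

    shares = {}
    for name, (slices, _) in partitions.items():
        idle_share = idle_power_w * slices / total_slices
        dynamic_share = dynamic_w * weights[name] / total_weight if total_weight > 0 else 0.0
        shares[name] = idle_share + dynamic_share
    return shares


if __name__ == "__main__":
    # A 3g partition running flat out next to a mostly idle 4g partition.
    reading = attribute_power(
        total_power_w=285.0,
        idle_power_w=85.0,
        partitions={"3g-busy": (3, 1.0), "4g-quiet": (4, 0.25)},
    )
    for name, watts in reading.items():
        print(f"{name}: {watts:.1f} W")
```

Even this toy rule embeds policy choices, such as who pays for idle power, which is why the attribution question matters for billing and carbon reporting.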
System-level efficiency studies reinforce the same theme. On Grace Hopper, static partitioning with MIG can improve throughput and energy relative to serial execution: running seven copies concurrently yields an average throughput improvement of about 1.4×, with NekRS reaching 2.4× and FAISS 2.5×, and an average energy reduction of 26%; the MIG 7×1g configuration reduces energy to about 63% of the serial baseline on average (Schieffer et al., 9 Apr 2026). Yet that work also shows that MIG does not partition the power system, so co-running 1g instances can collectively exceed the 700 W power cap and trigger frequency drops, and that NVLink-C2C remains shared even when HBM and SMs are partitioned. To bridge the mismatch between coarse slice sizes and application memory footprints, the same study proposes memory offloading over cache-coherent NVLink-C2C, reporting up to 450 GB/s per direction for the interconnect and direct GPU access to CPU memory of up to about 338 GB/s D2H and 348 GB/s H2D (Schieffer et al., 9 Apr 2026).
A more radical response is to retain MIG’s hardware isolation while abandoning the conventional one-job-per-instance operational model. Flex-MIG argues that the usual one-to-one allocation model amplifies over-provisioning, internal and external fragmentation, and drain-required reconfiguration. It therefore adopts a one-to-many model in which one job can span multiple MIG instances on the same host, using modified NCCL peer discovery and host shared-memory collectives. On the reported traces, Flex-MIG lowers average waiting time by about 11% relative to Dynamic-MIG, improves makespan by up to 17%, and avoids the repeated reconfiguration cost that conventional one-to-one schedulers incur (Kim et al., 12 Nov 2025). This does not weaken MIG isolation; rather, it shifts composition from the hardware allocator into the orchestration and runtime layers.
Taken together, the literature portrays MIG as a mature hardware primitive whose effectiveness depends on surrounding control logic. It is valuable because it provides hard spatial isolation, predictable slices, and a viable substrate for cloud multi-tenancy. It is difficult because legal partitions are few, start indices matter, reconfiguration is not free, several critical resources remain shared, and whole-GPU measurements do not decompose cleanly to slices. The resulting research agenda has therefore converged on a broader definition of “using MIG”: not merely creating instances, but profiling them, selecting them online, accounting for them, and, increasingly, composing them with C2C memory extension, FPGA preprocessing, or software-level aggregation across MIG instances (Li et al., 2022, Luo et al., 19 May 2026, Kim et al., 12 Nov 2025).