
Streaming Multiprocessors (SMs) Overview

Updated 30 December 2025
  • Streaming Multiprocessors (SMs) are parallel execution units in GPUs that manage registers, shared memory, and thread scheduling to maximize concurrency.
  • They implement resource sharing techniques, such as register and scratchpad sharing, to mitigate fragmentation and enhance occupancy.
  • Analytical models like the Many-core Machine Model (MMM) guide performance optimization by balancing resource constraints and execution speed.

Streaming Multiprocessors (SMs) are the fundamental parallel execution units in modern Graphics Processing Units (GPUs), responsible for orchestrating thousands of concurrent threads and managing hardware resources that enable high-throughput computation. SMs allocate hardware resources such as registers and on-chip shared memory to thread blocks, determine execution occupancy, and mediate latency hiding. Analytical models and resource management strategies for SMs are central to achieving efficient execution of parallel kernels on GPUs. This article provides a comprehensive overview of SM architecture, resource constraints, management methodologies, and their impact on algorithmic performance, with an emphasis on resource sharing and analytic modeling as detailed in (Jatala et al., 2015) and (Haque et al., 2014).

1. SM Hardware Resources and Occupancy Constraints

Each SM encapsulates substantial hardware resources designed to support massive thread concurrency:

  • Registers ($R_{\text{SM}}$): For the evaluated GTX-class device, each SM exposes 32,768 32-bit registers.
  • Scratchpad (Shared Memory, $M_{\text{SM}}$): Each SM provides 16,384 bytes of shared memory.
  • Thread and Block Limits: An SM supports up to 8 resident thread blocks ($\text{CTAMax} = 8$) and a maximum of 1,536 resident threads ($T_{\text{Max}} = 1{,}536$).

Resource allocation is dictated at the thread block granularity. Given a kernel configuration ($B$ threads per block, $R_{\text{perThread}}$ registers per thread, $M_{\text{perBlock}}$ bytes of shared memory per block), the resource constraints yield upper bounds on block residency:

  • By registers: $\text{BlocksReg} = \lfloor R_{\text{SM}} / (R_{\text{perThread}} \times B) \rfloor$
  • By shared memory: $\text{BlocksSM} = \lfloor M_{\text{SM}} / M_{\text{perBlock}} \rfloor$
  • By threads: $\text{BlocksThr} = \lfloor T_{\text{Max}} / B \rfloor$
  • By architecture: $\text{BlocksArch} = \text{CTAMax}$

The number of resident blocks per SM is thus:

$$\text{BlocksResident} = \min(\text{BlocksReg}, \text{BlocksSM}, \text{BlocksThr}, \text{BlocksArch})$$

Occupancy is expressed as the fraction of threads actually resident over the hardware maximum:

$$\text{Occupancy} = \frac{\min(\lfloor R_{\text{SM}} / (R_{\text{perThread}} \times B) \rfloor, \; \lfloor M_{\text{SM}} / M_{\text{perBlock}} \rfloor, \; \lfloor T_{\text{Max}} / B \rfloor, \; \text{CTAMax}) \times B}{T_{\text{Max}}} = \frac{\text{BlocksResident} \times B}{T_{\text{Max}}}$$

When only register and memory limits are significant:

$$n_{\text{CTA}} = \min(\lfloor R_{\text{SM}} / R_{\text{tb}} \rfloor, \; \lfloor M_{\text{SM}} / M_{\text{tb}} \rfloor)$$

$$\text{Occupancy} = \frac{n_{\text{CTA}} \times B}{T_{\text{Max}}}$$

with $R_{\text{tb}} = R_{\text{perThread}} \times B$ and $M_{\text{tb}} = M_{\text{perBlock}}$.
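
The residency and occupancy arithmetic above is mechanical and easy to script. The following minimal Python sketch assumes the Fermi-era limits quoted in Section 1; all function and variable names are illustrative, not from any vendor API:

```python
# Minimal sketch of the Section 1 residency/occupancy arithmetic, assuming the
# Fermi-era limits quoted above (illustrative names, not a vendor API).
R_SM, M_SM, T_MAX, CTA_MAX = 32768, 16384, 1536, 8

def resident_blocks(B, regs_per_thread, smem_per_block):
    """Resident thread blocks per SM for one launch configuration."""
    blocks_reg = R_SM // (regs_per_thread * B)                    # register limit
    blocks_smem = M_SM // smem_per_block if smem_per_block else CTA_MAX
    blocks_thr = T_MAX // B                                       # thread limit
    return min(blocks_reg, blocks_smem, blocks_thr, CTA_MAX)     # + architecture

def occupancy(B, regs_per_thread, smem_per_block):
    return resident_blocks(B, regs_per_thread, smem_per_block) * B / T_MAX

# Example: 256 threads/block, 36 registers/thread, 7,200 B shared memory/block
print(resident_blocks(256, 36, 7200))     # -> 2 (shared-memory limited)
print(f"{occupancy(256, 36, 7200):.2f}")  # -> 0.33
```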

2. Resource Fragmentation and Wastage

Default allocation at the thread block granularity often leads to significant resource wastage. Unfilled registers and shared memory slices cannot be used by additional blocks, resulting in stranded capacity:

  • Register Example: A kernel requiring 36 registers/thread and 256 threads/block ($R_{\text{tb}} = 9{,}216$), with $R_{\text{SM}} = 32{,}768$, yields $\text{BlocksReg} = 3$ blocks, consuming 27,648 registers and leaving 5,120 registers unused (~15.6% wastage).
  • Shared Memory Example: For $M_{\text{tb}} = 7{,}200$ bytes and $M_{\text{SM}} = 16{,}384$, $\text{BlocksSM} = 2$ blocks use 14,400 bytes, leaving 1,984 bytes unused (~12.1% wastage).

Empirically, wastage reaches 10–40% of SM register or shared memory capacity for many kernels (Jatala et al., 2015).
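
For concreteness, the wastage figures above can be reproduced with a few lines of arithmetic; this snippet simply re-derives the worked examples rather than querying any real device:

```python
# Re-derives the two wastage examples above (values from the text, not a tool).
R_SM, M_SM = 32768, 16384

# Register example: 36 regs/thread x 256 threads/block
R_tb = 36 * 256                          # 9,216 registers per block
blocks = R_SM // R_tb                    # 3 resident blocks
wasted = R_SM - blocks * R_tb            # 5,120 stranded registers
print(wasted, f"{wasted / R_SM:.1%}")    # -> 5120 15.6%

# Shared-memory example: 7,200 B per block
M_tb = 7200
blocks = M_SM // M_tb                    # 2 resident blocks
wasted = M_SM - blocks * M_tb            # 1,984 stranded bytes
print(wasted, f"{wasted / M_SM:.1%}")    # -> 1984 12.1%
```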

3. Resource Sharing Mechanisms in SMs

Pairwise resource sharing addresses fragmentation by pooling resources between block pairs, controlled by a global threshold $t \in (0, 1]$ that sets the private/pooled split:

Register Sharing

  • Two blocks requiring $R_{\text{tb}}$ registers each share $R_{\text{tb}}(1 + t) < 2R_{\text{tb}}$ registers.
  • Each block receives $t \cdot R_{\text{tb}}$ private registers; the remaining $R_{\text{tb}}(1 - t)$ are pooled and accessed exclusively (owner/non-owner protocol).
  • Hardware additions: a per-warp-pair lock, a comparator for register access routing, and warp scheduling that prioritizes the owner's progress.

Scratchpad (Shared Memory) Sharing

  • A block pair shares $(1 + t) M_{\text{tb}}$ bytes of memory; each block holds $t \cdot M_{\text{tb}}$ privately, with the rest pooled under a lock (the split arithmetic is sketched below).
  • Only the owner block accesses the pooled range; non-owner attempts busy-wait until release.
  • Hardware: a per-block-pair lock in the memory controller and a comparator for address routing.
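
Both sharing schemes follow the same split arithmetic. A small illustrative helper (not the paper's implementation) makes the private/pooled partition explicit:

```python
# Private/pooled split for one block pair under threshold t, per the formulas
# above. Illustrative helper only; not code from (Jatala et al., 2015).
def pair_split(per_block, t):
    """Return (total pair allocation, private share per block, pooled share)."""
    total = (1 + t) * per_block       # vs. 2 * per_block without sharing
    private = t * per_block           # each block's exclusive slice
    pooled = (1 - t) * per_block      # lock-protected, owner-exclusive slice
    return total, private, pooled

# Scratchpad example: M_tb = 7,200 B, t = 0.5
print(pair_split(7200, 0.5))   # -> (10800.0, 3600.0, 3600.0): saves 3,600 B/pair
```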

Resident Block Count Under Resource Sharing

For registers (capacity $R$, per-block demand $R_{\text{tb}}$, threshold $t$):

  • Without sharing: $U_0 = \lfloor R / R_{\text{tb}} \rfloor$ unshared blocks.
  • With sharing: partition the allocation into $U$ unshared blocks and $S$ shared pairs, with constraints:
    • $U + S = \lfloor R / R_{\text{tb}} \rfloor$
    • $U \cdot R_{\text{tb}} + S \cdot R_{\text{tb}}(1 + t) \leq R$

Because each shared pair adds one extra resident block, sharing raises residency from $U_0$ to $U + 2S$ blocks whenever the leftover budget admits $S \geq 1$ pairs, reclaiming stranded resources at the cost of possible lock contention; a toy solver is sketched below.
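
The following toy enumeration, under the two constraints stated above, shows how sharing converts stranded registers into an extra resident block. It is a sketch of the allocation arithmetic, not the hardware's actual logic:

```python
# Toy enumeration of (U, S): hold U + S at floor(R / R_tb) allocation units and
# spend leftover registers on shared pairs. Register case; scratchpad analogous.
def resident_with_sharing(R, R_tb, t):
    units = R // R_tb                            # U + S allocation units
    best = (units, 0)                            # (U, S) without sharing
    for S in range(units + 1):
        U = units - S
        if U * R_tb + S * R_tb * (1 + t) <= R:   # register budget constraint
            if U + 2 * S > best[0] + 2 * best[1]:
                best = (U, S)
    U, S = best
    return U, S, U + 2 * S                       # total resident blocks

# Register example from Section 2: R = 32,768, R_tb = 9,216, t = 0.5
print(resident_with_sharing(32768, 9216, 0.5))   # -> (2, 1, 4): 4 blocks vs. 3
```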

4. Latency Hiding and Execution Scheduling

Resource sharing is complemented by scheduling optimizations that mask execution latencies and reduce stall cycles. Owner/non-owner scheduling ensures one block completes its shared-region accesses before the other enters, preserving exclusive access and maximizing progress per unit of pooled resource (Jatala et al., 2015). Per-pair locks, arbitration logic, and comparators within the SM manage access efficiently and reduce idle periods.
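
The owner/non-owner discipline is implemented with per-pair hardware locks. As a software analogy only, the sketch below mimics the protocol with Python threads; the class, block IDs, and method names are invented for illustration:

```python
# Software analogy of the per-pair lock guarding one pooled region.
# Real SMs do this with hardware locks and comparators, not OS threads.
import threading

class PooledRegion:
    def __init__(self):
        self._lock = threading.Lock()
        self.owner = None            # block currently holding the pooled range

    def acquire(self, block_id):
        self._lock.acquire()         # non-owner busy-waits (blocks) here
        self.owner = block_id        # first acquirer becomes the owner

    def release(self, block_id):
        assert self.owner == block_id, "only the owner may release"
        self.owner = None
        self._lock.release()

pool = PooledRegion()

def run_block(block_id):
    pool.acquire(block_id)           # exclusive access to the pooled region
    print(f"block {block_id} uses the pooled region")
    pool.release(block_id)

# Two paired "blocks" contend for the pooled region; one always waits.
threads = [threading.Thread(target=run_block, args=(i,)) for i in (0, 1)]
for th in threads:
    th.start()
for th in threads:
    th.join()
```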

5. Analytical Modeling: The Many-core Machine Model (MMM)

The MMM (Haque et al., 2014) abstracts GPU SMs to formalize parallel execution and overheads:

  • SMs: An unbounded number of identical units, each with $Z$ words of local memory and SIMD-style cores.
  • Program DAG: SIMD kernels $\mathcal{K}$ arranged in a precedence-enforcing DAG, each decomposed into thread blocks $\mathcal{B}$.
  • Scheduling: Each block is scheduled atomically onto any SM, with intra-block threads sharing that SM's local memory.

Key performance measures:

  • Work $W(B)$: Sum of local operations over the block's threads.
  • Span $S(B)$: Maximum local operation count of any single thread.
  • Overhead $O(B) = (\alpha + \beta) U$: Weighted cost of the block's $U$ global memory transfers, with machine parameters $\alpha$ and $\beta$.
  • Aggregated $W(\mathcal{P})$, $S(\mathcal{P})$, $O(\mathcal{P})$ over all kernels of the program $\mathcal{P}$.

A Graham-Brent-style theorem provides an execution time bound on $P$ SMs:

$$T_P \leq \left( \frac{N(\mathcal{P})}{P} + L(\mathcal{P}) \right) C(\mathcal{P})$$

where $N(\mathcal{P})$ is the total number of thread blocks, $L(\mathcal{P})$ is the length of the longest dependency path among blocks, and $C(\mathcal{P}) = \max_{B \in \mathcal{B}(\mathcal{P})} [S(B) + O(B)]$.

This model accurately predicts speedup and guides the tuning of kernel and block parameters (e.g., per-block arithmetic intensity, tile/block sizes) to minimize $O(\mathcal{P})$, ideally trading a modest increase in $W$ or $S$ for a sharp reduction in memory-transfer overhead.
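
As a numeric illustration of the bound, the sketch below evaluates $T_P \leq (N(\mathcal{P})/P + L(\mathcal{P}))\,C(\mathcal{P})$ for made-up program parameters; all values are hypothetical placeholders, not measurements from (Haque et al., 2014):

```python
# Numeric sketch of the Graham-Brent-style bound: T_P <= (N/P + L) * C, where
# C is the worst per-block span + overhead. All inputs are hypothetical.
def mmm_time_bound(P, num_blocks, crit_path, span_plus_overhead_per_block):
    """Upper bound on running time with P streaming multiprocessors."""
    C = max(span_plus_overhead_per_block)        # C(P) from the worst block
    return (num_blocks / P + crit_path) * C

# Hypothetical program: 1,024 blocks, critical path of 10 blocks,
# per-block span + overhead between 40 and 64 operations.
bound = mmm_time_bound(P=16, num_blocks=1024, crit_path=10,
                       span_plus_overhead_per_block=[40, 52, 64])
print(bound)   # -> (1024/16 + 10) * 64 = 4736.0
```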

6. Empirical Validation and Performance Impact

Resource sharing is validated experimentally in (Jatala et al., 2015):

  • Register Sharing: Maximum improvement of 24%, average improvement of 11% across benchmark suites.
  • Scratchpad Sharing: Maximum improvement of 30%, average improvement of 12.5%.

On the algorithmic side (Haque et al., 2014):

  • Polynomial division: $s = 512$ yields a $\sim 2\times$ speedup, consistent with the occupancy and overhead model predictions.
  • Radix sort: Overhead is minimized when $2^s \approx \Theta(\ell)$; $s \approx \log_2 \ell$ matches the empirical optimum.
  • Euclidean GCD: $s \gg 1$ (up to 512) halves the running time.
  • Polynomial multiplication: Small $s$ yields optimal performance, as predicted.

Empirical results underscore the criticality of SM resource management and analytical modeling for maximizing throughput and utilization.

7. Significance and Outlook

SMs are central to the architectural and algorithmic efficiency of GPUs. Their hardware limits, resource fragmentation, and mechanisms for resource sharing fundamentally shape achievable occupancy, utilization, and program speedup. Analytical approaches such as the MMM enable principled algorithm design—predicting tradeoffs, guiding parameter selection, and optimizing for minimal overhead and maximal parallel throughput. Future architectural refinements may integrate finer-grained resource allocation, dynamic sharing, and enhanced hardware support for block pairing and synchronization, further improving SM scalability and efficiency.

References

  1. Jatala, V., Anantpur, J., and Karkare, A. (2015). Improving GPU Performance Through Resource Sharing.
  2. Haque, S. A., Moreno Maza, M., and Xie, N. (2014). A Many-Core Machine Model for Designing Algorithms with Minimum Parallelism Overheads.
