Concrete Design of the Per-Sub-Core L0 Instruction Prefetcher
Ascertain the exact hardware design of the per-sub-core L0 instruction cache prefetcher used in NVIDIA Ampere GPUs, including the specific prefetching algorithm and parameters (e.g., stream buffer depth), its trigger conditions, and interactions with the L1 instruction cache.
References
Although we have not been able to confirm the concrete design used in NVIDIA GPUs, we suspect it is a simple scheme like a stream buffer that prefetches successive memory blocks when a miss occurs.
— Analyzing Modern NVIDIA GPU cores
(2503.20481 - Huerta et al., 26 Mar 2025) in Section 5.2 (Front-end)
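For orientation, the kind of scheme the quoted passage suspects can be sketched as a toy model. This is only an illustration of a generic stream buffer and not the confirmed NVIDIA design, which is exactly what remains unknown. The class name `StreamBufferL0`, the parameters `l0_lines` and `depth`, the FIFO replacement policy, and the choice to restart the stream on every miss are all assumptions made for the example.

```python
from collections import deque


class StreamBufferL0:
    """Toy model of an L0 instruction cache backed by a stream buffer.

    Everything here is hypothetical: sizes, FIFO replacement, and the
    restart-on-miss trigger are illustrative choices, not measured values.
    """

    def __init__(self, l0_lines=4, depth=2):
        self.l0 = deque(maxlen=l0_lines)  # resident block addresses (FIFO eviction)
        self.stream = deque()             # prefetched successive block addresses
        self.depth = depth                # assumed stream buffer depth
        self.next_block = None            # next sequential block to prefetch
        self.l1_requests = []             # every block address requested from L1

    def _prefetch(self):
        # Top up the stream buffer with successive blocks until it is full.
        while len(self.stream) < self.depth:
            self.stream.append(self.next_block)
            self.l1_requests.append(self.next_block)
            self.next_block += 1

    def fetch(self, block):
        if block in self.l0:
            return "l0_hit"
        if self.stream and self.stream[0] == block:
            # Head of the stream matches: move it into L0 and keep streaming.
            self.stream.popleft()
            self.l0.append(block)
            self._prefetch()
            return "stream_hit"
        # Miss in both structures: demand-fetch from L1 and restart the stream
        # at the block that follows the missing one.
        self.l1_requests.append(block)
        self.l0.append(block)
        self.stream.clear()
        self.next_block = block + 1
        self._prefetch()
        return "miss"
```

In this sketch, sequential code pays for one demand miss and then hits in the stream buffer, while a taken branch to a distant block discards the prefetched blocks and restarts the stream. The open question is which of these parameters and trigger conditions, if any, match the real hardware.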