Concrete Design of the Per-Sub-Core L0 Instruction Prefetcher

Ascertain the exact hardware design of the per-sub-core L0 instruction cache prefetcher used in NVIDIA Ampere GPUs, including the specific prefetching algorithm and parameters (e.g., stream buffer depth), its trigger conditions, and interactions with the L1 instruction cache.

Background

The paper identifies a per-sub-core L0 instruction cache and presents an evaluation showing that modeling a simple stream-buffer-based prefetcher significantly improves accuracy and performance, approaching the behavior of a perfect instruction cache.

While their experiments suggest that a simple stream buffer explains observed behavior (with a likely size of 16 entries), the authors explicitly note they could not confirm the exact prefetcher design implemented by NVIDIA.
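To make the suspected mechanism concrete, the following is a minimal sketch of a generic stream buffer, not NVIDIA's confirmed design. The class name `StreamBuffer`, the `access` method, the head-only lookup, and the flush-on-miss policy are all illustrative assumptions; only the default depth of 16 comes from the size the paper's experiments suggest. The model works on block addresses, is assumed to be consulted only after an L0 miss, and ignores L1 instruction cache latency entirely.

```python
from collections import deque


class StreamBuffer:
    """Toy stream buffer (hypothetical model): a FIFO of successive block addresses."""

    def __init__(self, depth=16):
        # depth=16 mirrors the size the paper's experiments point to; the
        # actual hardware parameter is unconfirmed.
        self.depth = depth
        self.fifo = deque()      # prefetched block addresses, oldest first
        self.next_block = None   # next block address to prefetch

    def access(self, block):
        """Look up `block` after an L0 miss; return True if the buffer supplies it."""
        if self.fifo and self.fifo[0] == block:
            # Head hit: consume the entry and prefetch one more block
            # so the buffer stays full.
            self.fifo.popleft()
            self.fifo.append(self.next_block)
            self.next_block += 1
            return True
        # Buffer miss (assumed policy): drop the old stream and prefetch
        # the `depth` blocks that follow the missing one.
        self.fifo = deque(range(block + 1, block + 1 + self.depth))
        self.next_block = block + 1 + self.depth
        return False


# Straight-line code misses once, then streams from the buffer.
sb = StreamBuffer(depth=4)
hits = [sb.access(b) for b in range(4)]
print(hits)
```

In this sketch a taken branch to a distant block flushes the buffer and restarts the stream, which is why the trigger condition and the depth are exactly the parameters the open question asks about.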

References

Although we have not been able to confirm the concrete design used in NVIDIA GPUs, we suspect it is a simple scheme like a stream buffer that prefetches successive memory blocks when a miss occurs.

Analyzing Modern NVIDIA GPU cores (2503.20481 - Huerta et al., 26 Mar 2025) in Section 5.2 (Front-end)