Analyzing Modern NVIDIA GPU cores (2503.20481v1)
Abstract: GPUs are the most popular platform for accelerating HPC workloads, such as artificial intelligence and science simulations. However, most microarchitectural research in academia relies on GPU core pipeline designs based on architectures that are more than 15 years old. This paper reverse engineers modern NVIDIA GPU cores, unveiling many key aspects of its design and explaining how GPUs leverage hardware-compiler techniques where the compiler guides hardware during execution. In particular, it reveals how the issue logic works including the policy of the issue scheduler, the structure of the register file and its associated cache, and multiple features of the memory pipeline. Moreover, it analyses how a simple instruction prefetcher based on a stream buffer fits well with modern NVIDIA GPUs and is likely to be used. Furthermore, we investigate the impact of the register file cache and the number of register file read ports on both simulation accuracy and performance. By modeling all these new discovered microarchitectural details, we achieve 18.24% lower mean absolute percentage error (MAPE) in execution cycles than previous state-of-the-art simulators, resulting in an average of 13.98% MAPE with respect to real hardware (NVIDIA RTX A6000). Also, we demonstrate that this new model stands for other NVIDIA architectures, such as Turing. Finally, we show that the software-based dependence management mechanism included in modern NVIDIA GPUs outperforms a hardware mechanism based on scoreboards in terms of performance and area.
Summary
- The paper introduces a multi-pronged reverse-engineering approach that uncovers microarchitectural details of the NVIDIA RTX A6000, including a Compiler Guided Greedy Then Youngest (CGGTY) issue scheduler and a simplified register file cache.
- The paper employs detailed microbenchmarking and code analysis to quantify key performance metrics such as instruction latency, memory bandwidth, and register read port efficiency.
- The paper demonstrates that software-based dependence management and streamlined cache architecture lead to enhanced simulation accuracy, achieving an 18.24% reduction in MAPE compared to prior models.
Introduction
Graphics Processing Units (GPUs) are fundamental for accelerating High-Performance Computing (HPC) tasks, particularly in artificial intelligence, scientific computing, and data analytics. Their massively parallel architecture provides significant speedups over traditional CPUs. However, detailed microarchitectural information about modern GPUs, especially NVIDIA's, is often proprietary, leaving a gap in academic research which frequently relies on outdated models. This gap hinders accurate performance simulation, optimization efforts, and the development of future GPU architectures. The paper "Analyzing Modern NVIDIA GPU cores" (2503.20481) addresses this by reverse engineering contemporary NVIDIA GPU cores to provide an updated understanding of their inner workings, focusing on microarchitectural details and hardware-compiler co-design techniques.
1. Reverse Engineering Methodology
The paper employed a multi-pronged approach to reverse engineer the NVIDIA RTX A6000 GPU, representing a modern architecture. The methodology combined microbenchmarking, code analysis, and hardware experiments to uncover microarchitectural details.
- Microbenchmarking: Custom microbenchmarks were developed to isolate and measure specific hardware characteristics:
- Instruction Performance: Measured latency and throughput of various instruction types and sequences (a minimal timing-kernel sketch of this style of measurement appears after this list).
- Memory Subsystem: Probed the register file, L1 cache, shared memory, and global memory to determine bandwidth, latency, and caching behavior under different access patterns.
- Scheduler Behavior: Crafted kernels with instruction dependencies to infer scheduling policies.
- Register File Analysis: Investigated register access patterns, spilling, reuse distances, and the behavior of the register file cache.
- Code Analysis: The CUDA compiler (nvcc) and disassembled SASS (machine code) were analyzed to understand code generation and optimization strategies:
- Compiler Output Examination: Studied generated SASS code to understand instruction selection, register allocation, and memory access patterns.
- Hardware-Compiler Interactions: Identified how the compiler utilizes specific hardware features.
- Driver Analysis: Examined driver code for insights into GPU initialization and management.
- Hardware Experiments: Direct measurements were taken using hardware capabilities:
- Performance Counters: Utilized NVIDIA's performance counters (via tools like Nsight Compute) to measure hardware events like instruction issue rates, cache misses, and memory bandwidth utilization.
- Tools Used: The process relied on the CUDA Toolkit, NVIDIA Nsight Systems and Nsight Compute for profiling, disassemblers such as cuobjdump, and custom Python scripts for data analysis.
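As a concrete illustration of the microbenchmarking style referenced above, the following is a minimal sketch of a dependent-chain timing kernel, assuming a simple FMA chain timed with clock64(); the DEPTH constant and kernel structure are illustrative and not the paper's actual microbenchmarks.

#include <cstdio>
#include <cuda_runtime.h>

// Illustrative latency microbenchmark (not the paper's actual code): a chain of
// dependent FMAs timed with clock64(). Because each FMA uses the previous
// result, elapsed cycles divided by DEPTH approximate the FMA latency.
#define DEPTH 1024

__global__ void fma_latency(float a, float b, long long *cycles, float *sink) {
    float x = a;
    long long start = clock64();
    #pragma unroll
    for (int i = 0; i < DEPTH; ++i)
        x = fmaf(x, a, b);              // each iteration depends on the previous x
    long long stop = clock64();
    *cycles = stop - start;
    *sink = x;                          // keep the chain from being optimized away
}

int main() {
    long long *d_cycles, h_cycles = 0;
    float *d_sink;
    cudaMalloc(&d_cycles, sizeof(long long));
    cudaMalloc(&d_sink, sizeof(float));
    fma_latency<<<1, 1>>>(1.0001f, 0.5f, d_cycles, d_sink);   // a single thread isolates latency
    cudaMemcpy(&h_cycles, d_cycles, sizeof(long long), cudaMemcpyDeviceToHost);
    printf("approx. FMA latency: %lld cycles\n", h_cycles / DEPTH);
    cudaFree(d_cycles);
    cudaFree(d_sink);
    return 0;
}

In practice, the generated SASS would be inspected with cuobjdump to confirm that the dependent chain survives compilation and that the measured cycle delta is dominated by the instruction of interest.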
2. Key Microarchitectural Discoveries
The reverse engineering process yielded several significant findings about the microarchitecture of modern NVIDIA GPU cores:
- Issue Logic and Scheduler Policy: Modern cores use issue logic that is heavily guided by the compiler. The analysis points towards a Compiler Guided Greedy Then Youngest (CGGTY) issue scheduler: among ready warps (groups of threads whose dependencies and resources are satisfied), the scheduler greedily keeps issuing from the warp chosen in the previous cycle and, when that warp stalls, switches to the youngest ready warp (a minimal sketch of this policy appears after this list). Dependencies are managed using control bits embedded in the instruction stream by the compiler rather than by complex hardware scoreboarding. Plausible models for the fetch stage and its scheduler were also developed.
- Register File (RF) and Cache: The structure and operation of the register file were analyzed:
- Organization: Registers are banked for parallel access.
- Register File Cache: A cache exists to buffer frequently accessed registers, improving performance and reducing energy consumption. The paper suggests a relatively simple cache design is effective.
- Read Ports: Analysis indicates that a single read port per RF bank, combined with the cache, performs nearly as well as an idealized RF with unlimited ports for typical workloads.
- Operand Collectors: The analysis confirmed the absence of operand collector units, simplifying the execution pipeline model.
- Memory Pipeline Features: Details about the memory pipeline were uncovered:
- Load/Store Queues: The sizes of queues managing memory requests were estimated.
- Shared Memory Access: The rate at which sub-cores (SM partitions) can issue requests to shared memory was determined.
- Instruction Latencies: Latencies for various memory instructions were measured, crucial for accurate performance modeling.
- Instruction Prefetcher: An instruction prefetcher is vital for high throughput. The findings suggest NVIDIA GPUs likely use a stream buffer-based prefetcher: it detects sequential instruction access patterns and proactively fetches subsequent cache lines into a small buffer, reducing fetch latency (a minimal model is sketched after this list). Modeling even a naive stream buffer significantly improves simulation accuracy compared to assuming a perfect instruction cache.
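To make the issue-policy description above concrete, below is a minimal host-side sketch of how a Compiler Guided Greedy Then Youngest selection could be modeled in a simulator. The Warp fields, the readiness test, and the use of a simple age counter are assumptions for illustration, not the paper's implementation.

#include <cstdint>
#include <vector>

// Illustrative "greedy then youngest" selection: keep issuing from the warp
// issued last cycle while it stays ready (greedy); otherwise fall back to the
// youngest ready warp. The ready flag stands in for the compiler-set control
// bits; field names are assumptions for illustration.
struct Warp {
    uint64_t age;        // allocation order within the sub-core; larger = younger
    bool     ready;      // true when all dependences are reported satisfied
};

int select_warp_cggty(const std::vector<Warp> &warps, int last_issued) {
    // Greedy: stay on the warp that issued last cycle if it is still ready.
    if (last_issued >= 0 && warps[last_issued].ready)
        return last_issued;

    // Otherwise pick the youngest ready warp (largest age value here).
    int best = -1;
    for (int i = 0; i < static_cast<int>(warps.size()); ++i)
        if (warps[i].ready && (best < 0 || warps[i].age > warps[best].age))
            best = i;
    return best;  // -1 means no warp can issue this cycle
}

A cycle-level model would refresh each warp's ready flag from the compiler-provided control bits before every selection.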
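Similarly, the stream-buffer prefetcher described above can be approximated with a very small model; the depth, line size, and restart policy below are illustrative assumptions rather than the hardware's actual parameters.

#include <cstdint>
#include <deque>

// Illustrative stream-buffer model for instruction fetch. The buffer is probed
// on an instruction-cache miss: a hit on the head entry pops it and prefetches
// the next sequential line; a miss flushes the buffer and restarts the stream
// after the missing line.
class StreamBuffer {
public:
    StreamBuffer(unsigned depth, unsigned line_size)
        : depth_(depth), line_size_(line_size), next_line_(0) {}

    // Returns true if the miss address is covered by the stream buffer.
    bool probe(uint64_t fetch_addr) {
        uint64_t line = fetch_addr / line_size_;
        if (!lines_.empty() && lines_.front() == line) {
            lines_.pop_front();                 // consume the head entry
            lines_.push_back(next_line_++);     // keep prefetching ahead of the fetch stream
            return true;
        }
        restart(line + 1);                      // miss: stream the lines that follow it
        return false;
    }

private:
    void restart(uint64_t first_line) {
        lines_.clear();
        next_line_ = first_line;
        for (unsigned i = 0; i < depth_; ++i)
            lines_.push_back(next_line_++);
    }

    unsigned depth_, line_size_;
    uint64_t next_line_;
    std::deque<uint64_t> lines_;
};

Even a naive sequential model of this kind captures most of the accuracy benefit the paper reports relative to assuming a perfect instruction cache.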
3. Software-Based Dependence Management vs. Hardware Scoreboarding
A key finding is the reliance on software-based dependence management in modern NVIDIA GPUs, moving away from traditional hardware scoreboarding.
- Hardware Scoreboarding: Tracks instruction dependencies directly in hardware, using scoreboard entries for each register to indicate pending reads/writes. Instructions execute only when operands are ready and destination registers are free. This requires significant hardware area and complexity, especially with many registers and wide instruction issue.
- Software-Based Dependence Management (NVIDIA): The compiler analyzes the code and embeds explicit synchronization instructions or control bits (such as stall counts and dependence barriers) in the instruction stream. These mechanisms ensure dependencies are met without continuous hardware monitoring (an illustrative decoding of such a control word is sketched after this list).
- Advantages: Reduces hardware complexity, saving die area and power. This allows dedicating silicon resources to other performance-critical units (e.g., more ALUs, larger caches).
- Trade-offs: Performance relies on the compiler's effectiveness in inserting necessary, but not overly conservative, synchronization.
- Findings: The paper (2503.20481) demonstrates that this software-based approach is superior in both area efficiency and performance for modern GPU workloads, indicating that NVIDIA's compiler technology effectively manages dependencies with minimal overhead compared to the cost of hardware scoreboarding.
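To illustrate the kind of compiler-embedded control information involved, the sketch below decodes a hypothetical per-instruction control word and checks whether a warp may issue. The field layout (stall count, yield, write/read dependence-barrier indices, wait mask) follows prior public reverse-engineering of NVIDIA SASS control bits and is an assumption here, not the exact A6000 encoding.

#include <cstdint>

// Illustrative decode of a per-instruction control word of the kind the
// compiler embeds to manage dependences. Field widths and positions are
// assumptions for illustration.
struct ControlBits {
    unsigned stall;       // cycles to wait before issuing the next instruction from this warp
    bool     yield;       // hint to switch to another warp
    unsigned write_bar;   // barrier set when this instruction's result is ready
    unsigned read_bar;    // barrier set when its source operands have been read
    unsigned wait_mask;   // barriers this instruction must wait on before issuing
};

ControlBits decode_control(uint32_t word) {
    ControlBits cb;
    cb.stall     =  word        & 0xF;   // bits [3:0]
    cb.yield     = (word >> 4)  & 0x1;   // bit  [4]
    cb.write_bar = (word >> 5)  & 0x7;   // bits [7:5]
    cb.read_bar  = (word >> 8)  & 0x7;   // bits [10:8]
    cb.wait_mask = (word >> 11) & 0x3F;  // bits [16:11]
    return cb;
}

// A warp may issue its next instruction only when every barrier named in the
// wait mask has been cleared by earlier long-latency instructions.
bool can_issue(const ControlBits &cb, uint8_t cleared_barriers) {
    return (cb.wait_mask & ~cleared_barriers) == 0;
}

Under a scheme like this, a long-latency instruction names a barrier, and later dependent instructions simply wait on it via the wait mask, so the hardware never has to track per-register state.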
4. Impact of Register File Cache and Read Ports
The register file cache and the number of read ports significantly influence performance and simulation accuracy. Sensitivity analyses were performed using the simulation model:
- Register File Cache Sensitivity: Simulating various cache sizes revealed diminishing returns. Performance improves significantly with a small cache compared to none, but gains plateau beyond a certain size as the cache captures most frequently accessed values. This suggests a moderately sized cache provides a good balance between performance, area, and power.
- Read Ports Impact: Simulating different numbers of read ports showed that while more ports generally improve performance (especially for workloads with high register pressure), the benefit diminishes. The analysis indicated that a configuration with a single read port per bank, combined with an effective register file cache, achieves performance close to an ideal configuration with unlimited ports for many applications (a bank-level sketch of this model follows the pseudocode below).
- Optimal Configuration: The ideal setup depends on target workloads and design constraints (area, power). Accurate modeling of the cache and read ports is crucial for simulation fidelity. The paper's detailed model significantly reduced simulation errors by capturing these effects.
# Pseudocode illustrating the sensitivity-analysis loop
results = {}
for workload in benchmark_suite:
    results[workload] = {}
    for cache_config in cache_configurations:
        for read_port_config in read_port_configurations:
            # Configure the simulator
            simulator.set_rf_cache(cache_config)
            simulator.set_rf_read_ports(read_port_config)
            # Run the simulation
            execution_cycles = simulator.run(workload)
            # Store the result for this configuration
            results[workload][(cache_config, read_port_config)] = execution_cycles
# Analyze results to find optimal configurations and trade-offs
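As referenced above, a bank-level sketch of how such a configuration can be modeled when counting read cycles is shown below; the bank count, cache capacity, banking function, and allocation policy are illustrative assumptions, not the values reverse engineered in the paper.

#include <algorithm>
#include <cstddef>
#include <unordered_set>
#include <vector>

// Illustrative register-read model: each bank has a single read port, and a
// small register file cache in front of the banks satisfies cached operands
// without consuming a port. Reads that miss the cache and map to the same bank
// are serialized through that bank's single port.
constexpr int NUM_BANKS = 4;
constexpr std::size_t RF_CACHE_CAPACITY = 8;

struct RegisterFileModel {
    std::unordered_set<int> rf_cache;   // registers currently held in the RF cache

    // Cycles needed to read one instruction's source registers.
    int read_cycles(const std::vector<int> &src_regs) {
        int per_bank[NUM_BANKS] = {0};
        for (int reg : src_regs) {
            if (rf_cache.count(reg))
                continue;                        // served by the cache, no port needed
            per_bank[reg % NUM_BANKS]++;         // banked by low-order register bits
            if (rf_cache.size() < RF_CACHE_CAPACITY)
                rf_cache.insert(reg);            // allocate on read; eviction omitted for brevity
        }
        int cycles = 1;
        for (int b = 0; b < NUM_BANKS; ++b)
            cycles = std::max(cycles, per_bank[b]);
        return cycles;                           // the worst bank determines read time
    }
};

With an effective cache, most instructions see a read time of one cycle even with a single port per bank, which matches the qualitative behavior the sensitivity analysis reports.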
5. Modeling and Simulation Improvements
Based on the reverse engineering findings, a new GPU core model was developed and integrated into the Accel-sim cycle-accurate simulator.
- Model Development: A modular design was used, modeling components like the issue logic (CGGTY policy), register file (including cache), memory pipeline (queues, latencies), and prefetcher (stream buffer) separately.
- Integration: The new modules were integrated into Accel-sim, defining new data structures and interactions within the event-driven simulation framework.
- Validation: The enhanced simulator was validated against hardware measurements from the NVIDIA RTX A6000 GPU using a suite of benchmarks. Simulation parameters (clock frequency, memory bandwidth, core count) were calibrated.
- Accuracy Improvement: The new model achieved an average Mean Absolute Percentage Error (MAPE) of 13.98% in predicting execution cycles on the RTX A6000 (MAPE is computed per benchmark as sketched below). This is 18.24% lower MAPE than previous state-of-the-art models, demonstrating the significant impact of incorporating the discovered microarchitectural details.
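For reference, MAPE over per-benchmark cycle counts is the mean of |simulated - measured| / measured, expressed as a percentage; a small helper of the kind used to aggregate such results is sketched below (an illustration, not the paper's evaluation script).

#include <cmath>
#include <cstddef>
#include <vector>

// MAPE across benchmarks: average relative error between simulated and
// measured execution cycles, expressed as a percentage.
double mape(const std::vector<double> &simulated, const std::vector<double> &measured) {
    double sum = 0.0;
    for (std::size_t i = 0; i < simulated.size(); ++i)
        sum += std::fabs(simulated[i] - measured[i]) / measured[i];
    return 100.0 * sum / static_cast<double>(simulated.size());
}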
6. Architecture Generalization
The developed model demonstrates portability across different NVIDIA architectures.
- Turing Validation: The model was tested against Turing-based GPUs, achieving an average MAPE of 16.5%. The slightly higher error compared to Ampere (RTX A6000) is attributed to known architectural differences.
- Model Adjustments: To adapt the model for Turing, key parameters related to the memory hierarchy were adjusted, such as L1 cache line size (128 bytes on Ampere vs. 64 bytes on Turing) and associativity. These adjustments were crucial for maintaining accuracy. Without them, the MAPE on Turing exceeded 25%.
- Example Adjustment (C++):
// Adjusting the modeled L1 cache line size based on the target architecture
#if defined(AMPERE_ARCH)
#define L1_CACHE_LINE_SIZE 128
#elif defined(TURING_ARCH)
#define L1_CACHE_LINE_SIZE 64
#else
#error "Unsupported architecture"
#endif

// Usage in the simulator's cache access logic
int set_index = (address / L1_CACHE_LINE_SIZE) % NUM_L1_SETS;
- Future Refinements: Further improvements for Turing could involve modeling its specific hierarchical thread scheduling mechanism more accurately.
7. Implications and Future Work
The findings from "Analyzing Modern NVIDIA GPU cores" (2503.20481) have significant practical implications:
- Improved Simulation: The enhanced model provides a more accurate baseline for academic research and industrial development, enabling better performance prediction, bottleneck analysis, and exploration of architectural trade-offs without immediate hardware testing.
- Compiler Optimization: Understanding the CGGTY scheduler and software-based dependence management helps compiler developers generate more efficient code tailored to the hardware's capabilities.
- Application Tuning: Knowledge of memory pipeline details (latencies, queue sizes, prefetching) allows application developers to optimize memory access patterns and data layouts for better performance.
- Future GPU Design: The analysis highlights the effectiveness of certain design choices (e.g., simple RF cache, software dependence management) and provides insights for designing future architectures.
Future Research Directions:
- Extend the model to include more details of the memory subsystem (L2 cache, memory controllers).
- Investigate alternative instruction scheduling policies and register file designs.
- Explore hardware-compiler co-design opportunities further.
- Analyze potential security implications of the revealed microarchitectural features.
This work significantly advances the public understanding of modern NVIDIA GPU microarchitecture, providing valuable tools and insights for researchers, developers, and hardware designers working with high-performance GPU computing.
Related Papers
- Efficient LLM inference solution on Intel GPU (2023)
- Demystifying the Nvidia Ampere Architecture through Microbenchmarking and Instruction-level Analysis (2022)
- Dissecting GPU Memory Hierarchy through Microbenchmarking (2015)
- Gaze into the Pattern: Characterizing Spatial Patterns with Internal Temporal Correlations for Hardware Prefetching (2024)
- Dissecting the NVIDIA Hopper Architecture through Microbenchmarking and Multiple Level Analysis (2025)