Cause of 16×2 kernel outperforming 8×5 despite higher predicted memory operations

Determine the factors that cause the m_r = 16, k_r = 2 kernel in the proposed register-reuse algorithm for applying sequences of planar rotations to a matrix to achieve higher performance than the m_r = 8, k_r = 5 kernel, even though Equation (kernelmemops) predicts that the m_r = 16, k_r = 2 kernel requires almost twice as many memory operations. Identify the conditions under which this performance discrepancy occurs and explain the underlying reasons for it.

Background

The paper introduces a new register-reuse kernel for efficiently applying sequences of planar rotations to a matrix and analyzes the memory operations required by different kernel configurations. According to the derived memory-operation model (Equation (kernelmemops)), a kernel with m_r = 16 and k_r = 2 should require almost twice as many memory operations as a kernel with m_r = 8 and k_r = 5, suggesting the latter might be faster.

However, experimental results show that the m_r = 16, k_r = 2 kernel performs slightly better than the m_r = 8, k_r = 5 kernel. The authors explicitly state they lack a satisfying explanation for this observation, highlighting an unresolved question about which architectural or algorithmic effects dominate and lead to the unexpected performance outcome.
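To make the size of the predicted gap concrete, the following minimal sketch evaluates the memory-operation model, (2/k_r + 2/n_b + 2/m_r) · m_b · (n_b − k_b) · k_b, for both kernel shapes. The function name `kernel_memops` and the block sizes `m_b`, `n_b`, `k_b` are illustrative assumptions, not values taken from the paper; only the formula and the two (m_r, k_r) pairs come from the source.

```python
def kernel_memops(m_r, k_r, m_b, n_b, k_b):
    """Memory operations predicted by the model
    (2/k_r + 2/n_b + 2/m_r) * m_b * (n_b - k_b) * k_b."""
    return (2 / k_r + 2 / n_b + 2 / m_r) * m_b * (n_b - k_b) * k_b


# Hypothetical block sizes, chosen only for illustration (not from the paper).
m_b, n_b, k_b = 960, 240, 60

ops_16x2 = kernel_memops(m_r=16, k_r=2, m_b=m_b, n_b=n_b, k_b=k_b)
ops_8x5 = kernel_memops(m_r=8, k_r=5, m_b=m_b, n_b=n_b, k_b=k_b)

# The common factor m_b * (n_b - k_b) * k_b cancels in the ratio, so only
# the bracketed term (and hence m_r, k_r, and n_b) determines the result.
ratio = ops_16x2 / ops_8x5
print(f"16x2 kernel: {ops_16x2:.3e} predicted memory operations")
print(f"8x5 kernel:  {ops_8x5:.3e} predicted memory operations")
print(f"ratio 16x2 / 8x5: {ratio:.2f}")
```

Under these assumed block sizes the ratio comes out at roughly 1.7, and for large n_b it approaches 1.125 / 0.65 ≈ 1.73, which is in line with the "almost twice as many" description. The model counts memory operations only, so it says nothing about effects such as instruction scheduling or vector width that might explain the measured result.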

References

It is also noteworthy that according to Equation (kernelmemops), the m_r = 16, k_r = 2 kernel needs almost twice as many memory operations as the m_r = 8, k_r = 5 kernel. We do not currently have a satisfying explanation as to why it is still faster.

Equation (kernelmemops):

$$\left(\frac{2}{k_r} + \frac{2}{n_b} + \frac{2}{m_r}\right) m_b (n_b - k_b) k_b \text{ memory operations.}$$

Communication efficient application of sequences of planar rotations to a matrix (2412.01852 - Steel et al., 29 Nov 2024) in Subsection “Selecting kernel size”, Section “Experiments”