Cause of 16×2 kernel outperforming 8×5 despite higher predicted memory operations
Determine why the m_r = 16, k_r = 2 kernel in the proposed register-reuse algorithm for applying sequences of planar rotations to a matrix achieves higher performance than the m_r = 8, k_r = 5 kernel, even though Equation (kernelmemops) predicts that the m_r = 16, k_r = 2 kernel requires almost twice as many memory operations. Identify the conditions under which this discrepancy occurs and explain its underlying causes.
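For orientation, the operation whose memory traffic is being counted can be sketched as follows. This is a minimal illustration with no register reuse, not the proposed kernel and not a reproduction of Equation (kernelmemops). The function name `apply_rotation_sequence`, the `(j, c, s)` triple layout, the choice of adjacent column pairs, and the sign convention are all assumptions made for this example.

```python
def apply_rotation_sequence(A, rotations):
    """Apply planar rotations to adjacent column pairs of A, in place.

    A is a list of rows (lists of floats). Each rotation is a hypothetical
    (j, c, s) triple acting on columns j and j + 1 with cosine c and sine s.
    Sign convention (an assumption): x' = c*x + s*y, y' = -s*x + c*y.
    Returns (loads, stores), the element-level memory operations performed
    when nothing is kept in registers between rotations.
    """
    loads = 0
    stores = 0
    for j, c, s in rotations:
        for row in A:
            x, y = row[j], row[j + 1]      # two element loads
            row[j] = c * x + s * y         # first element store
            row[j + 1] = -s * x + c * y    # second element store
            loads += 2
            stores += 2
    return loads, stores


# Usage: two rotations applied to a 2 x 3 matrix.
A = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]
loads, stores = apply_rotation_sequence(A, [(0, 0.6, 0.8), (1, 0.0, 1.0)])
```

In this unoptimized form, every rotation loads and stores two elements per row, so k rotations on m rows cost 2mk loads and 2mk stores. A register-reuse kernel aims to reduce this count by holding an m_r-row block in registers while k_r rotations are applied to it.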
References
It is also noteworthy that according to Equation (kernelmemops), the m_r = 16, k_r = 2 kernel needs almost twice as many memory operations as the m_r = 8, k_r = 5 kernel. We do not currently have a satisfying explanation as to why it is still faster.
eq: kernelmemops: