Unexplained ORHR_COL slowdown on AMD CPUs

Identify and explain why LAPACK’s ORHR_COL routine performs slowly on the AMD EPYC 9734 platform in the reported experiments. In particular, determine whether the bottleneck arises from the ORHR_COL algorithm itself, from the vendor library implementation used (Intel MKL on AMD hardware), or from hardware‑specific factors.

Background

In the CPU runtime breakdowns, the authors observe that certain components behave differently on Intel and AMD systems. They offer plausible reasons why sequential column permutation may appear slower on AMD (e.g., thread count and BLAS implementation), but they explicitly state that they lack an explanation for the performance of ORHR_COL on AMD.

ORHR_COL reconstructs Householder vectors from an explicit economical orthogonal factor, and is invoked when using Cholesky QR within BQRRP. Understanding its performance disparity is important for diagnosing bottlenecks and guiding optimizations or alternative implementations.
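To make the operation concrete, the following is a minimal, unblocked NumPy sketch of Householder reconstruction by LU factorization without pivoting of Q minus a diagonal sign matrix. It illustrates the general technique only and is not LAPACK's or MKL's blocked implementation, so it says nothing about the AMD timings. The function name `householder_reconstruct` and its return convention are assumptions made for this example.

```python
import numpy as np

def householder_reconstruct(Q):
    """Illustrative sketch (hypothetical helper, not the LAPACK routine).

    Given an m-by-n Q with orthonormal columns, return (Y, T, s) such that
    the first n columns of I - Y @ T @ Y.T, with column j scaled by s[j],
    reproduce Q. Y is unit lower trapezoidal and T is upper triangular.
    """
    m, n = Q.shape
    A = np.array(Q, dtype=float)   # working copy, overwritten by L and U
    s = np.empty(n)
    for j in range(n):
        # Choose the sign so that the pivot has magnitude at least 1,
        # which is why no row pivoting is needed.
        s[j] = -1.0 if A[j, j] >= 0 else 1.0
        A[j, j] -= s[j]
        A[j + 1:, j] /= A[j, j]                                     # column of L
        A[j + 1:, j + 1:] -= np.outer(A[j + 1:, j], A[j, j + 1:])   # Schur update
    Y = np.tril(A, -1)
    Y[np.arange(n), np.arange(n)] = 1.0    # unit diagonal of L
    U = np.triu(A[:n, :])
    # T = -U * diag(s) * inv(Y1).T, where Y1 is the top n-by-n block of Y.
    T = -np.linalg.solve(Y[:n, :], (U * s).T).T
    return Y, T, s
```

Calling `householder_reconstruct(Q)` on the Q factor of a tall random matrix and forming `np.eye(m) - Y @ T @ Y.T` should give an orthogonal matrix whose leading columns match Q up to the signs in `s`. The inner loop is dominated by rank-one updates, which a production implementation would typically reorganize into blocked matrix-matrix operations.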

References

We do not have an explanation for the slow performance of ORHR_COL on the AMD system.

Anatomy of High-Performance Column-Pivoted QR Decomposition (2507.00976 - Melnichenko et al., 1 Jul 2025) in Section 5 (CPU performance breakdown, after Figure 6)