Comparison of Vectorization Capabilities of Different Compilers for X86 and ARM CPUs

Published 17 Feb 2025 in cs.PF and cs.DC (arXiv:2502.11906v1)

Abstract: Most modern processors contain vector units that simultaneously perform the same arithmetic operation over multiple sets of operands. The ability of compilers to automatically vectorize code is critical to effectively using these units. Understanding this capability is important for anyone writing compute-intensive, high-performance, and portable code. We tested the ability of several compilers to vectorize code on x86 and ARM. We used the TSVC2 suite, with modifications that made it more representative of real-world code. On x86, GCC reported 54% of the loops in the suite as having been vectorized, ICX reported 50%, and Clang, 46%. On ARM, GCC reported 56% of the loops as having been vectorized, ACFL reported 54%, and Clang, 47%. We found that the vectorized code did not always outperform the unvectorized code. In some cases, given two very similar vectorizable loops, a compiler would vectorize one but not the other. We also report cases where a compiler vectorized a loop on only one of the two platforms. Based on our experiments, we cannot definitively say if any one compiler is significantly better than the others at vectorizing code on any given platform.

Summary

  • The paper quantifies loop vectorization rates and challenges across compilers (GCC, Clang, ICX, ACFL) on x86 and ARM platforms.
  • It shows that vectorization success varies drastically by code pattern, with only about one-third of the loops vectorized by all compilers in common.
  • The study reveals inconsistent performance gains from vectorization, urging explicit benchmarking and manual review for mission-critical HPC code.

Comparative Evaluation of Compiler Vectorization on X86 and ARM Architectures

Abstract

This essay provides an in-depth analysis and synthesis of the paper "Comparison of Vectorization Capabilities of Different Compilers for X86 and ARM CPUs" (2502.11906). The study systematically benchmarks GCC, Clang, Intel's ICX, and ARM's ACFL using a modified TSVC2 loop suite on recent x86 (Intel Xeon(R) Gold 6152) and ARM (A64FX) CPUs, focusing on both the breadth of automatic vectorization and the relative and absolute performance of the resulting machine code. The implications for compiler design, code portability, and high-performance computing are examined.

Experimental Setup

The research uses four modern compilers: GCC 14.1.1, Clang 18.1.8, ICX 2024.0.2, and ACFL 22.2, targeting x86 (with AVX-512) and ARM (with SVE). The TSVC2 suite is adapted for higher realism: all global compile-time information (loop bounds, array sizes) is eliminated, mirroring production HPC settings where arrays are dynamically allocated and loop trips are runtime-dependent. The two hardware platforms, Intel Xeon and Fujitsu A64FX, allow direct comparison of vectorization efficacy and code generation on ISAs with distinct vector unit architectures.
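The flavor of this modification can be sketched with a minimal, hypothetical TSVC-style kernel (the function name and signature below are illustrative, not taken from the suite): both the trip count and the arrays are supplied at run time, so the compiler cannot specialize on fixed loop bounds or statically sized global arrays when deciding whether to vectorize.

```c
/* Hypothetical TSVC-style kernel after the paper's modifications:
 * n and the array pointers are runtime values, mirroring production
 * HPC code with dynamically allocated data. `restrict` tells the
 * compiler the arrays do not alias, which auto-vectorizers need. */
void s000(float *restrict a, const float *restrict b, long n) {
    for (long i = 0; i < n; i++)
        a[i] = b[i] + 1.0f;
}
```

With compile-time bounds removed, the vectorizer must emit a runtime trip-count check and a scalar epilogue loop, which is exactly the codegen path the study exercises.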

Vectorization Coverage and Cross-Platform Consistency

On x86, GCC reports success for 54% of loops, ICX for 50%, and Clang for 46%. On ARM, the numbers rise slightly: GCC at 56%, ACFL at 54%, and Clang at 47%. Notably, vectorization is far from universal—even on carefully selected benchmarks, a significant fraction of loops are not vectorized by any compiler on either platform. The cross-compiler intersection is limited, with only about one-third of loops vectorized by all compilers and a comparable fraction vectorized by none (Figure 1).

Figure 1: Aggregate geometric mean execution times of all compilers on x86, revealing competitive but not dominating behavior among GCC, Clang, and ICX.

A closer examination reveals that, despite similar overall vectorization rates, compilers often diverge in which loops they handle. Loops with conditional branches, non-unit strides, reverse accesses, reductions, and indirect addressing constitute stress points. Several cases were identified where a compiler vectorized one instance of a pattern but failed on a nearly identical one, pointing to architectural idiosyncrasies in their vectorization analyses.
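The kinds of near-identical pairs described above can be illustrated with a hedged sketch (these are representative patterns, not verbatim TSVC2 kernels): the first loop is a plain conditional store that most vectorizers handle via if-conversion to a masked operation, while the second differs only in having a non-unit stride, which some vectorizers reject as unprofitable.

```c
/* Conditional store: typically if-converted to a masked/select form
 * and vectorized by all four compilers. */
void cond_select(float *restrict a, const float *restrict b, long n) {
    for (long i = 0; i < n; i++)
        if (b[i] > 0.0f)
            a[i] = b[i];
}

/* Same body, but a stride-2 access pattern: structurally almost
 * identical, yet a compiler's cost model may deem the strided loads
 * and stores unprofitable and leave the loop scalar. */
void strided(float *restrict a, const float *restrict b, long n) {
    for (long i = 0; i < n; i += 2)
        a[i] = b[i];
}
```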

Performance Results and Anomalies

The performance evaluation demonstrates that vectorization, when reported, does not uniformly translate into superior execution speed. For x86, ICX is fastest on 40% of loops, GCC on 39%, and Clang on 21%. On ARM, Clang leads (65%), followed by GCC (22%) and ACFL (13%). The lack of a single superior compiler is underscored—relative performance varies with both microarchitecture and code pattern (Figure 2).

Figure 2: Cross-architecture geometric mean execution times, showing that no compiler consistently outperforms others across platforms.

In the subset of loops vectorized exclusively by one compiler, gains are not universal. There are instances where the vectorized path underperforms unvectorized code due to increased register pressure, suboptimal vector instruction selection, or lack of profit-based heuristics. For example, ICX uniquely vectorizes some indirect-access loops on x86 using gather instructions, but these do not deliver speedup, indicating limitations in both hardware gather implementations and profitability models. On ARM, unrolling in Clang sometimes matches or surpasses the performance of other compilers' vectorized code, suggesting that SVE code generation or auto-vectorization infrastructure remains immature in parts (Figure 3).

Figure 3: x86 loop execution times where only GCC applies vectorization, illustrating both improved and degraded performance relative to other compilers.

Figure 4: Execution times on x86 for loops vectorized by other compilers but not GCC, exposing cases where vectorization offers marginal or negative benefit.

Vectorized loops involving reductions or complex dependences often see divergent choices: GCC and Clang sometimes attempt partial vectorization, while ICX and ACFL rely on explicit gather/scatter or predicate logic. In some situations, loops are vectorized on x86 but not on ARM or vice versa, suggesting a lack of cross-architecture portability in vectorization analyses despite a uniform source base.
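The reduction case is worth a concrete sketch. A floating-point sum reduction is only legally vectorizable if the compiler is allowed to reorder the additions (e.g., under `-ffast-math` or an equivalent relaxed-FP flag), which is one reason compilers diverge on these loops:

```c
/* Sum reduction: vectorizing this loop accumulates partial sums in
 * vector lanes and combines them at the end, which reorders the
 * floating-point additions. Strict IEEE semantics forbid that, so
 * compilers vectorize it only under relaxed-FP flags; integer
 * reductions face no such restriction. */
float sum_reduce(const float *a, long n) {
    float s = 0.0f;
    for (long i = 0; i < n; i++)
        s += a[i];
    return s;
}
```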

Analysis of Vectorization on Irregular Patterns

A key finding concerns indirect-memory-access (IMA) loops. Across both ISAs, only two of the eight TSVC2 IMA patterns are effectively vectorized by any compiler—by ICX on x86 and GCC/ACFL on ARM. Clang does not utilize the relevant gather/scatter hardware support on either platform for these cases. Furthermore, even when gather/scatter instructions are emitted, there is no guarantee of improvement, highlighting the inadequacy of current compilers' profitability models and code generation for IMAs, which remain an ongoing challenge in high-performance, memory-bound applications.
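The basic IMA shape at issue looks like the following sketch (illustrative, not a specific TSVC2 kernel): the load address depends on a runtime index array, so vectorizing it requires hardware gather support (AVX-512 gather on x86, SVE gather loads on ARM) plus a cost model that believes the gather pays off.

```c
/* Indirect-memory-access (gather) pattern: b is read through an index
 * array, so consecutive iterations touch non-contiguous addresses.
 * A vectorizer must emit a gather instruction (or give up), and even
 * an emitted gather may be no faster than scalar loads on current
 * hardware, per the paper's measurements. */
void ima_gather(float *restrict a, const float *restrict b,
                const int *restrict idx, long n) {
    for (long i = 0; i < n; i++)
        a[i] = b[idx[i]];
}
```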

Implications for Compiler Development and HPC Portability

The study illustrates that no contemporary compiler demonstrates a clear overall advantage in vectorization ability or resulting performance across both major CPU architectures. The disparities highlight gaps in vectorizer analysis frameworks, especially for non-trivial control flow, reductions, or IMAs, and the absence of fully portable, consistently profitable vectorization strategies across heterogeneous platforms. Proprietary, vendor-tuned compilers (ICX, ACFL) do not universally outperform open-source alternatives.

For end-user HPC developers, these results argue for explicit multi-compiler benchmarking and careful manual review of compiler reports and generated code when performance is mission-critical. For compiler/toolchain researchers, there are evident opportunities for improvement in heuristic design, pattern recognition in loop analyses, and cross-target intermediate representations for robust vectorizability.

Speculation on Future Directions

Automated vectorization for complex control flow, dynamic trip counts, or IMAs remains unsolved. Integrating polyhedral models more deeply into open-source compilers, improving profitability heuristics with hardware performance counter feedback, and using ML-guided vectorization strategies are promising avenues. Evolution on the hardware side (e.g., better gather/scatter throughput) may require further collaboration between architecture and compiler designers. Standardization of vectorization pragmas or annotations could also improve both portability and performance predictability in heterogeneous HPC settings.
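One such annotation already exists in a widely adopted standard: the OpenMP `simd` construct, which lets the programmer assert that a loop is safe to vectorize, removing the legality question from the compiler's dependence analysis (a sketch; enable it with `-fopenmp-simd` on GCC/Clang, and note the pragma is silently ignored where unsupported):

```c
/* #pragma omp simd asserts no loop-carried dependences, so the
 * compiler may vectorize without proving safety itself. Profitability
 * is still the compiler's call, which is why the paper's performance
 * anomalies would not all disappear under annotation. */
void scaled_add(float *restrict a, const float *restrict b, long n) {
    #pragma omp simd
    for (long i = 0; i < n; i++)
        a[i] += 2.0f * b[i];
}
```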

Conclusion

A comprehensive, quantitative assessment of four state-of-the-art compilers on contemporary HPC CPUs establishes that automatic vectorization remains incomplete and non-uniform even for well-characterized loop kernels. Vectorization success and realized performance are highly sensitive to loop structure and platform, and even aggressive vendor compilers do not guarantee superior handling. These findings set a clear research agenda for vectorizer improvement, cross-platform portability analysis, and better integration of profitability models within the compiler optimization pipeline (2502.11906).
