Effective accumulation for positional population count on variable-length vector extensions (e.g., AArch64/SVE)

Develop an effective implementation of the two accumulation steps in the Clausecker–Lemire–Schintke positional population count algorithm for variable-length vector architectures such as the AArch64 Scalable Vector Extension (SVE), where the hardware vector length varies across implementations. The two steps are the intermediate accumulation, which reduces the a16 bit-vector into counter vectors, and the final accumulation, which transposes and reduces the bit-vectors a8, a4, a2, and a1 into the counters.

Background

The paper by Clausecker et al. (see References) presents a fast SIMD algorithm for positional population counts based on carry-save adder (CSA) networks. It has two key accumulation phases: an intermediate accumulation that reduces the high-weight bit-vector a16 into counter vectors, and a final accumulation that transposes and reduces the four bit-vectors a8, a4, a2, and a1 into counters.
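To make the roles of a1 through a16 concrete, the following is a minimal scalar sketch in Python, not the authors' SIMD code. It assumes 8-bit words, models each vector register as a Python integer of `vec_bytes` bytes, and uses a generic Harley-Seal-style schedule of 16 vectors per block; the function names (`csa`, `add_weighted`, `pospopcnt_csa`) and the zero-padding of the tail are illustrative assumptions. The per-bit loop in `add_weighted` stands in for the transposition and reduction that the real accumulation steps would perform with vector instructions.

```python
def csa(a, b, c):
    """Bitwise carry-save adder: a + b + c == s + 2 * carry in every bit lane."""
    return a ^ b ^ c, (a & b) | (a & c) | (b & c)


def pospopcnt_reference(data):
    """Naive positional population count for 8-bit words."""
    counts = [0] * 8
    for byte in data:
        for i in range(8):
            counts[i] += (byte >> i) & 1
    return counts


def add_weighted(counts, vec, weight):
    """Add `weight` to counts[k % 8] for every set bit k of the bit-vector `vec`."""
    k = 0
    while vec:
        if vec & 1:
            counts[k % 8] += weight
        vec >>= 1
        k += 1


def pospopcnt_csa(data, vec_bytes=16):
    """CSA-based count; `vec_bytes` plays the role of the hardware vector length."""
    block = 16 * vec_bytes
    padded = bytes(data) + bytes(-len(data) % block)  # zero padding adds no counts
    acc = [0, 0, 0, 0]  # bit-vectors a1, a2, a4, a8
    counts = [0] * 8

    def reduce(vs):
        # 2 inputs feed a1, 4 feed a2, 8 feed a4, 16 feed a8.
        level = len(vs).bit_length() - 2
        if len(vs) == 2:
            x, y = vs
        else:
            half = len(vs) // 2
            x, y = reduce(vs[:half]), reduce(vs[half:])
        acc[level], carry = csa(acc[level], x, y)
        return carry

    for off in range(0, len(padded), block):
        vs = [
            int.from_bytes(padded[off + i * vec_bytes: off + (i + 1) * vec_bytes], "little")
            for i in range(16)
        ]
        a16 = reduce(vs)
        add_weighted(counts, a16, 16)  # intermediate accumulation of a16
    for level, a in enumerate(acc):
        add_weighted(counts, a, 1 << level)  # final accumulation of a1, a2, a4, a8
    return counts
```

Because every CSA step preserves the per-lane sum, the result is the same for any `vec_bytes`; what changes with the vector length on real hardware is how the two accumulation calls must be scheduled, which is the open question here.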

For fixed-width SIMD (AVX2/AVX-512/ASIMD), the authors design transposition/reduction schedules specific to each vector length. For variable-length vector extensions (e.g., AArch64/SVE, RISC-V/RVV), however, the native vector length is chosen by each hardware implementation rather than fixed by the instruction set, which complicates designing a single efficient accumulation procedure.

The authors attempted an AArch64/SVE implementation and note difficulties in making the intermediate and final accumulation steps effective across varying vector lengths, outlining two unsatisfactory approaches: either writing separate accumulation procedures for each possible vector length or operating on 128-bit chunks with additional complexity. They also point out missing SVE instructions (e.g., bsl/bit/bif) that further hinder performance.
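The second approach can be pictured with a small Python sketch, under the assumption that a vector of any supported length is cut into 128-bit chunks and each chunk is handed to one fixed-width routine. This is an illustration of the idea only, not the authors' code; `accumulate_fixed_128`, `accumulate_chunked`, and `bitwise_select` are hypothetical names. The last helper shows the generic identity by which a bitwise select (the operation bsl performs in ASIMD) can be composed from XOR and AND when no single instruction is available.

```python
CHUNK_BYTES = 16  # 128 bits; SVE vector lengths are multiples of 128 bits


def accumulate_fixed_128(counts, chunk, weight):
    """Fixed-width routine: add `weight` per set bit of one 128-bit chunk (8-bit words)."""
    for k in range(CHUNK_BYTES * 8):
        if (chunk >> k) & 1:
            counts[k % 8] += weight


def accumulate_chunked(counts, vec, vec_bytes, weight):
    """Apply the 128-bit routine to each chunk of a vector of `vec_bytes` bytes."""
    assert vec_bytes % CHUNK_BYTES == 0
    mask = (1 << (CHUNK_BYTES * 8)) - 1
    for _ in range(vec_bytes // CHUNK_BYTES):
        accumulate_fixed_128(counts, vec & mask, weight)
        vec >>= CHUNK_BYTES * 8


def bitwise_select(mask, a, b):
    """Take bits of `a` where `mask` is 1 and bits of `b` elsewhere."""
    return b ^ ((a ^ b) & mask)
```

The chunk loop is where the additional complexity comes from: its trip count depends on the vector length, so the per-chunk work cannot be unrolled once for all implementations.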

References

An implementation was attempted for AArch64/SVE, but problems quickly became apparent: while CSA schedule and head processing are very straightforward to implement, it is not clear to the authors how the intermediate and final accumulation steps can be carried out effectively.

— Faster Positional-Population Counts for AVX2, AVX-512, and ASIMD (2412.16370 - Clausecker et al., 20 Dec 2024) in Discussion, Subsubsection “Variable-length Vectors”
