Effective accumulation for positional population count on variable-length vector extensions (e.g., AArch64/SVE)

Develop an effective implementation of the two accumulation steps of the Clausecker–Lemire–Schintke positional population count algorithm for variable-length vector architectures such as the AArch64 Scalable Vector Extension (SVE), where the hardware vector length varies across implementations. The two steps are the intermediate accumulation, which reduces the a16 bit-vector into the counter vectors, and the final accumulation, which transposes and reduces the bit-vectors a8, a4, a2, and a1 into the counters.

Background

The Clausecker–Lemire–Schintke paper presents a fast SIMD algorithm for positional population counts based on carry-save adder (CSA) networks. The algorithm has two key accumulation phases: an intermediate accumulation that reduces the high-weight bit-vector a16 into counter vectors, and a final accumulation that transposes and reduces the four bit-vectors (a8, a4, a2, a1) into counters.
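The role of the two accumulation phases can be illustrated with a scalar model. The sketch below is not the authors' SIMD code: it treats one Python integer as a single lane, so the transposition that the real final accumulation needs does not appear at all. The helper names (csa, reduce8, add_bits, pospopcnt_csa) and the 16-word block layout are assumptions made for illustration.

```python
def csa(a, b, c):
    """Carry-save adder on bit-vectors.

    Returns (carry, sum) such that, at every bit position,
    a + b + c == 2 * carry + sum.
    """
    u = a ^ b
    return (a & b) | (u & c), u ^ c


def reduce8(a1, a2, a4, d):
    """Fold eight input words into the running a1/a2/a4 bit-vectors.

    Returns the updated (a1, a2, a4) plus one bit-vector of weight 8.
    """
    t2a, a1 = csa(a1, d[0], d[1])
    t2b, a1 = csa(a1, d[2], d[3])
    f4a, a2 = csa(a2, t2a, t2b)
    t2a, a1 = csa(a1, d[4], d[5])
    t2b, a1 = csa(a1, d[6], d[7])
    f4b, a2 = csa(a2, t2a, t2b)
    e8, a4 = csa(a4, f4a, f4b)
    return a1, a2, a4, e8


def add_bits(counts, vec, weight):
    """Add weight to counts[i] for every set bit i of vec."""
    for i in range(len(counts)):
        counts[i] += weight * ((vec >> i) & 1)


def pospopcnt_reference(words, width=8):
    """Naive positional population count, used as ground truth."""
    counts = [0] * width
    for w in words:
        add_bits(counts, w, 1)
    return counts


def pospopcnt_csa(words, width=8):
    """Positional popcount via a CSA network over 16-word blocks.

    Assumes every word fits in `width` bits.
    """
    counts = [0] * width
    a1 = a2 = a4 = a8 = 0
    full = len(words) - len(words) % 16
    for base in range(0, full, 16):
        blk = words[base:base + 16]
        a1, a2, a4, e8a = reduce8(a1, a2, a4, blk[:8])
        a1, a2, a4, e8b = reduce8(a1, a2, a4, blk[8:])
        a16, a8 = csa(a8, e8a, e8b)
        add_bits(counts, a16, 16)      # intermediate accumulation
    for vec, weight in ((a8, 8), (a4, 4), (a2, 2), (a1, 1)):
        add_bits(counts, vec, weight)  # final accumulation
    for w in words[full:]:
        add_bits(counts, w, 1)         # leftover words, handled naively
    return counts
```

In this model the intermediate accumulation runs once per block, while the final accumulation runs once at the end. In a real SIMD implementation both steps must move bits from bit-vectors into per-position counter lanes, and that data movement is the part that depends on the vector length.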

For fixed-width SIMD (AVX2/AVX-512/ASIMD), the authors design vector-length-specific transposition/reduction schedules. For variable-length vector extensions (e.g., AArch64/SVE, RISC-V/RVV), however, the native vector length is chosen by each hardware implementation rather than fixed by the instruction set (SVE permits any multiple of 128 bits from 128 to 2048), which complicates designing a single efficient accumulation procedure.

The authors attempted an AArch64/SVE implementation and note difficulties in making the intermediate and final accumulation steps effective across varying vector lengths, outlining two unsatisfactory approaches: either writing separate accumulation procedures for each possible vector length or operating on 128-bit chunks with additional complexity. They also point out missing SVE instructions (e.g., bsl/bit/bif) that further hinder performance.
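As a rough illustration of the second approach, the sketch below walks a bit-vector of arbitrary length in 16-byte (128-bit) chunks, so the same loop works for any vector length that is a multiple of 128 bits. It is a scalar model under stated assumptions, not the authors' procedure: the names accumulate_chunked and bitwise_select are hypothetical, the vector is modeled as a Python bytes object with one 8-bit element per byte, and the additional complexity the authors mention is not modeled. The second helper shows the generic meaning of a bitwise select and how it can be built from plain XOR and AND operations when no single select instruction is available.

```python
def accumulate_chunked(vec, weight, counters):
    """Add weight * (bit i of each byte of vec) to counters[i].

    The vector is processed 16 bytes at a time, so the loop is
    independent of the overall vector length.
    """
    if len(vec) % 16 != 0:
        raise ValueError("vector length must be a multiple of 16 bytes")
    for off in range(0, len(vec), 16):
        chunk = vec[off:off + 16]          # one 128-bit chunk
        for byte in chunk:
            for i in range(8):
                counters[i] += weight * ((byte >> i) & 1)


def bitwise_select(mask, a, b):
    """Take bits of a where mask is 1 and bits of b where mask is 0.

    Equivalent to (a & mask) | (b & ~mask), written with two XORs
    and one AND instead of a single select operation.
    """
    return b ^ ((a ^ b) & mask)
```

For example, calling accumulate_chunked with a 32-byte and a 256-byte vector uses the same code path, only with a different number of chunk iterations.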

References

An implementation was attempted for AArch64/SVE, but problems quickly became apparent: while CSA schedule and head processing are very straightforward to implement, it is not clear to the authors how the intermediate and final accumulation steps can be carried out effectively.

Clausecker et al., "Faster Positional-Population Counts for AVX2, AVX-512, and ASIMD", arXiv:2412.16370, 20 Dec 2024, Discussion, Subsubsection “Variable-length Vectors”.