Effective accumulation for positional population count on variable-length vector extensions (e.g., AArch64/SVE)
Develop an effective implementation of two steps of the Clausecker–Lemire–Schintke positional population count algorithm for variable-length vector architectures such as the AArch64 Scalable Vector Extension (SVE), where the hardware vector length varies across implementations. The two steps are the intermediate accumulation step, which reduces the a16 bit-vector into the counter vectors, and the final accumulation step, which transposes and reduces the bit-vectors a8, a4, a2, and a1 into the counters.
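To make the two steps concrete, the following is a minimal scalar reference model in Python, not an SVE implementation and not the authors' code. It assumes 8-bit positional counts and models a vector as a byte string of arbitrary length, so nothing in it depends on a fixed vector width. All function names (`accumulate`, `intermediate_accumulate`, `final_accumulate`, `csa`, `naive_pospopcnt`) are hypothetical and chosen for illustration only. The sketch states what the accumulation steps must compute; doing this efficiently with scalable vector instructions is the open question.

```python
from typing import List, Tuple


def accumulate(counters: List[int], vec: bytes, weight: int) -> None:
    """Add `weight` to counters[j] for every byte of `vec` whose bit j is set."""
    for byte in vec:
        for j in range(8):
            counters[j] += weight * ((byte >> j) & 1)


def intermediate_accumulate(counters: List[int], a16: bytes) -> None:
    """Intermediate step: each set bit in a16 stands for 16 occurrences."""
    accumulate(counters, a16, 16)


def final_accumulate(counters: List[int], a8: bytes, a4: bytes,
                     a2: bytes, a1: bytes) -> None:
    """Final step: fold the remaining bit-vectors in with weights 8, 4, 2, 1."""
    for weight, vec in ((8, a8), (4, a4), (2, a2), (1, a1)):
        accumulate(counters, vec, weight)


def csa(a: bytes, b: bytes, c: bytes) -> Tuple[bytes, bytes]:
    """Bitwise carry-save adder: per bit, a + b + c == 2 * carry + total."""
    total = bytes(x ^ y ^ z for x, y, z in zip(a, b, c))
    carry = bytes((x & y) | (x & z) | (y & z) for x, y, z in zip(a, b, c))
    return carry, total


def naive_pospopcnt(data: bytes) -> List[int]:
    """Reference result: counters[j] is the number of bytes with bit j set."""
    counters = [0] * 8
    accumulate(counters, data, 1)
    return counters


if __name__ == "__main__":
    vl = 48  # assumed vector length in bytes; any length works in this model
    v0, v1, v2 = bytes([0x0F] * vl), bytes([0x33] * vl), bytes([0x55] * vl)
    a2, a1 = csa(v0, v1, v2)  # one CSA level yields weight-2 and weight-1 vectors
    zero = bytes(vl)
    counters = [0] * 8
    final_accumulate(counters, zero, zero, a2, a1)
    assert counters == naive_pospopcnt(v0 + v1 + v2)
    print(counters)
```

The inner loops over bytes and bit positions are exactly the transposition and reduction that a vector implementation has to express without knowing the vector length at compile time.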
References
An implementation was attempted for AArch64/SVE, but problems quickly became apparent: while CSA schedule and head processing are very straightforward to implement, it is not clear to the authors how the intermediate and final accumulation steps can be carried out effectively.
— Faster Positional-Population Counts for AVX2, AVX-512, and ASIMD
(2412.16370 - Clausecker et al., 20 Dec 2024) in Discussion, Subsubsection “Variable-length Vectors”