- The paper introduces vqsort, a vectorized Quicksort that supports seven distinct instruction sets and, when integrated with the parallel sorter ips4o, achieves a geometric mean speedup of 1.59.
- It employs hardware-specific intrinsics, robust pivot sampling, and compact sorting networks to optimize performance and mitigate adversarial inputs.
- The algorithm demonstrates exceptional performance portability across x86, Arm, and RISC-V platforms, making it ideal for high-performance sorting applications.
The paper "Vectorized and Performance-Portable Quicksort," authored by Jan Wassenberg, Mark Blacher, Joachim Giesen, and Peter Sanders, delineates the development of a novel Quicksort algorithm named vqsort. This algorithm leverages vector instruction sets to achieve unprecedented speed and compatibility across platforms, potentially rendering previous Quicksort implementations obsolete in high-performance sorting applications.
Core Contributions
The paper identifies several deficiencies in existing Quicksort implementations, predominantly their single-threaded nature and their specificity to particular instruction sets. To address these, the researchers introduce vqsort, which, when integrated with the parallel sorter ips4o, demonstrates a geometric mean speedup of 1.59. The algorithm operates across seven distinct instruction sets, including Arm SVE and RISC-V V.
Key contributions of the paper are:
- Generality: Supports 16- to 128-bit integer and floating-point keys and can sort in either ascending or descending order.
- Performance Portability: Consistent performance across diverse architectures such as x86, Arm, and RISC-V makes it practical for heterogeneous computing environments.
- Engineering Innovations: Compact, transpose-free sorting networks for sorting small arrays, and a vector-friendly pivot sampling method that is robust against adversarial inputs.
Technical Methodology
The work employs the Highway library, which wraps platform-specific intrinsics behind a portable API so that vector instructions are mapped efficiently onto each target architecture. Leveraging vector instructions keeps the algorithm performant: throughput is high and energy consumption is lower because fewer instructions are executed per sorted element.
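As a rough, hypothetical illustration (not code from the paper or the library's documentation), the sketch below uses Highway's portable API: the same source maps onto the supported x86, Arm, and RISC-V vector instruction sets, with `ScalableTag` selecting the vector width of the compiled-for target. The function and variable names here are placeholders.

```cpp
#include <cstddef>
#include <cstdint>
#include "hwy/highway.h"

namespace hn = hwy::HWY_NAMESPACE;

// Hypothetical example: clamp every key to an upper bound using portable
// Highway ops. The vector width is chosen per target at compile time.
void ClampKeys(int64_t* keys, size_t n, int64_t upper) {
  const hn::ScalableTag<int64_t> d;       // widest available vector of int64_t
  const auto bound = hn::Set(d, upper);   // broadcast the bound to all lanes
  size_t i = 0;
  for (; i + hn::Lanes(d) <= n; i += hn::Lanes(d)) {
    const auto v = hn::LoadU(d, keys + i);          // unaligned load
    hn::StoreU(hn::Min(v, bound), d, keys + i);     // per-lane clamp
  }
  for (; i < n; ++i) {                              // scalar tail
    if (keys[i] > upper) keys[i] = upper;
  }
}
```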
Some of the technical components include:
- Partitioning: Improves upon prior AVX-512 strategies with enhanced portability and execution speed, utilizing Highway's CompressStore op (see the partition sketch after this list).
- Pivot Selection: Utilizes a robust sampling strategy, incorporating randomness to prevent adversarial performance degradation, with an efficient fallback to Heapsort that keeps worst-case complexity bounded (see the fallback sketch below).
- Sorting Networks: Employs a vectorized approach to sorting networks, avoiding matrix transposition and gaining efficiency from Bitonic Merge networks (the compare-exchange building block is sketched below).
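To make the partition idea concrete, here is a minimal, hypothetical sketch that handles one vector's worth of keys with Highway's CompressStore into separate output buffers; the paper's actual partition is in-place and considerably more elaborate. Names are placeholders, not the paper's code.

```cpp
#include <cstddef>
#include <cstdint>
#include "hwy/highway.h"

namespace hn = hwy::HWY_NAMESPACE;

// Hypothetical sketch: partition one vector's worth of keys around a pivot.
// Keys below the pivot are packed contiguously into out_lo, the rest into
// out_hi. Returns how many keys went to the low side.
size_t PartitionBlock(const int64_t* in, int64_t pivot,
                      int64_t* out_lo, int64_t* out_hi) {
  const hn::ScalableTag<int64_t> d;
  const auto v = hn::LoadU(d, in);                    // load Lanes(d) keys
  const auto lo_mask = hn::Lt(v, hn::Set(d, pivot));  // lanes with key < pivot
  const size_t num_lo = hn::CompressStore(v, lo_mask, d, out_lo);
  hn::CompressStore(v, hn::Not(lo_mask), d, out_hi);  // remaining lanes
  return num_lo;
}
```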
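The Heapsort fallback can be pictured with an introsort-style depth bound. The following sketch uses only the C++ standard library and is not the paper's implementation; the simple middle-element pivot and the names are placeholders (vqsort instead samples pivots randomly).

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical illustration of a worst-case guard: when recursion becomes too
// deep because partitions are repeatedly unbalanced, switch to Heapsort so
// that total work stays O(n log n).
void GuardedQuicksort(int64_t* first, int64_t* last, int depth_left) {
  if (last - first <= 1) return;
  if (depth_left == 0) {
    std::make_heap(first, last);
    std::sort_heap(first, last);  // Heapsort fallback bounds the worst case
    return;
  }
  // Placeholder pivot: middle element. A randomized, sampled pivot (as in the
  // paper) is what defeats adversarial inputs.
  const int64_t pivot = first[(last - first) / 2];
  int64_t* mid = std::partition(first, last,
                                [pivot](int64_t k) { return k < pivot; });
  // Group keys equal to the pivot so they are never recursed on.
  int64_t* mid_hi = std::partition(mid, last,
                                   [pivot](int64_t k) { return !(pivot < k); });
  GuardedQuicksort(first, mid, depth_left - 1);
  GuardedQuicksort(mid_hi, last, depth_left - 1);
}
```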
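Transpose-free sorting networks build on a vectorized compare-exchange: corresponding lanes of two vectors are ordered with one Min and one Max, so an entire column of comparators costs two vector operations. A minimal, hypothetical sketch (again not the paper's code):

```cpp
#include <cstdint>
#include "hwy/highway.h"

namespace hn = hwy::HWY_NAMESPACE;

// Hypothetical compare-exchange column of a sorting network: lane i of `a`
// is compared with lane i of `b`; the smaller key ends up in a[i] and the
// larger in b[i].
void CompareExchangeColumn(int64_t* a, int64_t* b) {
  const hn::ScalableTag<int64_t> d;
  const auto va = hn::LoadU(d, a);
  const auto vb = hn::LoadU(d, b);
  hn::StoreU(hn::Min(va, vb), d, a);  // per-lane minima
  hn::StoreU(hn::Max(va, vb), d, b);  // per-lane maxima
}
```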
Numerical Results
The algorithm demonstrates substantial numerical improvements over existing methods:
- It is up to 20 times faster than standard library sorting algorithms for non-tuple keys on CPUs.
- In a single-core evaluation, vqsort achieves 2.89 times the throughput of ips4o when sorting 1M 64-bit integers, further demonstrating its efficacy under vectorized conditions.
Implications and Future Directions
The introduction of vqsort has both practical and theoretical implications. Practically, it offers a substantial performance gain for applications that require intensive sorting, particularly in the database and information retrieval sectors. Theoretically, it provides a template for future research into vectorized algorithms, encouraging exploration of SIMD utilization within other classic algorithms.
Moving forward, the research community may expand on vqsort's approach, for example by incorporating custom comparators in vectorized contexts or by extending support to other key types such as tuples and complex objects. Further empirical evaluation on newly emerging hardware platforms will be needed to validate and enhance the algorithm's adaptability and efficiency.
In sum, vqsort significantly advances the field of sorting algorithms by bridging the gap between performance and portability, positioning itself as a leading method for high-efficiency sorting across diverse computing architectures.