SIMD Compression and the Intersection of Sorted Integers

Published 24 Jan 2014 in cs.IR, cs.DB, and cs.PF | (1401.6399v13)

Abstract: Sorted lists of integers are commonly used in inverted indexes and database systems. They are often compressed in memory. We can use the SIMD instructions available in common processors to boost the speed of integer compression schemes. Our S4-BP128-D4 scheme uses as little as 0.7 CPU cycles per decoded integer while still providing state-of-the-art compression. However, if the subsequent processing of the integers is slow, the effort spent on optimizing decoding speed can be wasted. To show that it does not have to be so, we (1) vectorize and optimize the intersection of posting lists; (2) introduce the SIMD Galloping algorithm. We exploit the fact that one SIMD instruction can compare 4 pairs of integers at once. We experiment with two TREC text collections, GOV2 and ClueWeb09 (Category B), using logs from the TREC million-query track. We show that using only the SIMD instructions ubiquitous in all modern CPUs, our techniques for conjunctive queries can double the speed of a state-of-the-art approach.


Citations (93)


Summary

The paper introduces a SIMD-based compression scheme, S4-BP128-D4, that decodes at as little as 0.7 CPU cycles per 32-bit integer while keeping competitive compression ratios.
It presents vectorized intersection algorithms, including SIMD Galloping, which can double the speed of a state-of-the-art approach on conjunctive queries.
Experiments on the TREC collections GOV2 and ClueWeb09 (Category B) confirm these speedups, which benefit search engines and database systems.


The paper "SIMD Compression and the Intersection of Sorted Integers" by Lemire, Boytsov, and Kurz uses SIMD (Single Instruction, Multiple Data) instructions to speed up both the compression and the intersection of sorted integer lists. Such lists are common in inverted indexes and database systems, where they are often kept compressed in memory and where decoding and intersection speed directly affect query performance.

Key Contributions

The first contribution is faster integer compression through SIMD instructions, exemplified by the S4-BP128-D4 scheme. This scheme decodes at a cost of as little as 0.7 CPU cycles per 32-bit integer while maintaining competitive compression ratios. A key technique is to integrate bit unpacking with differential coding, so that a block is decoded in one pass rather than being unpacked first and then traversed again to rebuild the original values.
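The idea of fusing bit unpacking with differential decoding can be illustrated with a minimal scalar sketch in Python. It is not the authors' implementation: it assumes that "D4" means each value is differenced against the value four positions earlier (so that four running sums could occupy the four lanes of a 128-bit register), and the function names `d4_encode`, `pack`, and `unpack_and_decode` are hypothetical. A Python integer stands in for the packed block.

```python
LANES = 4  # assumption: four 32-bit lanes per 128-bit SIMD register


def d4_encode(values, initial=(0, 0, 0, 0)):
    """Return deltas where delta[i] = values[i] - values[i - 4] (assumed D4 layout)."""
    prev = list(initial)
    deltas = []
    for i, v in enumerate(values):
        deltas.append(v - prev[i % LANES])
        prev[i % LANES] = v
    return deltas


def pack(deltas, bit_width):
    """Pack non-negative deltas into one integer, bit_width bits per delta."""
    word = 0
    for i, d in enumerate(deltas):
        if not 0 <= d < (1 << bit_width):
            raise ValueError("delta does not fit in bit_width bits")
        word |= d << (i * bit_width)
    return word


def unpack_and_decode(word, count, bit_width, initial=(0, 0, 0, 0)):
    """Single pass: extract each delta and add it to its lane's running value."""
    mask = (1 << bit_width) - 1
    lanes = list(initial)
    out = []
    for i in range(count):
        delta = (word >> (i * bit_width)) & mask  # bit unpacking
        lanes[i % LANES] += delta                 # differential decoding
        out.append(lanes[i % LANES])
    return out


values = [3, 7, 8, 15, 20, 21, 30, 42]
deltas = d4_encode(values)
width = max(deltas).bit_length()
word = pack(deltas, width)
assert unpack_and_decode(word, len(values), width) == values
```

In this sketch the decoder never materializes an intermediate array of deltas, which is the property the fused approach relies on. A real SIMD version would update all four lanes with one vector addition per step.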

Algorithmic Innovations

The paper vectorizes and optimizes the intersection of posting lists and introduces the SIMD Galloping algorithm. Both exploit the fact that a single SIMD instruction can compare 4 pairs of integers at once. On conjunctive queries, the authors report that these techniques can double the speed of a state-of-the-art approach.
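A galloping intersection of this kind might be sketched as follows in scalar Python. This is an illustration under stated assumptions, not the paper's algorithm: the long list is viewed as blocks of 4 values, the search gallops and then binary-searches over blocks, and the final membership test on one block stands in for the single 4-wide SIMD comparison. The names `block_last` and `gallop_intersect` are hypothetical, and both inputs are assumed to be sorted lists of distinct integers.

```python
LANES = 4  # assumption: one SIMD instruction compares 4 pairs of integers


def block_last(long_list, b):
    """Largest value in block b, where a block holds LANES consecutive values."""
    return long_list[min((b + 1) * LANES, len(long_list)) - 1]


def gallop_intersect(short_list, long_list):
    """Intersect a short sorted list with a much longer sorted list."""
    out = []
    n_blocks = (len(long_list) + LANES - 1) // LANES
    lo = 0  # first block that may still contain the current target
    for target in short_list:
        # Galloping phase: double the stride until a block's last value >= target.
        step = 1
        hi = lo
        while hi < n_blocks and block_last(long_list, hi) < target:
            lo = hi + 1
            hi += step
            step *= 2
        hi = min(hi, n_blocks - 1)
        # Binary search for the first block in [lo, hi] whose last value >= target.
        while lo < hi:
            mid = (lo + hi) // 2
            if block_last(long_list, mid) < target:
                lo = mid + 1
            else:
                hi = mid
        if lo >= n_blocks:
            break  # every remaining target exceeds the long list
        # Stand-in for one SIMD comparison of the target against 4 values.
        block = long_list[lo * LANES:(lo + 1) * LANES]
        if any(x == target for x in block):
            out.append(target)
    return out


evens = list(range(0, 200, 2))
assert gallop_intersect([5, 8, 100, 250], evens) == [8, 100]
```

Galloping pays off when one list is much shorter than the other, because each lookup costs a logarithmic number of probes instead of a linear scan. Searching over blocks rather than single values is what leaves a 4-value comparison as the last step.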

Experimental Validation

Experiments on two TREC text collections, GOV2 and ClueWeb09 (Category B), using logs from the TREC million-query track, demonstrate the practical impact of these optimizations. The paper reports that the SIMD-optimized techniques substantially speed up posting-list intersection without compromising compression effectiveness.

Implications and Future Directions

The results suggest that wider or richer SIMD instruction sets could speed up integer list processing further. Practically, these findings can be applied to improve the responsiveness of search engines and databases. Future research might explore newer SIMD instruction sets (e.g., AVX2, AVX-512) to push integer compression and intersection efficiency further.

Conclusion

The research provides a valuable and rigorous evaluation of SIMD capabilities for integer list processing. The results underscore the importance of optimizations at both the algorithmic and hardware levels in achieving superior performance in data-intensive applications. These advances point towards more efficient querying processes in large-scale data systems, setting a benchmark for future computational improvements in database technologies.
