
Parallel Prefix Sum with SIMD (2312.14874v1)

Published 22 Dec 2023 in cs.DC

Abstract: The prefix sum operation is a useful primitive with a broad range of applications. For database systems, it is a building block of many important operators including join, sort and filter queries. In this paper, we study different methods of computing prefix sums with SIMD instructions and multiple threads. For SIMD, we implement and compare horizontal and vertical computations, as well as a theoretically work-efficient balanced tree version using gather/scatter instructions. With multithreading, the memory bandwidth can become the bottleneck of prefix sum computations. We propose a new method that partitions data into cache-sized smaller partitions to achieve better data locality and reduce bandwidth demands from RAM. We also investigate four different ways of organizing the computation sub-procedures, which have different performance and usability characteristics. In the experiments we find that the most efficient prefix sum computation using our partitioning technique is up to 3x faster than two standard library implementations that already use SIMD and multithreading.
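As a rough illustration of two ideas named in the abstract, the sketch below shows (1) a scalar emulation of a horizontal, shift-and-add prefix sum over four lanes, and (2) a two-pass prefix sum that works through the input one partition at a time so that the second pass re-reads data that is likely still in cache. This is a minimal single-threaded sketch under stated assumptions, not the paper's implementation: the names `scan4` and `partitioned_prefix_sum`, the parameters `partition_size` and `num_chunks`, and the use of chunks as stand-ins for threads are all hypothetical.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: inclusive prefix sum of 4 lanes using log2(4) = 2
// shift-and-add steps. Plain loops emulate what a SIMD lane shift plus a
// vector add would do in a horizontal computation.
std::array<std::int32_t, 4> scan4(std::array<std::int32_t, 4> v) {
    for (std::size_t shift = 1; shift < 4; shift *= 2) {
        std::array<std::int32_t, 4> shifted{};  // zero-filled lanes
        for (std::size_t i = shift; i < 4; ++i) shifted[i] = v[i - shift];
        for (std::size_t i = 0; i < 4; ++i) v[i] += shifted[i];
    }
    return v;
}

// Hypothetical sketch: two-pass inclusive prefix sum, one partition at a time.
// partition_size is assumed to be chosen so a partition fits in cache;
// num_chunks stands in for the number of threads (run sequentially here).
std::vector<std::int64_t> partitioned_prefix_sum(
        const std::vector<std::int32_t>& in,
        std::size_t partition_size,
        std::size_t num_chunks) {
    assert(partition_size > 0 && num_chunks > 0);
    std::vector<std::int64_t> out(in.size());
    std::int64_t carry = 0;  // sum of all elements in earlier partitions

    for (std::size_t begin = 0; begin < in.size(); begin += partition_size) {
        const std::size_t end = std::min(begin + partition_size, in.size());
        const std::size_t chunk = (end - begin + num_chunks - 1) / num_chunks;

        // Pass 1: each chunk computes only its local sum.
        std::vector<std::int64_t> chunk_sum(num_chunks, 0);
        for (std::size_t c = 0; c < num_chunks; ++c) {
            const std::size_t lo = std::min(begin + c * chunk, end);
            const std::size_t hi = std::min(lo + chunk, end);
            for (std::size_t i = lo; i < hi; ++i) chunk_sum[c] += in[i];
        }

        // Pass 2: each chunk starts from the sum of everything before it
        // and writes its prefix sums; the partition is re-read while it is
        // presumably still cache-resident.
        std::int64_t offset = carry;
        for (std::size_t c = 0; c < num_chunks; ++c) {
            const std::size_t lo = std::min(begin + c * chunk, end);
            const std::size_t hi = std::min(lo + chunk, end);
            std::int64_t running = offset;
            for (std::size_t i = lo; i < hi; ++i) {
                running += in[i];
                out[i] = running;
            }
            offset += chunk_sum[c];
        }
        carry = offset;
    }
    return out;
}
```

In a real multithreaded version the chunks of each pass would run concurrently with a synchronization point between the passes, and the inner loops would use actual SIMD instructions; both are omitted here to keep the sketch portable.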
