Balanced Learned Sort: a new learned model for fast and balanced item bucketing (2407.00734v2)
Abstract: This paper aims to better understand the strengths and limitations of learning-based approaches to sequentially sorting numerical data, via two main research steps. First, we study different learned models for distribution-based sorting, starting from some known ones (i.e., two-layer RMI or simple linear models) and then introducing some novel models that either improve the two-layer RMI or are entirely new in their algorithmic structure, and are therefore space-efficient, monotonic, and very fast at building balanced buckets. We test these models on 11 synthetic datasets of 200M 64-bit floating-point items drawn from different distributions, thereby deriving hints about their ultimate performance and usefulness in designing a sorting algorithm. Based on these findings, we select the best of these models and plug them into a new learning-based algorithmic scheme, devising three new sorters that we test against 6 other sequential sorters (5 classic and 1 learned, known and new ones) on 33 datasets (11 synthetic and 22 real), whose size is up to 800M items. Our experimental results show that our learned sorters achieve superior performance on 31 of the 33 datasets (synthetic and real). In conclusion, these experimental results provide, on the one hand, a comprehensive answer to the main question: Which algorithmic structure for distribution-based sorting is suited to leveraging a learned model in order to achieve efficient performance? On the other hand, they leave open several other research and engineering questions about the design of a high-performance sequential sorter that is robust across different input distributions.
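As a rough illustration of the distribution-based sorting idea the abstract refers to, the following minimal Python sketch uses a simple linear (min-max) model to map each key to a bucket, sorts each bucket, and concatenates the results. It is not the paper's algorithm or any of its models: the function names `bucketize` and `linear_model_sort` and the `num_buckets` parameter are hypothetical, chosen only for this example, and the sketch assumes finite keys whose range `hi - lo` does not overflow.

```python
from typing import List


def bucketize(items: List[float], num_buckets: int = 64) -> List[List[float]]:
    """Distribute items into buckets using a monotone linear model.

    Illustrative assumption: the "model" is a min-max linear map from a key
    to a bucket index. Keys must be finite and hi - lo must not overflow.
    """
    buckets: List[List[float]] = [[] for _ in range(num_buckets)]
    if not items:
        return buckets
    lo, hi = min(items), max(items)
    if lo == hi:
        # All keys are equal: any single bucket will do.
        buckets[0].extend(items)
        return buckets
    span = hi - lo
    for x in items:
        # The estimated position of x in [0, 1] is scaled to a bucket index.
        # Clamping sends x == hi to the last bucket instead of out of range.
        idx = min(int((x - lo) / span * num_buckets), num_buckets - 1)
        buckets[idx].append(x)
    return buckets


def linear_model_sort(items: List[float], num_buckets: int = 64) -> List[float]:
    """Sort by bucketing with the linear model, then sorting each bucket."""
    out: List[float] = []
    for bucket in bucketize(items, num_buckets):
        bucket.sort()  # each bucket is sorted independently
        out.extend(bucket)
    return out
```

Because the mapping from key to bucket index is monotonic, every key in one bucket is no larger than every key in the next, so concatenating the sorted buckets gives a sorted sequence. On skewed inputs this linear map can leave most keys in a few buckets, which is the kind of imbalance that a better-fitting model would be expected to reduce.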