Cause of GloVe-25 sensitivity to the CrackIVF min_pts parameter

Determine the underlying reason that increasing the CrackIVF heuristic parameter min_pts from 2 to 32 improves the Queries-Per-Second vs. Recall trade-off on the GloVe-25 dataset. Ascertain whether the improvement is primarily due to the 25-dimensional embeddings causing slower growth of the build-operations budget and thus excessive local imbalances at min_pts=2, or whether other factors—such as convergence to a smaller final number of partitions or the dataset’s query distribution—are responsible.

Background

CrackIVF employs heuristic rules to decide where and when to apply two build operations (CRACK and REFINE). One such heuristic parameter, min_pts, specifies the minimum number of points that must be stolen by a new crack to be buffered as a candidate partition. Across most datasets tested, the default min_pts=2 works well, but on the GloVe-25 dataset (25-dimensional embeddings) the authors observed notable performance improvements (QPS-Recall) when min_pts was increased to 32.
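The role of min_pts as a filter can be illustrated with a toy sketch. This is not the CrackIVF implementation: the function names `stolen_count` and `should_buffer` are hypothetical, and it assumes that a point counts as "stolen" when it lies strictly closer to the candidate centroid than to every existing centroid.

```python
import math

def stolen_count(points, centroids, candidate):
    """Count points strictly closer to `candidate` than to any existing centroid.

    Assumption: this is what "stolen by a new crack" means; the paper's
    exact rule may differ.
    """
    count = 0
    for p in points:
        nearest_existing = min(math.dist(p, c) for c in centroids)
        if math.dist(p, candidate) < nearest_existing:
            count += 1
    return count

def should_buffer(points, centroids, candidate, min_pts=2):
    """Keep the candidate partition only if it would steal at least min_pts points."""
    return stolen_count(points, centroids, candidate) >= min_pts

# Toy data: one existing centroid at the origin, a candidate crack near x = 10.
points = [(0.0, 0.0), (1.0, 0.0), (9.0, 0.0), (10.0, 0.0), (11.0, 0.0)]
centroids = [(0.0, 0.0)]
candidate = (10.0, 0.0)

print(stolen_count(points, centroids, candidate))               # 3
print(should_buffer(points, centroids, candidate, min_pts=2))   # True
print(should_buffer(points, centroids, candidate, min_pts=32))  # False
```

Under this reading, a larger min_pts rejects small candidate partitions, so fewer cracks are buffered for the same query workload.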

The paper proposes two hypotheses for this dataset-specific sensitivity. First, low dimensionality reduces search time and therefore slows accumulation of the build budget, leaving too many local imbalances when min_pts=2. Second, the improved run converges to a smaller final partition count, although the authors suspect that, given the uniform query distribution and the dataset scale, partition count is not the limiting factor. The precise cause remains unresolved.

References

Although we can not make a definite statement, we hypothesize that this can be attributed to the fact that this dataset only has 25-dimensional embeddings.

Cracking Vector Search Indexes (2503.01823 - Mageirakos et al., 3 Mar 2025) in Section 5 (Experiments), Control Mechanisms Ablation Study — Varying heuristic rule parameters