Robustness of BSA’s learned error bounds under distribution shift

Establish whether the learned per-dimension error-bound models used in BSA (linear regressions trained at preprocessing time on PCA-projected dimensions) remain effective when the data distribution shifts between the collection used for training and the vectors encountered at query time.

Background

BSA follows the same idea as ADSampling, but it projects vectors with PCA and decides whether to prune using error quantiles. It further proposes a learned approach that fits a linear regression model for each dimension to estimate error bounds, which avoids manual tuning of the significance parameter.
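To make the idea of one linear model per dimension concrete, here is a minimal sketch. It is not BSA's actual formulation: it assumes the vectors are already PCA-projected (variance decreasing by dimension), and it assumes each model predicts the full squared distance from the partial squared distance over the first d dimensions. The names `sq_dist`, `fit_line`, `fit_prefix_models`, and `make_vectors` are hypothetical, and the data is synthetic.

```python
import random

def sq_dist(q, v, d):
    """Squared Euclidean distance over the first d dimensions."""
    return sum((q[i] - v[i]) ** 2 for i in range(d))

def fit_line(xs, ys):
    """Ordinary least squares for y ~ slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def fit_prefix_models(queries, vectors):
    """Assumed setup: one (slope, intercept) pair per prefix length d."""
    dim = len(vectors[0])
    full = [sq_dist(q, v, dim) for q in queries for v in vectors]
    models = []
    for d in range(1, dim + 1):
        partial = [sq_dist(q, v, d) for q in queries for v in vectors]
        models.append(fit_line(partial, full))
    return models

def make_vectors(n, scales, rng):
    """Synthetic stand-in for PCA-projected data: one Gaussian scale per dimension."""
    return [[rng.gauss(0.0, s) for s in scales] for _ in range(n)]

rng = random.Random(0)
scales = [4.0, 2.0, 1.0, 0.5]  # decreasing variance, as after PCA
train_vectors = make_vectors(200, scales, rng)
train_queries = make_vectors(20, scales, rng)
models = fit_prefix_models(train_queries, train_vectors)
for d, (slope, intercept) in enumerate(models, start=1):
    print(f"prefix {d}: full ~ {slope:.3f} * partial + {intercept:.3f}")
```

The sketch also shows where the preprocessing cost comes from: the number of models grows with the dimensionality, and each one is fitted over a sample of query-vector pairs.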

The paper notes that this learned approach incurs significant preprocessing cost, since a model must be trained for every dimension, and that its effectiveness under distribution shift has not yet been demonstrated. Establishing how well the learned bounds hold when the data distribution changes between training and deployment is essential to judging their reliability on real-world, dynamic datasets.
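One simple way to probe this question is to fit a model on one distribution and compare its prediction error on fresh in-distribution pairs against pairs drawn from a shifted distribution. The sketch below is illustrative only and does not reproduce BSA: it assumes already-projected synthetic vectors, a single model for a hypothetical prefix length `d = 2`, and a shift that moves variance into the trailing dimensions. All function names are hypothetical.

```python
import random

def sq_dist(q, v, d):
    """Squared Euclidean distance over the first d dimensions."""
    return sum((q[i] - v[i]) ** 2 for i in range(d))

def fit_line(xs, ys):
    """Ordinary least squares for y ~ slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def make_vectors(n, scales, rng):
    """Synthetic stand-in for projected data: one Gaussian scale per dimension."""
    return [[rng.gauss(0.0, s) for s in scales] for _ in range(n)]

def mean_abs_error(model, d, queries, vectors):
    """Mean |predicted full distance - true full distance| over all pairs."""
    slope, intercept = model
    dim = len(vectors[0])
    errors = [abs(slope * sq_dist(q, v, d) + intercept - sq_dist(q, v, dim))
              for q in queries for v in vectors]
    return sum(errors) / len(errors)

rng = random.Random(1)
train_scales = [4.0, 2.0, 1.0, 0.5]    # assumed training distribution
shifted_scales = [1.0, 1.0, 4.0, 4.0]  # assumed shift: variance moves to trailing dims
d, dim = 2, 4

# Fit the prefix-d model on training pairs only.
train_q = make_vectors(20, train_scales, rng)
train_v = make_vectors(200, train_scales, rng)
model = fit_line([sq_dist(q, v, d) for q in train_q for v in train_v],
                 [sq_dist(q, v, dim) for q in train_q for v in train_v])

# Evaluate on fresh in-distribution data and on shifted data.
in_dist_error = mean_abs_error(model, d, make_vectors(20, train_scales, rng),
                               make_vectors(200, train_scales, rng))
shifted_error = mean_abs_error(model, d, make_vectors(20, shifted_scales, rng),
                               make_vectors(200, shifted_scales, rng))
print(f"in-distribution error: {in_dist_error:.2f}, shifted error: {shifted_error:.2f}")
```

Prediction error is only a proxy here. A fuller study would measure what the pruning decisions do to recall and to the fraction of vectors pruned, on real collections rather than synthetic Gaussians.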

References

However, it is expensive, as a model has to be trained for every dimension in the collection, and their effectiveness has yet to be proven under distribution shifts in the collection.

— PDX: A Data Layout for Vector Similarity Search (2503.04422 - Kuffo et al., 6 Mar 2025) in Section 2.3, The Power of Pruning
