Foundations of Vector Retrieval (2401.09350v1)

Published 17 Jan 2024 in cs.DS and cs.IR

Abstract: Vectors are universal mathematical objects that can represent text, images, speech, or a mix of these data modalities. That happens regardless of whether data is represented by hand-crafted features or learnt embeddings. Collect a large enough quantity of such vectors and the question of retrieval becomes urgently relevant: Finding vectors that are more similar to a query vector. This monograph is concerned with the question above and covers fundamental concepts along with advanced data structures and algorithms for vector retrieval. In doing so, it recaps this fascinating topic and lowers barriers of entry into this rich area of research.

Summary

The paper outlines key vector retrieval algorithms including branch-and-bound, LSH, graph, and clustering methods.
It reveals that real-world high-dimensional data often lies on a low-dimensional manifold, reducing retrieval complexity.
The study demonstrates that compression techniques, such as quantization and sketching, enhance storage and speed in vector databases.

Vector Retrieval Essentials

Vector retrieval is a fundamental problem in machine learning involving the search for the most similar data points to a query within a large dataset. At its core, the process involves representing objects—whether they be text, images, or other data modalities—as vectors in a space where similar items cluster together. The relevance of vector representations to a query is then evaluated using similarity metrics such as Euclidean distance, inner product, or cosine similarity. The overarching goal is to efficiently and effectively retrieve the top-k most similar vectors in response to a query, which is critical in many online systems from search engines to recommendation algorithms.
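As a point of reference, exact top-k retrieval can be sketched as a brute-force scan over the whole collection. This is a minimal illustration, not the monograph's method: the function name `top_k` and the toy data are assumptions made here, and the approximate algorithms discussed later exist precisely because this linear scan is too slow at scale.

```python
import numpy as np

def top_k(query, data, k, metric="inner_product"):
    """Score every row of `data` against `query`; return indices of the k best, best first."""
    if metric == "euclidean":
        # Smaller distance is better, so negate to keep "larger score wins".
        scores = -np.linalg.norm(data - query, axis=1)
    elif metric == "inner_product":
        scores = data @ query
    elif metric == "cosine":
        # Assumes no zero-norm vectors.
        scores = (data @ query) / (np.linalg.norm(data, axis=1) * np.linalg.norm(query))
    else:
        raise ValueError(f"unknown metric: {metric}")
    k = min(k, len(data))
    # argpartition selects the k largest scores in linear time; only those k are sorted.
    idx = np.argpartition(-scores, k - 1)[:k]
    return idx[np.argsort(-scores[idx])]

# Hypothetical toy collection: the three metrics can disagree on the same data.
data = np.array([[1.0, 0.0], [0.0, 1.0], [3.0, 3.0], [0.9, 0.1]])
query = np.array([1.0, 0.0])
print(top_k(query, data, 2, "euclidean").tolist())      # [0, 3]
print(top_k(query, data, 2, "inner_product").tolist())  # [2, 0]
print(top_k(query, data, 2, "cosine").tolist())         # [0, 3]
```

Inner product rewards the long vector at index 2, while Euclidean distance and cosine similarity prefer the vectors that point in nearly the same direction as the query.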

Unpacking High Dimensionality

The move towards higher-dimensional data representations presents both opportunities and challenges for vector retrieval. While high dimensionality can lead to more expressive representations, it also complicates retrieval due to the "curse of dimensionality": distances concentrate, so a query's nearest and farthest neighbors end up almost equally far away. However, the monograph points to a silver lining: real-world data often lies near a low-dimensional manifold within the high-dimensional space, which suggests that the intrinsic dimensionality of the data matters more than the dimensionality of the ambient space. Understanding this intrinsic dimensionality is key to improving retrieval methods.
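The effect can be observed numerically. The sketch below is an illustration under stated assumptions rather than anything taken from the monograph: it draws Gaussian points, uses the relative contrast (d_max - d_min) / d_min as the measure of concentration, and stands in for a "manifold" with the simplest possible case, a linear subspace embedded isometrically in the ambient space. The function name and parameters are hypothetical.

```python
import numpy as np

def relative_contrast(ambient_dim, intrinsic_dim=None, n_points=2000, seed=0):
    """Return (d_max - d_min) / d_min for distances from one random query to random points.

    Points are drawn in `intrinsic_dim` dimensions and embedded into
    `ambient_dim` dimensions by a matrix with orthonormal columns, which
    preserves Euclidean distances.
    """
    rng = np.random.default_rng(seed)
    if intrinsic_dim is None:
        intrinsic_dim = ambient_dim
    # The last row plays the role of the query.
    points = rng.standard_normal((n_points + 1, intrinsic_dim))
    # Reduced QR of a random matrix gives an orthonormal basis of a random subspace.
    basis, _ = np.linalg.qr(rng.standard_normal((ambient_dim, intrinsic_dim)))
    embedded = points @ basis.T
    data, query = embedded[:-1], embedded[-1]
    dists = np.linalg.norm(data - query, axis=1)
    return (dists.max() - dists.min()) / dists.min()

for dim in (2, 32, 1024):
    print(f"ambient={dim:5d} intrinsic={dim:5d} contrast={relative_contrast(dim):.3f}")
print(f"ambient= 1024 intrinsic=    2 contrast={relative_contrast(1024, intrinsic_dim=2):.3f}")
```

The contrast shrinks as the dimension grows when the data fills the whole space, but stays large when the same 1024-dimensional vectors are confined to a 2-dimensional subspace.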

Diverse Approaches to Vector Retrieval

The monograph outlines four main classes of algorithms for the vector retrieval problem:

Branch-and-Bound Algorithms: These create a hierarchical structure over the vector space to guide the search process.
Locality-Sensitive Hashing (LSH): LSH maps vectors to buckets such that similar vectors are more likely to be in the same bucket, reducing the search space.
Graph-Based Algorithms: These build a graph in which each vector is a node, and the search moves from node to node along edges towards the best match.
Clustering-Based Algorithms: Vectors are pre-processed into clusters, and the search is conducted within the most relevant clusters.

Each algorithmic approach embodies a unique set of strategies for navigating the complexities of high-dimensional search problems.
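To make one of these classes concrete, the following is a minimal sketch of LSH using random hyperplanes, a hash family commonly used for cosine similarity. The class name `HyperplaneLSH` and its parameters are assumptions for illustration. It uses a single hash table, so true neighbors that fall on the other side of any hyperplane are missed; practical systems typically use several tables to reduce that risk.

```python
from collections import defaultdict

import numpy as np

class HyperplaneLSH:
    """Single-table LSH: the bucket key is the sign pattern of random projections."""

    def __init__(self, dim, n_bits=8, seed=0):
        rng = np.random.default_rng(seed)
        # Each row is the normal vector of one random hyperplane through the origin.
        self.planes = rng.standard_normal((n_bits, dim))
        self.buckets = defaultdict(list)
        self.vectors = []

    def _key(self, v):
        # One bit per hyperplane: which side of the plane the vector lies on.
        return tuple((self.planes @ v > 0).tolist())

    def add(self, v):
        v = np.asarray(v, dtype=float)
        self.buckets[self._key(v)].append(len(self.vectors))
        self.vectors.append(v)

    def query(self, q, k=1):
        q = np.asarray(q, dtype=float)
        # Only vectors in the query's bucket are scored, not the whole collection.
        candidates = self.buckets.get(self._key(q), [])

        def cosine(i):
            v = self.vectors[i]
            return float(v @ q) / (np.linalg.norm(v) * np.linalg.norm(q))

        return sorted(candidates, key=cosine, reverse=True)[:k]

index = HyperplaneLSH(dim=3)
index.add([1.0, 2.0, 3.0])     # id 0
index.add([-1.0, -2.0, -3.0])  # id 1, opposite direction, so every bit is flipped
print(index.query([2.0, 4.0, 6.0]))  # same direction as id 0, so same bucket
```

Vectors separated by a small angle agree on most bits, which is why similar vectors are more likely to share a bucket.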

Compression and Storage Efficiency

Beyond retrieval speed, a second critical concern is the storage footprint of a vector database, which compression can reduce. The monograph explores two such methods: quantization and sketching. Quantization clusters vectors and stores each one compactly as a reference to its cluster, while sketching reduces dimensionality while preserving specific properties of the vectors, such as distances or inner products. These methods improve the efficiency of vector databases in both retrieval time and resource use.
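Both ideas can be sketched in a few lines. This is an illustrative sketch under assumptions, not the monograph's formulation: quantization is shown as plain k-means (Lloyd's algorithm) with each vector stored as one centroid index, and sketching is shown as a Gaussian random projection, one common choice that approximately preserves Euclidean distances. All function names are hypothetical, and the input is assumed to be a float array.

```python
import numpy as np

def encode(data, centroids):
    """Replace each vector with the index of its nearest centroid."""
    d2 = ((data[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def decode(codes, centroids):
    """Approximate reconstruction: look each code up in the codebook."""
    return centroids[codes]

def train_codebook(data, k, n_iters=20, seed=0):
    """Lloyd's k-means; returns a (k, dim) array of centroids."""
    rng = np.random.default_rng(seed)
    # Initialise with k distinct data points.
    centroids = data[rng.choice(len(data), size=k, replace=False)].copy()
    for _ in range(n_iters):
        codes = encode(data, centroids)
        for j in range(k):
            members = data[codes == j]
            if len(members):  # an empty cluster keeps its previous centroid
                centroids[j] = members.mean(axis=0)
    return centroids

def make_sketch(dim, sketch_dim, seed=0):
    """Gaussian projection matrix, scaled to preserve squared norms in expectation."""
    rng = np.random.default_rng(seed)
    return rng.standard_normal((sketch_dim, dim)) / np.sqrt(sketch_dim)

rng = np.random.default_rng(1)
data = rng.standard_normal((500, 64))

codebook = train_codebook(data, k=16)
codes = encode(data, codebook)          # one small integer per vector
print(codes.shape, codebook.shape)

proj = make_sketch(dim=64, sketch_dim=16)
sketched = data @ proj.T                # 64 dimensions reduced to 16
print(sketched.shape)
```

With 16 centroids each vector compresses to a 4-bit code plus a shared codebook, at the cost of reconstruction error; the sketch keeps every vector distinct but only approximates its distances.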

Strategic Aims and Reader Engagement

This work is not a comprehensive survey of every algorithm in the field but an exploration of pivotal ideas and methodologies that have significantly shaped vector retrieval. Aimed at graduate students and researchers, the monograph fosters a deeper understanding of the theoretical underpinnings of various retrieval algorithms, setting the stage for further research and innovation in this evolving field.