EcoVector: Mobile Vector Search

Updated 3 July 2025
  • EcoVector is a memory- and power-efficient vector search algorithm designed for on-device retrieval-augmented generation using a novel two-stage index structure.
  • It leverages a RAM-resident centroid graph paired with disk-stored cluster graphs to perform scalable approximate nearest neighbor searches under strict resource constraints.
  • The algorithm supports real-time index updates and on-device privacy, making it ideal for mobile knowledge management, secure smart assistants, and edge applications.

EcoVector is a memory- and power-efficient vector search algorithm specifically architected for highly constrained, on-device retrieval-augmented generation (RAG) pipelines on mobile hardware. Introduced as the core retrieval component of the MobileRAG system, EcoVector enables scalable approximate nearest neighbor (ANN) search over large text and document embedding datasets—which is typically prohibitive on mobile platforms due to strict RAM, storage, and energy limitations—by tightly coupling cluster-based graph indexing with disk-resident structures and precise memory management.

1. Architectural Principles and Search Workflow

EcoVector departs from conventional ANN approaches (IVF, IVF-PQ, HNSW, or disk-based variants) by implementing a two-stage index structure that partitions the vector database for efficient RAM–disk usage:

  • Cluster Partitioning: The database of embeddings $\{\mathbf{v}_i\}_{i=1}^N$ is partitioned into $N_c$ clusters (e.g., via k-means), each with centroid $\mu_j \in \mathbb{R}^d$.
  • Centroid Graph in RAM: A lightweight HNSW (Hierarchical Navigable Small World) index is built over the centroids and remains memory-resident. This graph is small (as $N_c \ll N$) and supports efficient, low-memory search for the nearest clusters to any query.
  • Cluster (Inverted List) Graphs in Storage: For each cluster, a separate HNSW subgraph is built over the vectors in that cluster ("inverted list") and stored on disk. These are not loaded until needed.

The search procedure is as follows:

  1. RAM Stage: For each incoming query vector $\mathbf{q}$, search the RAM-resident centroid HNSW to find the $n_P$ closest centroids.
  2. On-demand Loading: For each such centroid/cluster, load its inverted-list subgraph from storage into RAM as needed (partial, not whole database load).
  3. Cluster-Local Search: Query the newly loaded HNSW subgraph(s) for the nearest items to $\mathbf{q}$.
  4. Merging: Aggregate and rerank the retrieved nearest neighbors across clusters.

This design ensures that at any time, memory usage is bounded by the centroid graph and only the (small) portion of the database relevant to the current query, avoiding the large RAM spikes typical of traditional ANN indices.
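To make the two-stage workflow concrete, the sketch below reimplements it in miniature with hnswlib and NumPy. It is an illustrative approximation, not the MobileRAG reference code: the sizes (DIM, N_CLUSTERS, N_PROBE), the naive k-means loop, and the one-file-per-cluster on-disk layout are all assumptions made for the example.

```python
import os
import numpy as np
import hnswlib  # pip install hnswlib

DIM, N, N_CLUSTERS, N_PROBE = 64, 5_000, 32, 4   # illustrative sizes
rng = np.random.default_rng(0)
data = rng.standard_normal((N, DIM)).astype(np.float32)

# --- Offline: cluster partitioning (a few Lloyd iterations of k-means) ---
centroids = data[rng.choice(N, N_CLUSTERS, replace=False)].copy()
for _ in range(10):
    d2 = ((data ** 2).sum(1)[:, None] - 2.0 * data @ centroids.T
          + (centroids ** 2).sum(1)[None, :])
    assign = d2.argmin(axis=1)
    for j in range(N_CLUSTERS):
        if (assign == j).any():
            centroids[j] = data[assign == j].mean(axis=0)

# --- Centroid graph: small, stays RAM-resident ---
centroid_graph = hnswlib.Index(space="l2", dim=DIM)
centroid_graph.init_index(max_elements=N_CLUSTERS, ef_construction=100, M=16)
centroid_graph.add_items(centroids, np.arange(N_CLUSTERS))
centroid_graph.set_ef(32)

# --- One HNSW subgraph per cluster ("inverted list"), serialized to storage ---
os.makedirs("clusters", exist_ok=True)
for j in range(N_CLUSTERS):
    ids = np.flatnonzero(assign == j)
    if ids.size == 0:
        continue
    sub = hnswlib.Index(space="l2", dim=DIM)
    sub.init_index(max_elements=ids.size, ef_construction=100, M=16)
    sub.add_items(data[ids], ids)
    sub.save_index(f"clusters/{j}.hnsw")

def search(query: np.ndarray, k: int = 10):
    """RAM stage -> on-demand subgraph loads -> cluster-local search -> merge."""
    near, _ = centroid_graph.knn_query(query, k=N_PROBE)        # 1. RAM stage
    all_ids, all_dists = [], []
    for j in near[0]:
        path = f"clusters/{j}.hnsw"
        if not os.path.exists(path):
            continue
        sub = hnswlib.Index(space="l2", dim=DIM)
        sub.load_index(path)                                    # 2. partial load
        labels, dists = sub.knn_query(query,                    # 3. local search
                                      k=min(k, sub.get_current_count()))
        all_ids.append(labels[0]); all_dists.append(dists[0])
    ids, dists = np.concatenate(all_ids), np.concatenate(all_dists)
    order = np.argsort(dists)[:k]                               # 4. merge/rerank
    return ids[order], dists[order]

ids, dists = search(rng.standard_normal((1, DIM)).astype(np.float32))
```

In this design the per-query I/O is the subgraph loads in step 2, which is why the latency and power models in the next section separate CPU and disk terms.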

2. Memory, Latency, and Power Analysis

EcoVector's memory complexity is given by:

$$4N_c \left(d + \frac{M'}{1-p_0}\right) + 8N + 4\left(d + \frac{M'}{1-p_0}\right)$$

  • $N_c$: number of clusters (centroids)
  • $d$: embedding dimension
  • $N$: dataset size
  • $M'$: HNSW neighbor parameter
  • $p_0 = 1/\ln M'$: HNSW connection probability
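As a sanity check on the formula's scale, the snippet below plugs in illustrative values; the dataset size, dimension, and $M'$ are assumptions, and the units are taken to be bytes (consistent with 4-byte floats and 8-byte identifiers), not figures from the paper.

```python
from math import log

# Illustrative parameters (assumptions, not the paper's configuration).
N, N_c, d, M_prime = 100_000, 1_000, 384, 16
p0 = 1.0 / log(M_prime)                  # HNSW connection probability
per_node = d + M_prime / (1.0 - p0)      # floats + links kept per graph node
mem = 4 * N_c * per_node + 8 * N + 4 * per_node  # bytes, assuming float32/int64
print(f"{mem / 2**20:.1f} MiB")          # ~2.3 MiB RAM-resident for 100k vectors
```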

Latency is decomposed into CPU time spent on centroid/cluster search ($t_s$) and disk I/O time for subgraph loads ($t_d$): $T_{\text{search}} = t_s + t_d$, with

$$t_s = n_{\text{search}} \cdot t_{\text{op}}$$

$$t_d = n_{\text{seek}} \cdot (T_{\text{seek}} + T_{\text{cmd}}) + n_{\text{byte}} \cdot T_{\text{transfer}}$$

where $n_{\text{search}}$ is the number of HNSW distance computations, $t_{\text{op}}$ the CPU time per operation, $n_{\text{seek}}$ and $n_{\text{byte}}$ the number of disk seeks and bytes transferred, and $T_{\text{seek}}$, $T_{\text{cmd}}$, $T_{\text{transfer}}$ the storage timing constants.

Power is estimated as $E \approx V \cdot \left[I(t_s) \cdot t_s + I(t_d) \cdot t_d\right]$, where $V$ is the device voltage and $I(\cdot)$ the average current drawn by the CPU or disk during the respective stage.
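The following back-of-the-envelope evaluation shows how the three formulas combine. Every constant (operation counts, seek/transfer times, voltage, currents) is a placeholder chosen only to make the arithmetic concrete, not a measured value.

```python
# Hypothetical per-query counts and device constants (all assumptions).
n_search, t_op = 2_000, 50e-9              # HNSW distance ops; seconds per op
n_seek, T_seek, T_cmd = 4, 100e-6, 20e-6   # one subgraph load per probed cluster
n_byte, T_transfer = 4 * 2**20, 1 / (500 * 2**20)  # 4 MiB read at 500 MiB/s

t_s = n_search * t_op                                   # CPU stage
t_d = n_seek * (T_seek + T_cmd) + n_byte * T_transfer   # disk stage
V, I_cpu, I_disk = 3.85, 0.8, 0.3                       # volts; amps per stage
E = V * (I_cpu * t_s + I_disk * t_d)                    # joules per query
print(f"T_search = {(t_s + t_d) * 1e3:.2f} ms, E = {E * 1e3:.2f} mJ")
# -> T_search = 8.58 ms, E = 10.10 mJ under these assumptions
```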

Experimental results on contemporary mobile silicon (e.g., Galaxy S24) demonstrate dramatic improvements over both RAM- and disk-based server-optimized ANN baselines: up to 10–54% lower memory, 10–41% reduced latency (Time To First Token, TTFT), and up to 40% less power per query, even on datasets with over $10^5$ embeddings.

3. Incremental Updates and Robustness

EcoVector supports efficient, in-place insertion and deletion of individual vectors (nodes) from HNSW subgraphs, using algorithms that maintain graph bidirectionality and connectivity (Algorithm 1/2 in the referenced work). Specifically, insertions first attempt "bidirectional greedy insertion" into active clusters and fall back to graph update if the cluster is full; deletions are handled by updating all incident links in the subgraph, ensuring the cluster remains fully navigable.
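A minimal sketch of the bookkeeping this involves, assuming a flat adjacency-set representation of one cluster subgraph; the paper's Algorithms 1 and 2 operate on full multi-layer HNSW graphs, and the helper names here are hypothetical.

```python
import numpy as np

def insert_node(graph: dict[int, set[int]], vecs: dict[int, np.ndarray],
                new_id: int, v: np.ndarray, M: int = 16) -> None:
    """Bidirectional greedy insertion: link new_id to its M nearest nodes,
    writing both directions of every edge so the graph stays navigable."""
    nearest = sorted(graph, key=lambda i: float(((vecs[i] - v) ** 2).sum()))[:M]
    vecs[new_id] = v
    graph[new_id] = set(nearest)
    for i in nearest:
        graph[i].add(new_id)              # reverse edge

def delete_node(graph: dict[int, set[int]], vecs: dict[int, np.ndarray],
                dead_id: int) -> None:
    """Remove a node and update all incident links. (The full algorithm also
    re-links the orphaned neighbors to preserve connectivity; omitted here.)"""
    for i in graph.pop(dead_id):
        graph[i].discard(dead_id)         # drop the reverse edge too
    del vecs[dead_id]
```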

This allows for continuous, real-time indexing as documents are added or deleted—essential for on-device privacy (local-only) operation and for dynamic user content.

4. Selective Content Reduction (SCR): Enabling Efficient RAG

Within MobileRAG, EcoVector is paired with Selective Content Reduction (SCR) to further optimize downstream use of small LLMs (sLMs):

  • Sentence segmentation: Each retrieved document's text is split into sentences.
  • Chunking: Overlapping windows of 3–5 sentences are formed from each document.
  • Similarity scoring: Each chunk is embedded and scored for semantic similarity to the query embedding.
  • Top-N selection with context: The highest-scoring chunk(s) per document are selected, and context is preserved by including adjacent sentences.
  • Prompt assembly: Only these condensed, highly relevant segments form the LM prompt, with re-ranking for maximum relevance.

Impact: Token counts are reduced by 7–42%, directly leading to lower inference cost for the sLM step and lower end-to-end TTFT, all while maintaining answer accuracy. SCR’s context-aware design avoids the context fragmentation penalties encountered in naive "small chunk" RAG.
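A compact sketch of the SCR flow, assuming a stand-in `embed` function; the real system would call an on-device sentence-embedding model, and the naive sentence splitter, window size, and helper names are illustrative.

```python
import numpy as np

def embed(texts: list[str]) -> np.ndarray:
    """Placeholder embedding (bag of character codes); swap in a real model."""
    out = np.zeros((len(texts), 64), dtype=np.float32)
    for row, t in enumerate(texts):
        for ch in t.lower():
            out[row, ord(ch) % 64] += 1.0
    return out

def scr(doc: str, query: str, window: int = 4, top_n: int = 2) -> str:
    """Select the most query-relevant sentence windows, keeping one adjacent
    sentence on each side for context."""
    sents = [s.strip() for s in doc.split(".") if s.strip()]
    chunks = [" ".join(sents[i:i + window])                 # overlapping windows
              for i in range(max(1, len(sents) - window + 1))]
    cv, qv = embed(chunks), embed([query])[0]
    sims = cv @ qv / (np.linalg.norm(cv, axis=1) * np.linalg.norm(qv) + 1e-9)
    picked = []
    for i in sorted(np.argsort(-sims)[:top_n]):             # top-N chunks
        lo, hi = max(0, i - 1), min(len(sents), i + window + 1)
        picked.append(". ".join(sents[lo:hi]) + ".")        # adjacent context
    return "\n".join(picked)
```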

5. Empirical Performance

MobileRAG (EcoVector + SCR) outpaces baseline on-device and edge-serving RAG pipelines:

| Metric | Naive-RAG | EdgeRAG | Advanced RAG | MobileRAG (EcoVector+SCR) |
|---|---|---|---|---|
| Search Latency (Qwen2.5) | 6.79 s | 6.79 s | 7.17 s | 5.01 s |
| Power per Query (J) | 32.72 | 32.75 | 34.43 | 24.71 |
| TTFT Improvement | | | | 10–41% faster |
| Battery per 1k tokens | 0.10% | | | 0.10% |

For detailed dataset-by-dataset breakdowns (SQuAD, TriviaQA, HotpotQA), MobileRAG consistently uses less RAM, less energy, and achieves faster response than both server-optimized and existing "edge" approaches, with no statistically significant loss in MRR/accuracy.

6. Privacy, Security, and Offline Capability

EcoVector is designed to guarantee on-device privacy:

  • The entire search, SCR, and LM inference run locally—no network communication occurs.
  • Sensitive user content (documents, photos, messages) never leaves the device.
  • Index updates and deletions (including sensitive data removal) are handled in-RAM plus direct disk mutation, with no cloud involvement.
  • Offline operation is fully supported, making the system robust against connectivity lapses and appropriate for regulatory or sensitive verticals (health, enterprise, legal, finance).

7. Application Domains and Research Significance

EcoVector’s techniques establish a new baseline for deployed on-device vector search:

  • Mobile Knowledge Management: Secure, privacy-preserving search and summarization of notes, files, emails, and conversations.
  • Smart Assistants: Personalized, fast RAG for daily productivity and question answering, fully offline and trustworthy.
  • Sensitive verticals: Supports health, financial, educational, and legal use cases where data residency and privacy are paramount.
  • IoT and Edge Devices: The design extends to wearables, industrial gateways, and on-premise appliances where RAM and energy are scarce.
  • Research and Development: The RAM–disk cluster-HNSW architecture and techniques for subgraph management represent a transferable blueprint for ANN under resource constraints.

In conclusion, EcoVector provides a technically rigorous, empirically validated approach for enabling fast and energy-efficient vector search on mobile and edge hardware. When paired with Selective Content Reduction, the MobileRAG pipeline delivers seamless, privacy-respecting retrieval-augmented generation suitable for next-generation device-integrated AI systems.