EcoVector: Mobile Vector Search

Updated 3 July 2025
  • EcoVector is a memory- and power-efficient vector search algorithm designed for on-device retrieval-augmented generation using a novel two-stage index structure.
  • It leverages a RAM-resident centroid graph paired with disk-stored cluster graphs to perform scalable approximate nearest neighbor searches under strict resource constraints.
  • The algorithm supports real-time index updates and on-device privacy, making it ideal for mobile knowledge management, secure smart assistants, and edge applications.

EcoVector is a memory- and power-efficient vector search algorithm specifically architected for highly constrained, on-device retrieval-augmented generation (RAG) pipelines on mobile hardware. Introduced as the core retrieval component of the MobileRAG system, EcoVector enables scalable approximate nearest neighbor (ANN) search over large text and document embedding datasets—which is typically prohibitive on mobile platforms due to strict RAM, storage, and energy limitations—by tightly coupling cluster-based graph indexing with disk-resident structures and precise memory management.

1. Architectural Principles and Search Workflow

EcoVector departs from conventional ANN approaches (IVF, IVF-PQ, HNSW, or disk-based variants) by implementing a two-stage index structure that partitions the vector database for efficient RAM–disk usage:

  • Cluster Partitioning: The database of embeddings $\{\mathbf{v}_i\}_{i=1}^N$ is partitioned into $N_c$ clusters (e.g., via k-means), each with centroid $\mu_j \in \mathbb{R}^d$.
  • Centroid Graph in RAM: A lightweight HNSW (Hierarchical Navigable Small World) index is built over the centroids and remains memory-resident. This graph is small (as $N_c \ll N$) and supports efficient, low-memory search for the nearest clusters to any query.
  • Cluster (Inverted List) Graphs in Storage: For each cluster, a separate HNSW subgraph is built over the vectors in that cluster ("inverted list") and stored on disk. These are not loaded until needed.

The search procedure is as follows:

  1. RAM Stage: For each incoming query vector $\mathbf{q}$, search the RAM-resident centroid HNSW to find the $n_P$ closest centroids.
  2. On-demand Loading: For each such centroid/cluster, load its inverted-list subgraph from storage into RAM as needed (partial, not whole database load).
  3. Cluster-Local Search: Query the newly loaded HNSW subgraph(s) for the nearest items to $\mathbf{q}$.
  4. Merging: Aggregate and rerank the retrieved nearest neighbors across clusters.

This design ensures that at any time, memory usage is bounded by the centroid graph and only the (small) portion of the database relevant to the current query, avoiding the large RAM spikes typical of traditional ANN indices.
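To make the two-stage workflow concrete, the sketch below reimplements it in miniature with hnswlib and NumPy. It is an illustrative approximation, not the MobileRAG reference code: the sizes (DIM, N_CLUSTERS, N_PROBE), the naive k-means loop, and the one-file-per-cluster on-disk layout are all assumptions made for the example.

```python
import os
import numpy as np
import hnswlib  # pip install hnswlib

DIM, N, N_CLUSTERS, N_PROBE = 64, 5_000, 32, 4   # illustrative sizes
rng = np.random.default_rng(0)
data = rng.standard_normal((N, DIM)).astype(np.float32)

# --- Offline: cluster partitioning (a few Lloyd iterations of k-means) ---
centroids = data[rng.choice(N, N_CLUSTERS, replace=False)].copy()
for _ in range(10):
    d2 = ((data ** 2).sum(1)[:, None] - 2.0 * data @ centroids.T
          + (centroids ** 2).sum(1)[None, :])
    assign = d2.argmin(axis=1)
    for j in range(N_CLUSTERS):
        if (assign == j).any():
            centroids[j] = data[assign == j].mean(axis=0)

# --- Centroid graph: small, stays RAM-resident ---
centroid_graph = hnswlib.Index(space="l2", dim=DIM)
centroid_graph.init_index(max_elements=N_CLUSTERS, ef_construction=100, M=16)
centroid_graph.add_items(centroids, np.arange(N_CLUSTERS))
centroid_graph.set_ef(32)

# --- One HNSW subgraph per cluster ("inverted list"), serialized to storage ---
os.makedirs("clusters", exist_ok=True)
for j in range(N_CLUSTERS):
    ids = np.flatnonzero(assign == j)
    if ids.size == 0:
        continue
    sub = hnswlib.Index(space="l2", dim=DIM)
    sub.init_index(max_elements=ids.size, ef_construction=100, M=16)
    sub.add_items(data[ids], ids)
    sub.save_index(f"clusters/{j}.hnsw")

def search(query: np.ndarray, k: int = 10):
    """RAM stage -> on-demand subgraph loads -> cluster-local search -> merge."""
    near, _ = centroid_graph.knn_query(query, k=N_PROBE)        # 1. RAM stage
    all_ids, all_dists = [], []
    for j in near[0]:
        path = f"clusters/{j}.hnsw"
        if not os.path.exists(path):
            continue
        sub = hnswlib.Index(space="l2", dim=DIM)
        sub.load_index(path)                                    # 2. partial load
        labels, dists = sub.knn_query(query,                    # 3. local search
                                      k=min(k, sub.get_current_count()))
        all_ids.append(labels[0]); all_dists.append(dists[0])
    ids, dists = np.concatenate(all_ids), np.concatenate(all_dists)
    order = np.argsort(dists)[:k]                               # 4. merge/rerank
    return ids[order], dists[order]

ids, dists = search(rng.standard_normal((1, DIM)).astype(np.float32))
```

In this design the per-query I/O is the subgraph loads in step 2, which is why the latency and power models in the next section separate CPU and disk terms.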

2. Memory, Latency, and Power Analysis

EcoVector's memory complexity is given by:

$$4N_c \left(d + \frac{M'}{1-p_0}\right) + 8N + 4\left(d + \frac{M'}{1-p_0}\right)$$

  • $N_c$: number of clusters (centroids)
  • $d$: embedding dimension
  • $N$: dataset size
  • $M'$: HNSW neighbor parameter
  • $p_0 = 1/\ln M'$: HNSW connection probability
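As a sanity check on the formula's scale, the snippet below plugs in illustrative values; the dataset size, dimension, and $M'$ are assumptions, and the units are taken to be bytes (consistent with 4-byte floats and 8-byte identifiers), not figures from the paper.

```python
from math import log

# Illustrative parameters (assumptions, not the paper's configuration).
N, N_c, d, M_prime = 100_000, 1_000, 384, 16
p0 = 1.0 / log(M_prime)                  # HNSW connection probability
per_node = d + M_prime / (1.0 - p0)      # floats + links kept per graph node
mem = 4 * N_c * per_node + 8 * N + 4 * per_node  # bytes, assuming float32/int64
print(f"{mem / 2**20:.1f} MiB")          # ~2.3 MiB RAM-resident for 100k vectors
```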

Latency is decomposed into CPU time spent on centroid/cluster search ($t_s$) and disk I/O time for subgraph loads ($t_d$): $T_{\text{search}} = t_s + t_d$, with

$$t_s = n_{\text{search}} \cdot t_{\text{op}}$$

$$t_d = n_{\text{seek}} \cdot (T_{\text{seek}} + T_{\text{cmd}}) + n_{\text{byte}} \cdot T_{\text{transfer}}$$

where $n_{\text{search}}$ is the number of HNSW distance computations, $t_{\text{op}}$ the CPU time per operation, $n_{\text{seek}}$ and $n_{\text{byte}}$ the number of disk seeks and bytes transferred, and $T_{\text{seek}}$, $T_{\text{cmd}}$, $T_{\text{transfer}}$ the storage timing constants.

Power is estimated as $E \approx V \cdot \left[I(t_s) \cdot t_s + I(t_d) \cdot t_d\right]$, where $V$ is the device voltage and $I(\cdot)$ the average current drawn by the CPU or disk during the respective stage.
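The following back-of-the-envelope evaluation shows how the three formulas combine. Every constant (operation counts, seek/transfer times, voltage, currents) is a placeholder chosen only to make the arithmetic concrete, not a measured value.

```python
# Hypothetical per-query counts and device constants (all assumptions).
n_search, t_op = 2_000, 50e-9              # HNSW distance ops; seconds per op
n_seek, T_seek, T_cmd = 4, 100e-6, 20e-6   # one subgraph load per probed cluster
n_byte, T_transfer = 4 * 2**20, 1 / (500 * 2**20)  # 4 MiB read at 500 MiB/s

t_s = n_search * t_op                                   # CPU stage
t_d = n_seek * (T_seek + T_cmd) + n_byte * T_transfer   # disk stage
V, I_cpu, I_disk = 3.85, 0.8, 0.3                       # volts; amps per stage
E = V * (I_cpu * t_s + I_disk * t_d)                    # joules per query
print(f"T_search = {(t_s + t_d) * 1e3:.2f} ms, E = {E * 1e3:.2f} mJ")
# -> T_search = 8.58 ms, E = 10.10 mJ under these assumptions
```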

Experimental results on contemporary mobile silicon (e.g., Galaxy S24) demonstrate dramatic improvements over both RAM- and disk-based server-optimized ANN baselines: up to 10–54% lower memory, 10–41% reduced latency (Time To First Token, TTFT), and up to 40% less power per query, even on datasets with over $10^5$ embeddings.

3. Incremental Updates and Robustness

EcoVector supports efficient, in-place insertion and deletion of individual vectors (nodes) from HNSW subgraphs, using algorithms that maintain graph bidirectionality and connectivity (Algorithm 1/2 in the referenced work). Specifically, insertions first attempt "bidirectional greedy insertion" into active clusters and fall back to graph update if the cluster is full; deletions are handled by updating all incident links in the subgraph, ensuring the cluster remains fully navigable.
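A minimal sketch of the bookkeeping this involves, assuming a flat adjacency-set representation of one cluster subgraph; the paper's Algorithms 1 and 2 operate on full multi-layer HNSW graphs, and the helper names here are hypothetical.

```python
import numpy as np

def insert_node(graph: dict[int, set[int]], vecs: dict[int, np.ndarray],
                new_id: int, v: np.ndarray, M: int = 16) -> None:
    """Bidirectional greedy insertion: link new_id to its M nearest nodes,
    writing both directions of every edge so the graph stays navigable."""
    nearest = sorted(graph, key=lambda i: float(((vecs[i] - v) ** 2).sum()))[:M]
    vecs[new_id] = v
    graph[new_id] = set(nearest)
    for i in nearest:
        graph[i].add(new_id)              # reverse edge

def delete_node(graph: dict[int, set[int]], vecs: dict[int, np.ndarray],
                dead_id: int) -> None:
    """Remove a node and update all incident links. (The full algorithm also
    re-links the orphaned neighbors to preserve connectivity; omitted here.)"""
    for i in graph.pop(dead_id):
        graph[i].discard(dead_id)         # drop the reverse edge too
    del vecs[dead_id]
```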

This allows for continuous, real-time indexing as documents are added or deleted—essential for on-device privacy (local-only) operation and for dynamic user content.

4. Selective Content Reduction (SCR): Enabling Efficient RAG

Within MobileRAG, EcoVector is paired with Selective Content Reduction (SCR) to further optimize downstream use of small LLMs (sLMs):

  • Sentence segmentation: Each retrieved document's text is split into sentences.
  • Chunking: Overlapping windows of 3–5 sentences are formed from each document.
  • Similarity scoring: Each chunk is embedded and scored for semantic similarity to the query embedding.
  • Top-N selection with context: The highest-scoring chunk(s) per document are selected, and context is preserved by including adjacent sentences.
  • Prompt assembly: Only these condensed, highly relevant segments form the LM prompt, with re-ranking for maximum relevance.

Impact: Token counts are reduced by 7–42%, directly leading to lower inference cost for the sLM step and lower end-to-end TTFT, all while maintaining answer accuracy. SCR’s context-aware design avoids the context fragmentation penalties encountered in naive "small chunk" RAG.
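A compact sketch of the SCR flow, assuming a stand-in `embed` function; the real system would call an on-device sentence-embedding model, and the naive sentence splitter, window size, and helper names are illustrative.

```python
import numpy as np

def embed(texts: list[str]) -> np.ndarray:
    """Placeholder embedding (bag of character codes); swap in a real model."""
    out = np.zeros((len(texts), 64), dtype=np.float32)
    for row, t in enumerate(texts):
        for ch in t.lower():
            out[row, ord(ch) % 64] += 1.0
    return out

def scr(doc: str, query: str, window: int = 4, top_n: int = 2) -> str:
    """Select the most query-relevant sentence windows, keeping one adjacent
    sentence on each side for context."""
    sents = [s.strip() for s in doc.split(".") if s.strip()]
    chunks = [" ".join(sents[i:i + window])                 # overlapping windows
              for i in range(max(1, len(sents) - window + 1))]
    cv, qv = embed(chunks), embed([query])[0]
    sims = cv @ qv / (np.linalg.norm(cv, axis=1) * np.linalg.norm(qv) + 1e-9)
    picked = []
    for i in sorted(np.argsort(-sims)[:top_n]):             # top-N chunks
        lo, hi = max(0, i - 1), min(len(sents), i + window + 1)
        picked.append(". ".join(sents[lo:hi]) + ".")        # adjacent context
    return "\n".join(picked)
```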

5. Empirical Performance

MobileRAG (EcoVector + SCR) outpaces baseline on-device and edge-serving RAG pipelines:

| Metric | Naive-RAG | EdgeRAG | Advanced RAG | MobileRAG (EcoVector+SCR) |
|---|---|---|---|---|
| Search Latency (Qwen2.5) | 6.79 s | 6.79 s | 7.17 s | 5.01 s |
| Power per Query (J) | 32.72 | 32.75 | 34.43 | 24.71 |
| TTFT Improvement | | | | 10–41% faster |
| Battery per 1k tokens | 0.10% | | | 0.10% |

For detailed dataset-by-dataset breakdowns (SQuAD, TriviaQA, HotpotQA), MobileRAG consistently uses less RAM, less energy, and achieves faster response than both server-optimized and existing "edge" approaches, with no statistically significant loss in MRR/accuracy.

6. Privacy, Security, and Offline Capability

EcoVector is designed to guarantee on-device privacy:

  • The entire search, SCR, and LM inference run locally—no network communication occurs.
  • Sensitive user content (documents, photos, messages) never leaves the device.
  • Index updates and deletions (including sensitive data removal) are handled in-RAM plus direct disk mutation, with no cloud involvement.
  • Offline operation is fully supported, making the system robust against connectivity lapses and appropriate for regulatory or sensitive verticals (health, enterprise, legal, finance).

7. Application Domains and Research Significance

EcoVector’s techniques establish a new baseline for deployed on-device vector search:

  • Mobile Knowledge Management: Secure, privacy-preserving search and summarization of notes, files, emails, and conversations.
  • Smart Assistants: Personalized, fast RAG for daily productivity and question answering, fully offline and trustworthy.
  • Sensitive verticals: Supports health, financial, educational, and legal use cases where data residency and privacy are paramount.
  • IoT and Edge Devices: The design extends to wearables, industrial gateways, and on-premise appliances where RAM and energy are scarce.
  • Research and Development: The RAM–disk cluster-HNSW architecture and techniques for subgraph management represent a transferable blueprint for ANN under resource constraints.

In conclusion, EcoVector provides a technically rigorous, empirically validated approach for enabling fast and energy-efficient vector search on mobile and edge hardware. When paired with Selective Content Reduction, the MobileRAG pipeline delivers seamless, privacy-respecting retrieval-augmented generation suitable for next-generation device-integrated AI systems.