MobileRAG: Efficient On-Device RAG

Updated 3 July 2025
  • MobileRAG is a methodology that integrates efficient vector retrieval and context reduction for on-device retrieval-augmented generation.
  • It employs EcoVector, a hybrid in-memory/disk vector search algorithm, to address mobile memory, latency, and energy challenges.
  • Selective Content Reduction optimizes LM inputs, ensuring low-latency, privacy-preserving operations suitable for offline mobile applications.

MobileRAG refers to the class of methodologies and frameworks that enable retrieval-augmented generation (RAG) directly on mobile devices, addressing the constraints on memory, computational resources, latency, privacy, and energy consumption characteristic of mobile environments. The term encompasses both the design of new vector search and generation algorithms tailored for resource-constrained settings and the architectural pipelines that ensure efficient, private, and robust question answering and information retrieval on mobile platforms.

1. Challenges of On-Device RAG in Mobile Settings

Running RAG pipelines on mobile devices introduces technical hurdles not present in server or cloud deployments:

  • Memory Constraints: Contemporary vector search backends (e.g., HNSW, IVF) require large index structures to reside in RAM. Mobile devices typically provide only 4–12GB of RAM (with 5–6GB available to applications), limiting feasible corpus sizes and leading to frequent out-of-memory errors if large graphs or embeddings are stored.
  • Power Consumption: Vector search and language model (LM) inference significantly strain CPUs and sometimes GPUs. Sustained high loads increase battery drain and device heat, and can trigger operating-system-level throttling, diminishing the user experience.
  • Latency: Sequential and high-dimensional retrieval plus large LM contexts yield high time-to-first-token (TTFT) and overall response time, undermining interactivity.
  • Offline and Privacy Requirements: Many applications require all user data to be processed locally for privacy, disallowing even intermittent cloud or external API access.
  • Update and Scalability Issues: Traditional approaches often necessitate index rebuilds or full local reloading to add or remove documents, resulting in user-unfriendly or impractical maintenance for dynamic personal corpora.

2. The MobileRAG Pipeline: Core Algorithms and Techniques

MobileRAG introduces two principal technical innovations to address these barriers: the EcoVector vector search algorithm and the Selective Content Reduction (SCR) procedure for LM input optimization.

EcoVector is a vector retrieval system specifically designed for mobile platforms, addressing memory and power constraints as follows:

  • Cluster Partitioning: Embedding vectors are divided into clusters using k-means; only centroids and minimal cluster metadata reside in RAM.
  • Centroids Graph (RAM): An HNSW graph is constructed over only the centroids, not the full vector set. This retains high retrieval recall but with a drastically reduced memory footprint.
  • Per-Cluster Graphs (Disk): For each cluster, a separate HNSW graph covers the vectors in that cluster; these are stored on disk and loaded into RAM only on demand during search.
  • Partial Loading Search: A query first searches the centroids graph to select likely clusters. Then, only those clusters' graphs are loaded, queried, and unloaded immediately after use. This partitioned RAM/disk strategy maintains responsiveness and enables operation on large corpora.
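
As an illustration of this partitioned search, the sketch below follows the same RAM/disk split under simplifying assumptions: brute-force scans stand in for the HNSW graphs, per-cluster indexes are plain `.npz` files, and the names (`build_clusters`, `search`, `cluster_paths`) are hypothetical rather than taken from the MobileRAG implementation.

```python
import numpy as np

# Sketch of EcoVector-style partial loading: centroids stay in RAM,
# each cluster's vectors live in a separate on-disk file loaded on demand.

def build_clusters(vectors, n_clusters, n_iters=20, seed=0):
    """Toy k-means partitioning (a tuned library would be used in practice)."""
    rng = np.random.default_rng(seed)
    centroids = vectors[rng.choice(len(vectors), n_clusters, replace=False)].copy()
    for _ in range(n_iters):
        assign = np.argmin(
            np.linalg.norm(vectors[:, None] - centroids[None], axis=2), axis=1)
        for c in range(n_clusters):
            members = vectors[assign == c]
            if len(members):
                centroids[c] = members.mean(axis=0)
    return centroids, assign

def save_cluster(path, ids, vecs):
    """Persist one cluster's ids and vectors (stand-in for a per-cluster HNSW graph)."""
    np.savez(path, ids=ids, vecs=vecs)

def search(query, centroids, cluster_paths, n_probe=3, top_k=5):
    # 1) RAM-only step: pick the n_probe closest centroids
    #    (an HNSW graph over centroids in the real system).
    order = np.argsort(np.linalg.norm(centroids - query, axis=1))[:n_probe]
    candidates = []
    # 2) Disk step: load only the probed clusters, search them, release them.
    for c in order:
        with np.load(cluster_paths[c]) as blob:
            dists = np.linalg.norm(blob["vecs"] - query, axis=1)
            for i in np.argsort(dists)[:top_k]:
                candidates.append((float(dists[i]), int(blob["ids"][i])))
    candidates.sort()
    return [doc_id for _, doc_id in candidates[:top_k]]
```

Only the centroid array and one probed cluster at a time need to be resident in memory, which is what the memory model below captures.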

Memory usage formula:

$$\text{EcoVector memory} = 4N_c\left(d + \frac{M'}{1-p_0}\right) + 8N + 4\left(d + \frac{M'}{1-p_0}\right)$$

where $N_c$ is the number of centroids, $d$ is the embedding dimension, $M'$ is the graph connectivity, $p_0$ is the probability of a missing node, and $N$ is the corpus size.

Search time formula:

$$T_{\text{search}} = ef_c \cdot M' + n_P \cdot ef_L \cdot M'$$

where $ef_c$ and $ef_L$ are search parameters and $n_P$ is the number of probed clusters.

Energy usage estimation:

$$E \approx V\left[I(t_s)\,t_s + I(t_d)\,t_d\right]$$

where $I(\cdot)$ is the current drawn during the search ($t_s$) and disk ($t_d$) phases, and $V$ is the device voltage.
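
As a purely illustrative check of this cost model, the snippet below plugs in placeholder parameter values (not measurements from the paper) to show how the three estimates are computed.

```python
# Illustrative evaluation of the EcoVector cost model; all values are placeholders.
N       = 100_000        # corpus size (number of vectors)
N_c     = 1_000          # number of centroids
d       = 384            # embedding dimension
M_prime = 16             # graph connectivity M'
p0      = 0.05           # probability of a missing node
ef_c, ef_L, n_P = 64, 32, 3   # search parameters and number of probed clusters

per_node     = d + M_prime / (1 - p0)
memory_bytes = 4 * N_c * per_node + 8 * N + 4 * per_node
search_cost  = ef_c * M_prime + n_P * ef_L * M_prime   # abstract per-query work

V, I_s, I_d, t_s, t_d = 3.85, 0.90, 0.40, 0.050, 0.020  # volts, amps, seconds
energy_joules = V * (I_s * t_s + I_d * t_d)

print(f"memory ≈ {memory_bytes / 1e6:.2f} MB, "
      f"search cost ≈ {search_cost} units, energy ≈ {energy_joules:.3f} J")
```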

EcoVector supports online insertions and deletions by confining updates to single per-cluster disk graphs, avoiding full index rebuilds.
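
Continuing the earlier sketch, an insertion under these assumptions touches only the nearest cluster's on-disk file (the helper below is hypothetical and reuses the `.npz` stand-in for per-cluster graphs):

```python
import numpy as np

def insert(vector, doc_id, centroids, cluster_paths):
    """Add one vector by rewriting only its nearest cluster's on-disk index."""
    c = int(np.argmin(np.linalg.norm(centroids - vector, axis=1)))
    with np.load(cluster_paths[c]) as blob:
        ids  = np.append(blob["ids"], doc_id)
        vecs = np.vstack([blob["vecs"], vector])
    np.savez(cluster_paths[c], ids=ids, vecs=vecs)   # only this cluster is rewritten
```

A deletion would follow the same pattern, dropping the corresponding rows before writing the cluster back.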

Selective Content Reduction (SCR): Efficient Context Construction

SCR is a post-retrieval method for minimizing and optimizing the LM input context:

  1. Sentence Segmentation: Each document is divided into sentences, and sliding sentence windows are created.
  2. Window Similarity: Embeddings are calculated for each window; scores indicate similarity to the query.
  3. Selective Merging: Only top-N windows per document (potentially with adjacent context for coherence) are retained and merged.
  4. Reordering: Final LM prompts present the most relevant content first, effectively implementing an implicit re-ranking.

SCR typically reduces context size by 7–42% (dataset-dependent), resulting in lower LM inference latency and power use, while empirical accuracy is maintained.
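
The following is a minimal sketch of these four steps under simplifying assumptions: sentence splitting is naive, `embed` is a stand-in for the on-device embedding model, and the function name and parameters are illustrative rather than drawn from the paper.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def selective_content_reduction(query, documents, embed, window=3, top_n=2):
    """Window, score, merge, and reorder retrieved documents for the LM prompt."""
    q_vec = embed(query)
    scored_chunks = []
    for doc in documents:
        # 1) Sentence segmentation (naive split) and sliding sentence windows.
        sents = [s.strip() for s in doc.split(".") if s.strip()]
        windows = [". ".join(sents[i:i + window])
                   for i in range(max(1, len(sents) - window + 1))]
        # 2) Window similarity to the query.
        scores = [cosine(embed(w), q_vec) for w in windows]
        # 3) Selective merging: keep only the top-N windows of this document,
        #    in their original order for coherence.
        best = sorted(range(len(windows)), key=lambda i: -scores[i])[:top_n]
        merged = " ".join(windows[i] for i in sorted(best))
        scored_chunks.append((max(scores[i] for i in best), merged))
    # 4) Reordering: most query-relevant content first (implicit re-ranking).
    scored_chunks.sort(key=lambda x: -x[0])
    return "\n\n".join(chunk for _, chunk in scored_chunks)
```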

3. Empirical Performance and Evaluation

Evaluation on Samsung Galaxy S24 (8GB RAM, Exynos 2400, Android 14) demonstrates:

  • Memory Consumption: MobileRAG’s RAM usage is on par with disk-based baselines and markedly lower than in-memory approaches (full IVF/HNSW).
  • Latency: Time-to-first-token improves by 10–41% over baselines on SQuAD, HotpotQA, and TriviaQA.
  • Power Efficiency: Energy consumption falls by 24–40% relative to alternative on-device RAG methods.
  • Accuracy: On all QA datasets, accuracy matches or surpasses baselines with explicit re-ranking and chunk pruning.
  • Update Speed: Incremental index updates are markedly faster than full rebuilds, supporting dynamic use.
  • Battery Impact: For 1K token queries, battery impact is 0.1–0.36%, indicating suitability for everyday use.

A summary of comparative features:

| Aspect | MobileRAG | Existing On-Device RAG |
|---|---|---|
| Indexing | Clustered + partitioned graph, partial load | Full in-memory/disk, less efficient |
| LM Prompting | SCR post-retrieval, context windowing | Full documents/chunks |
| Memory Footprint | Minimal, loaded per query | High |
| CPU/Power | Low, efficient compute/disk balance | Elevated |
| Latency | 10–41% faster TTFT | Slower |
| Privacy/Offline | Fully local | Usually local, but often limited |
| Scalability | Incremental per-cluster updates | Typically full rebuilds |

4. System Implementation and Privacy Features

MobileRAG employs a hybrid in-memory/disk index, leveraging SQLite for embedding and document management. All search, content reduction, and LM prompting occur fully on-device, with no network dependence or cloud interaction, directly enhancing user privacy. Offline operation is thus natively supported.
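
As a hedged example of what such a hybrid store could look like, the schema below uses SQLite to hold documents and embedding blobs keyed by cluster; it is illustrative and not the schema documented in the source work.

```python
import sqlite3
import numpy as np

conn = sqlite3.connect("mobilerag.db")   # hypothetical local database file
conn.executescript("""
CREATE TABLE IF NOT EXISTS documents (
    doc_id  INTEGER PRIMARY KEY,
    text    TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS embeddings (
    doc_id     INTEGER PRIMARY KEY REFERENCES documents(doc_id),
    cluster_id INTEGER NOT NULL,   -- which per-cluster disk graph this vector belongs to
    vector     BLOB NOT NULL       -- raw float32 bytes of the embedding
);
CREATE INDEX IF NOT EXISTS idx_embeddings_cluster ON embeddings(cluster_id);
""")

def add_document(doc_id, text, vector):
    """Store a document and its embedding locally; no data leaves the device."""
    cluster_id = 0  # in practice, the id of the nearest EcoVector centroid
    conn.execute("INSERT INTO documents VALUES (?, ?)", (doc_id, text))
    conn.execute("INSERT INTO embeddings VALUES (?, ?, ?)",
                 (doc_id, cluster_id, np.asarray(vector, np.float32).tobytes()))
    conn.commit()
```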

The architecture circumvents privacy bottlenecks seen in prior server-augmented or thin-client RAG deployments. User queries, search histories, and document corpora remain local, making MobileRAG suitable for sensitive or regulated environments.

5. Practical Implications and Applications

MobileRAG’s structure supports several practical scenarios:

  • Offline and Privacy-Critical QA: Users can query local corpora (e.g., emails, notes, PDFs) without internet, ensuring data never leaves the device.
  • Personal Knowledge Base Search: Supports on-device knowledge organization and retrieval at scales previously only feasible in the cloud.
  • Interactive Mobile Assistants: Real-time, responsive RAG pipelines for summarization, planning, or context-aware help are now approachable with minimal latency and power draw.
  • Incremental Corpus Maintenance: Additions/deletions to the personal index (e.g., new documents) are efficient and do not require full reindexing.

A plausible implication is that the scalable, modular design of MobileRAG enables future integration with multimodal input (images, speech), further expanding its applicability on next-generation mobile hardware.

6. Future Directions

Several research and development paths follow the deployment of MobileRAG:

  • Hardware Acceleration: Exploration of dedicated NPU/GPU co-processing to further speed up both vector search and LM inference on mobile SoCs.
  • Advanced Embedding & Compression: Investigation of highly compressed or quantized embedding methods to enable handling even larger local corpora as device storage and compute continue to improve.
  • Multimodal Extensions: While the current system focuses on text, extension to multimodal retrieval will depend on advances in efficient mobile-friendly image/audio embedding and reduction techniques.
  • Developer Tooling and SDKs: The pipeline’s modular design opens the door to SDKs or libraries facilitating broader adoption in commercial mobile applications.

7. Historical Context and Relationship to Prior Work

Early RAG models assumed abundant server memory and power. Edge- and mobile-focused variants initially attempted partial solutions (e.g., partial disk-based storage), but still suffered from resource bottlenecks and poor update efficiency. MobileRAG synthesizes lessons from these approaches, introducing an end-to-end pipeline designed specifically for typical mobile hardware, as evidenced by direct empirical measurements and analytic formulations included in the source work.

References

  • MobileRAG: A Fast, Memory-Efficient, and Energy-Efficient Method for On-Device RAG (2507.01079)
  • Comparison data, technical figures, and formulas as documented in Sections 3.3 and 5 of the source manuscript.