
MeMemo: On-device Retrieval Augmentation for Private and Personalized Text Generation (2407.01972v1)

Published 2 Jul 2024 in cs.IR, cs.AI, cs.HC, and cs.LG

Abstract: Retrieval-augmented text generation (RAG) addresses the common limitations of LLMs, such as hallucination, by retrieving information from an updatable external knowledge base. However, existing approaches often require dedicated backend servers for data storage and retrieval, thereby limiting their applicability in use cases that require strict data privacy, such as personal finance, education, and medicine. To address the pressing need for client-side dense retrieval, we introduce MeMemo, the first open-source JavaScript toolkit that adapts the state-of-the-art approximate nearest neighbor search technique HNSW to browser environments. Developed with modern and native Web technologies, such as IndexedDB and Web Workers, our toolkit leverages client-side hardware capabilities to enable researchers and developers to efficiently search through millions of high-dimensional vectors in the browser. MeMemo enables exciting new design and research opportunities, such as private and personalized content creation and interactive prototyping, as demonstrated in our example application RAG Playground. Reflecting on our work, we discuss the opportunities and challenges for on-device dense retrieval. MeMemo is available at https://github.com/poloclub/mememo.


Summary

  • The paper demonstrates the feasibility of on-device dense retrieval by adapting the HNSW approximate nearest neighbor technique to the browser with IndexedDB and Web Workers, enabling efficient in-browser retrieval for text generation.
  • The paper introduces RAG Playground, a no-code environment for prototyping retrieval-augmented generation pipelines that improve response accuracy while preserving user privacy.
  • The paper offers an open-source toolkit that integrates modern web and machine learning technologies to support scalable, privacy-preserving AI applications.

MeMemo: On-device Retrieval Augmentation for Private and Personalized Text Generation

MeMemo is a JavaScript toolkit for in-browser retrieval-augmented generation (RAG) that tackles the twin challenges of data privacy and client-side computation, offering a practical solution for researchers and practitioners. Authored by Zijie J. Wang and Duen Horng Chau, the paper brings on-device dense retrieval to the browser by combining hierarchical navigable small world (HNSW) graphs, modern web technologies, and a prefetching strategy to handle large vector databases entirely on the client.

Overview of the Paper

Introduction

The paper begins by outlining the motivation for retrieval-augmented text generation (RAG): mitigating common limitations of LLMs, such as hallucination. Existing RAG systems typically depend on backend servers for data storage and retrieval, which limits their applicability in privacy-sensitive domains such as personal finance, education, and medicine. MeMemo addresses this limitation by enabling dense vector retrieval entirely within the browser.

Contributions

1. Client-Side Dense Retrieval

The centerpiece of MeMemo is its adaptation of the HNSW approximate nearest neighbor technique to JavaScript, allowing efficient search through millions of high-dimensional vectors using client-side hardware. MeMemo uses IndexedDB for persistent storage and Web Workers to keep retrieval work off the main thread, so queries stay responsive even as the index grows to millions of vectors. A simplified sketch of the underlying graph search follows.
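The snippet below illustrates the greedy graph traversal at the heart of HNSW-style search. It is a minimal TypeScript sketch, not MeMemo's actual API: the in-memory maps stand in for graph data that MeMemo would persist in IndexedDB and fetch on demand.

```typescript
// Illustrative sketch of greedy best-first search over one layer of an
// HNSW-style proximity graph (the core primitive the toolkit adapts to the browser).

type NodeId = number;

interface GraphLayer {
  neighbors: Map<NodeId, NodeId[]>;   // adjacency lists of the layer
  vectors: Map<NodeId, Float32Array>; // node embeddings
}

function cosineDistance(a: Float32Array, b: Float32Array): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return 1 - dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Greedy descent: repeatedly move to the neighbor closest to the query until
// no neighbor improves the distance, returning a locally nearest node.
function greedySearchLayer(layer: GraphLayer, query: Float32Array, entry: NodeId): NodeId {
  let current = entry;
  let currentDist = cosineDistance(layer.vectors.get(current)!, query);
  let improved = true;
  while (improved) {
    improved = false;
    for (const neighbor of layer.neighbors.get(current) ?? []) {
      const d = cosineDistance(layer.vectors.get(neighbor)!, query);
      if (d < currentDist) {
        current = neighbor;
        currentDist = d;
        improved = true;
      }
    }
  }
  return current;
}
```

In the full algorithm this traversal is run from the top layer down, with the result of each layer used as the entry point for the next, and a candidate list (rather than a single node) kept at the bottom layer.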

2. RAG Playground

The toolkit's efficacy is demonstrated through an example application, RAG Playground, which provides an interactive platform for prototyping RAG applications in the browser. This no-code tool lets stakeholders with diverse technical backgrounds explore RAG design choices: users enter a query, retrieve relevant documents, augment the prompt with the retrieved context, and run an on-device LLM to validate improvements in response reliability and accuracy. The core loop is sketched below.
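The following sketch shows the retrieve-then-generate loop the Playground walks through. The helper functions are passed in as parameters and are placeholders for an in-browser embedding model, a client-side vector index, and an on-device LLM; none of the names are RAG Playground's real API.

```typescript
// Hypothetical RAG loop: embed the query, retrieve local context, augment the
// prompt, and generate an answer entirely on-device.
async function answerWithRag(
  question: string,
  embed: (text: string) => Promise<Float32Array>,             // in-browser embedding model
  search: (v: Float32Array, k: number) => Promise<string[]>,  // client-side vector index
  generate: (prompt: string) => Promise<string>               // on-device LLM
): Promise<string> {
  // 1. Embed the user query on-device.
  const queryVector = await embed(question);

  // 2. Retrieve the k most relevant documents from the local index.
  const documents = await search(queryVector, 3);

  // 3. Augment the prompt with the retrieved context.
  const prompt =
    `Answer the question using only the context below.\n\n` +
    `Context:\n${documents.join('\n---\n')}\n\nQuestion: ${question}`;

  // 4. Generate the grounded answer.
  return generate(prompt);
}
```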

3. Open-Source Accessibility

MeMemo is released as an open-source project, complete with comprehensive documentation and an example application to facilitate adoption and adaptation by researchers and developers. Its design emphasizes minimal dependencies and usability within various web development stacks (TypeScript, JavaScript, React, Svelte, and Lit).

Results and Implementations

Performance and Integration

The authors illustrate the performance implications through usage scenarios that show how MeMemo supports efficient prototyping of client-side RAG systems. For instance, building an HNSW index over one million 384-dimensional vectors is slower than on dedicated server-side systems, but subsequent queries run in real time, demonstrating practical viability.
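A rough storage estimate helps explain why construction at this scale stresses the browser. The figures below are our own back-of-envelope numbers assuming raw float32 storage, not measurements from the paper.

```typescript
// Back-of-envelope footprint for one million 384-dimensional vectors.
const numVectors = 1_000_000;
const dimensions = 384;
const bytesPerFloat32 = 4;
const rawBytes = numVectors * dimensions * bytesPerFloat32;   // 1,536,000,000 bytes
console.log(`${(rawBytes / 1024 ** 3).toFixed(2)} GiB of raw vector data`); // ≈ 1.43 GiB
```

On top of the raw vectors, the HNSW graph adds neighbor lists per node, so persisting and paging this data through IndexedDB is a nontrivial engineering constraint for in-browser indexing.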

The paper also details how MeMemo can be combined with existing web machine learning technologies for real-time applications. By pairing the retrieval index with in-browser models, such as LLMs served through WebLLM or embedding models running on ONNX-based runtimes, developers can build private and personalized content generation workflows for domains where data privacy is paramount.
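As one concrete pairing, an in-browser embedding model can produce the vectors that feed the index. The sketch below assumes the Transformers.js feature-extraction pipeline and a publicly available 384-dimensional MiniLM model; it is an illustrative combination, not a pipeline prescribed by the paper.

```typescript
import { pipeline } from '@xenova/transformers';

// Load an in-browser embedding model once (weights are downloaded and cached).
const embedder = await pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2');

// Embed a document or query into a normalized 384-dimensional vector that can
// be inserted into, or used to query, a client-side vector index.
async function embed(text: string): Promise<Float32Array> {
  const output = await embedder(text, { pooling: 'mean', normalize: true });
  return output.data as Float32Array;
}
```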

Challenges and Future Directions

Despite its innovative approach, MeMemo's index construction is slower in the browser than in native systems because of browser computation and storage constraints. This could be improved by parallelizing insertion, for example across Web Workers, and by refining the prefetching strategy. Future work could also explore personal information management, interactive RAG prototyping, and optimized on-device retrieval for mobile and IoT devices.
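One generic mitigation pattern, built only on standard Web APIs and not drawn from MeMemo's internals, is to move bulk distance computation into a Web Worker so the main thread stays responsive during index construction:

```typescript
// Generic Web Worker offload: compute distances off the main thread.
const workerSource = `
  onmessage = (event) => {
    const { query, vectors } = event.data;
    // Cosine distance for unit-normalized vectors reduces to 1 - dot product.
    const distances = vectors.map((v) => {
      let dot = 0;
      for (let i = 0; i < v.length; i++) dot += v[i] * query[i];
      return 1 - dot;
    });
    postMessage(distances);
  };
`;

const worker = new Worker(
  URL.createObjectURL(new Blob([workerSource], { type: 'application/javascript' }))
);

worker.onmessage = (e: MessageEvent<number[]>) => console.log('distances', e.data);
worker.postMessage({
  query: [0.6, 0.8],                      // toy unit-length query vector
  vectors: [[1, 0], [0.6, 0.8], [0, 1]],  // toy candidate vectors
});
```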

Implications

MeMemo paves the way for broader adoption of privacy-preserving AI tools. It creates opportunities to extend dense retrieval to personal devices and opens new research avenues in interactive ML systems. By bringing dense retrieval to the client, it closes a significant gap on the path to scalable, private, and personalized AI applications that run directly in users' browsers.

Conclusion

MeMemo advances retrieval-augmented generation by enabling in-browser dense retrieval while addressing privacy and performance challenges head-on. Together with RAG Playground, the toolkit serves as a versatile and accessible resource for researchers and developers who want to apply on-device RAG in their applications. With its open-source availability, MeMemo is well positioned to inspire further advances in on-device AI.

For more details, MeMemo's code and usage examples are available in the GitHub repository: https://github.com/poloclub/mememo.
