MeMemo: On-device Retrieval Augmentation for Private and Personalized Text Generation (2407.01972v1)
Abstract: Retrieval-augmented text generation (RAG) addresses the common limitations of LLMs, such as hallucination, by retrieving information from an updatable external knowledge base. However, existing approaches often require dedicated backend servers for data storage and retrieval, thereby limiting their applicability in use cases that require strict data privacy, such as personal finance, education, and medicine. To address the pressing need for client-side dense retrieval, we introduce MeMemo, the first open-source JavaScript toolkit that adapts the state-of-the-art approximate nearest neighbor search technique HNSW to browser environments. Developed with modern and native Web technologies, such as IndexedDB and Web Workers, our toolkit leverages client-side hardware capabilities to enable researchers and developers to efficiently search through millions of high-dimensional vectors in the browser. MeMemo enables exciting new design and research opportunities, such as private and personalized content creation and interactive prototyping, as demonstrated in our example application RAG Playground. Reflecting on our work, we discuss the opportunities and challenges for on-device dense retrieval. MeMemo is available at https://github.com/poloclub/mememo.