MeTMaP: Metamorphic Testing for Detecting False Vector Matching Problems in LLM Augmented Generation (2402.14480v1)
Abstract: Augmented generation techniques such as Retrieval-Augmented Generation (RAG) and Cache-Augmented Generation (CAG) have revolutionized the field by enhancing LLM outputs with external knowledge and cached information. However, the vector databases that serve as the backbone of these augmentations introduce critical challenges, particularly in ensuring accurate vector matching. False vector matches can significantly compromise the integrity and reliability of LLM outputs, leading to misinformation or erroneous responses. Despite the crucial impact of these issues, there is a notable research gap in methods to effectively detect and address false vector matches in LLM-augmented generation. This paper presents MeTMaP, a metamorphic testing framework for identifying false vector matching in LLM-augmented generation systems. At its core are eight metamorphic relations (MRs), derived from six NLP datasets, that encode the idea that semantically similar texts should match and dissimilar ones should not. MeTMaP uses these MRs to create sentence triplets for testing, simulating real-world LLM scenarios. Our evaluation of MeTMaP over 203 vector matching configurations, involving 29 embedding models and 7 distance metrics, uncovers significant inaccuracies. The results, showing a maximum accuracy of only 41.51% on our tests compared to the original datasets, emphasize the widespread issue of false matches in vector matching methods and the critical need for effective detection and mitigation in LLM-augmented applications.
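To make the triplet-based metamorphic check concrete, the following is a minimal sketch of the underlying idea: embed an anchor sentence together with a semantically similar and a semantically dissimilar variant, then verify that the similar variant scores closer to the anchor than the dissimilar one. The embedding model, the example triplet, and the use of cosine similarity are illustrative assumptions, not the paper's exact configuration (MeTMaP evaluates 29 embedding models and 7 distance metrics).

```python
# Illustrative sketch of the metamorphic triplet check behind MeTMaP.
# Assumptions: sentence-transformers is installed, "all-MiniLM-L6-v2" is a
# stand-in embedding model, and cosine similarity is the distance metric.
import numpy as np
from sentence_transformers import SentenceTransformer


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


model = SentenceTransformer("all-MiniLM-L6-v2")  # hypothetical model choice

# Sentence triplet: (anchor, semantically similar, semantically dissimilar).
anchor = "The medication should be taken twice a day."
similar = "Take the medicine two times daily."       # should match the anchor
dissimilar = "The medication must never be taken."   # negation: should not match

emb_anchor, emb_similar, emb_dissimilar = model.encode([anchor, similar, dissimilar])

sim_pos = cosine_similarity(emb_anchor, emb_similar)
sim_neg = cosine_similarity(emb_anchor, emb_dissimilar)

# Metamorphic relation: a false-match violation is flagged when the
# dissimilar sentence scores at least as high as the similar one.
violation = sim_neg >= sim_pos
print(f"similar={sim_pos:.3f}  dissimilar={sim_neg:.3f}  violation={violation}")
```

A vector-matching configuration that frequently produces such violations across many triplets would retrieve misleading context in a RAG or CAG pipeline, which is exactly the failure mode the framework is designed to surface.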