MeTMaP: Metamorphic Testing for Detecting False Vector Matching Problems in LLM Augmented Generation (2402.14480v1)

Published 22 Feb 2024 in cs.SE

Abstract: Augmented generation techniques such as Retrieval-Augmented Generation (RAG) and Cache-Augmented Generation (CAG) have revolutionized the field by enhancing LLM outputs with external knowledge and cached information. However, the integration of vector databases, which serve as a backbone for these augmentations, introduces critical challenges, particularly in ensuring accurate vector matching. False vector matching in these databases can significantly compromise the integrity and reliability of LLM outputs, leading to misinformation or erroneous responses. Despite the crucial impact of these issues, there is a notable research gap in methods to effectively detect and address false vector matches in LLM-augmented generation. This paper presents MeTMaP, a metamorphic testing framework developed to identify false vector matching in LLM-augmented generation systems. We derive eight metamorphic relations (MRs) from six NLP datasets, which form our method's core, based on the idea that semantically similar texts should match and dissimilar ones should not. MeTMaP uses these MRs to create sentence triplets for testing, simulating real-world LLM scenarios. Our evaluation of MeTMaP over 203 vector matching configurations, involving 29 embedding models and 7 distance metrics, uncovers significant inaccuracies. The results, showing a maximum accuracy of only 41.51% on our tests compared to the original datasets, emphasize the widespread issue of false matches in vector matching methods and the critical need for effective detection and mitigation in LLM-augmented applications.
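The metamorphic relation underlying the triplet tests is easy to state: given an anchor sentence, a semantically similar sentence, and a semantically dissimilar one, a sound vector matcher should place the similar sentence closer to the anchor than the dissimilar one. Below is a minimal sketch of one such triplet check, assuming the sentence-transformers and scipy libraries and cosine distance; the model name, the violates_mr helper, and the example sentences are illustrative choices, not MeTMaP's actual implementation, which sweeps 29 embedding models and 7 distance metrics.

```python
# Illustrative triplet-style metamorphic check for a vector matching
# configuration. Assumptions: sentence-transformers and scipy are installed;
# model choice and sentences are hypothetical examples, not from the paper.
from sentence_transformers import SentenceTransformer
from scipy.spatial.distance import cosine

model = SentenceTransformer("all-MiniLM-L6-v2")  # embedding model under test

def violates_mr(anchor: str, similar: str, dissimilar: str) -> bool:
    """Return True if the matcher ranks the dissimilar sentence closer to
    the anchor than the semantically similar one, i.e. a false-match
    candidate under the metamorphic relation."""
    a, s, d = model.encode([anchor, similar, dissimilar])
    # Smaller cosine distance means a closer match.
    return cosine(a, d) < cosine(a, s)

# Example triplet: a paraphrase should match more closely than a negation.
print(violates_mr(
    "The medication reduces blood pressure.",
    "Blood pressure is lowered by the medication.",
    "The medication does not reduce blood pressure.",
))
```

Running such a check over many MR-generated triplets and counting violations mirrors, in spirit, how the paper scores each embedding-model/distance-metric configuration for false matches.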
