(Why) Is My Prompt Getting Worse? Rethinking Regression Testing for Evolving LLM APIs (2311.11123v2)
Abstract: LLMs are increasingly integrated into software applications. Downstream application developers often access LLMs through APIs provided as a service. However, LLM APIs are frequently updated silently and scheduled for deprecation, forcing users to continuously adapt to evolving models. This can cause performance regressions and affect prompt design choices, as evidenced by our case study on toxicity detection. Motivated by these findings, we emphasize the need for regression testing of evolving LLM APIs and re-examine what the concept means in this setting. We argue that regression testing LLMs requires fundamental changes to traditional testing approaches, due to differing notions of correctness, prompt brittleness, and non-determinism in LLM APIs.
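The abstract's three complications (statistical rather than exact correctness, prompt brittleness, and non-determinism) translate directly into test design. Below is a minimal sketch of what a regression test across two LLM API versions could look like. It is not the paper's artifact: the toy labeled set, the prompt, the sample count, the tolerance, and the use of the OpenAI Python client are all illustrative assumptions.

```python
# A minimal sketch of a regression test for an evolving LLM API, illustrating
# the three complications named in the abstract. Dataset, prompt, thresholds,
# and model snapshots are illustrative assumptions, not the paper's artifact.

from openai import OpenAI  # assumes the OpenAI Python client; any provider works

client = OpenAI()

# Tiny labeled set standing in for the application's own evaluation data.
LABELED_SET = [
    ("You are a wonderful person.", False),
    ("Shut up, you idiot.", True),
    # ... more (text, is_toxic) pairs in practice
]

PROMPT = "Answer 'yes' or 'no': is the following comment toxic?\n\n{text}"
N_SAMPLES = 5     # repeated queries per input, since LLM APIs are non-deterministic
TOLERANCE = 0.02  # allowed accuracy drop before flagging a regression


def classify(model: str, text: str) -> bool:
    """Majority vote over repeated calls to absorb per-call non-determinism."""
    votes = 0
    for _ in range(N_SAMPLES):
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": PROMPT.format(text=text)}],
        )
        votes += resp.choices[0].message.content.strip().lower().startswith("yes")
    return votes > N_SAMPLES / 2


def accuracy(model: str) -> float:
    """Correctness is statistical: accuracy over a labeled set, not exact match."""
    hits = sum(classify(model, text) == label for text, label in LABELED_SET)
    return hits / len(LABELED_SET)


def test_no_regression_across_model_versions():
    # The same prompt is pinned across versions; prompt brittleness means a
    # prompt tuned for one snapshot may silently degrade on its successor.
    baseline = accuracy("gpt-3.5-turbo-0301")   # old pinned snapshot
    candidate = accuracy("gpt-3.5-turbo-0613")  # announced replacement
    assert candidate >= baseline - TOLERANCE, (
        f"accuracy regressed from {baseline:.2f} to {candidate:.2f}"
    )
```

Run under pytest before migrating off a deprecated snapshot. The tolerance threshold and the majority vote are exactly where traditional exact-match regression testing has to bend: a single differing output is no longer a failure, and a single call is no longer a reliable observation.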