SPADE: Synthesizing Data Quality Assertions for Large Language Model Pipelines (2401.03038v2)
Abstract: LLMs are increasingly deployed as part of pipelines that repeatedly process or generate data. However, a common barrier to deployment is the frequent and often unpredictable errors that plague LLMs. Acknowledging the inevitability of these errors, we propose {\em data quality assertions} to identify when LLMs may be making mistakes. We present SPADE, a method for automatically synthesizing data quality assertions that identify bad LLM outputs. We observe that developers often identify data quality issues during prototyping prior to deployment, and attempt to address them by adding instructions to the LLM prompt over time. SPADE therefore analyzes histories of prompt versions to create candidate assertion functions and then selects a minimal set that fulfills both coverage and accuracy requirements. In testing across nine different real-world LLM pipelines, SPADE efficiently reduces the number of assertions by 14\% and decreases false failures by 21\% when compared to simpler baselines. SPADE has been deployed as an offering within LangSmith, LangChain's LLM pipeline hub, and has been used to generate data quality assertions for over 2000 pipelines across a spectrum of industries.
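To make the abstract's workflow concrete, below is a minimal sketch, in Python, of what a pair of candidate data quality assertions and a coverage/accuracy-driven selection step might look like. All names here (no_apology, valid_json, select_assertions, max_false_failure_rate) are illustrative assumptions, not SPADE's or LangSmith's actual API, and the greedy loop is a simple stand-in for the minimal-set selection the abstract describes.

```python
# Hypothetical sketch of SPADE-style assertions and selection; not the paper's implementation.
import json
from typing import Callable, Dict, List, Tuple

Assertion = Callable[[str], bool]  # returns True when an LLM output passes the check


def no_apology(output: str) -> bool:
    """Candidate assertion: fail outputs where the model refuses or apologizes."""
    return "i'm sorry" not in output.lower()


def valid_json(output: str) -> bool:
    """Candidate assertion: fail outputs that are not parseable JSON."""
    try:
        json.loads(output)
        return True
    except ValueError:
        return False


def select_assertions(
    candidates: Dict[str, Assertion],
    labeled_outputs: List[Tuple[str, bool]],  # (output_text, is_good) pairs from prototyping
    max_false_failure_rate: float = 0.1,
) -> List[str]:
    """Greedy stand-in for coverage/accuracy selection: repeatedly add the assertion
    that catches the most still-uncovered bad outputs, provided it rarely fails on
    known-good outputs."""
    bad = [text for text, is_good in labeled_outputs if not is_good]
    good = [text for text, is_good in labeled_outputs if is_good]
    uncovered = set(range(len(bad)))
    selected: List[str] = []

    while uncovered:
        best_name, best_catch = None, set()
        for name, check in candidates.items():
            if name in selected:
                continue
            false_failures = sum(not check(text) for text in good)
            if good and false_failures / len(good) > max_false_failure_rate:
                continue  # too many false alarms on good outputs
            catch = {i for i in uncovered if not check(bad[i])}
            if len(catch) > len(best_catch):
                best_name, best_catch = name, catch
        if best_name is None:
            break  # no remaining assertion improves coverage within the accuracy budget
        selected.append(best_name)
        uncovered -= best_catch
    return selected


if __name__ == "__main__":
    outputs = [
        ('{"answer": 42}', True),
        ("I'm sorry, I can't help with that.", False),
        ("not json at all", False),
    ]
    print(select_assertions({"no_apology": no_apology, "valid_json": valid_json}, outputs))
```

In this toy run the selector keeps only `valid_json`, since it alone covers both bad outputs without failing the good one; the paper's actual method formulates the trade-off between coverage and false failures more rigorously than this greedy loop.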