SPADE: Synthesizing Data Quality Assertions for Large Language Model Pipelines (2401.03038v2)

Published 5 Jan 2024 in cs.DB and cs.SE

Abstract: LLMs are increasingly deployed as part of pipelines that repeatedly process or generate data of some sort. However, a common barrier to deployment is the frequent and often unpredictable errors that plague LLMs. Acknowledging the inevitability of these errors, we propose data quality assertions to identify when LLMs may be making mistakes. We present SPADE, a method for automatically synthesizing data quality assertions that identify bad LLM outputs. We make the observation that developers often identify data quality issues during prototyping prior to deployment, and attempt to address them by adding instructions to the LLM prompt over time. SPADE therefore analyzes histories of prompt versions over time to create candidate assertion functions and then selects a minimal set that fulfills both coverage and accuracy requirements. In testing across nine different real-world LLM pipelines, SPADE efficiently reduces the number of assertions by 14% and decreases false failures by 21% when compared to simpler baselines. SPADE has been deployed as an offering within LangSmith, LangChain's LLM pipeline hub, and has been used to generate data quality assertions for over 2000 pipelines across a spectrum of industries.


Summary

  • The paper proposes a two-component framework using prompt deltas and ILP-based filtering to automatically generate and select assertions for LLM pipelines.
  • It demonstrates a 14% reduction in assertion counts and a 21% decrease in false failure rates across nine diverse LLM pipelines.
  • The system is validated in real-world deployments with over 1300 runs, providing a practical advance in enhancing LLM pipeline reliability.

Synthesizing Assertions for LLM Pipelines: An Evaluation of SPADE

The paper "spade: Synthesizing Assertions for LLM Pipelines" addresses the challenges inherent in deploying LLMs within data-generation pipelines. Given the propensity for errors and unpredictable outputs from LLMs, the authors propose a systematic method to automatically generate assertions aimed at identifying incorrect LLM outputs within these pipelines.

Overview of Methodology

This work is structured around a two-component framework, SPADE, designed to enhance the robustness of LLM data pipelines. The core contributions include:

  1. Assertion Generation from Prompt Deltas: The authors introduce prompt deltas, defined as differences between consecutive versions of a prompt template, as a rich source of assertion criteria. A taxonomy classifies prompt deltas into categories such as Response Format Instruction, Workflow Description, and Qualitative Criteria. Drawing on nine real-world pipelines, the authors extract criteria from prompt deltas and use LLMs to generate candidate assertion functions (a hypothetical candidate of this kind is sketched after this list).
  2. Filtering Assertions via ILP and Subsumption: The candidate assertions are often redundant or inaccurate and therefore need filtering. The authors formalize an Integer Linear Program (ILP) to select a minimal set of assertions fulfilling coverage and accuracy requirements. Furthermore, they introduce a novel "subsumption" strategy, ensuring that the selected assertions collectively cover potential failures while adhering to predefined thresholds on false failure rate (FFR) and coverage (see the ILP sketch after this list).
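
To make the first component concrete, below is a hypothetical illustration of how a prompt delta could give rise to a candidate assertion. The added instruction, the function name, and its signature are assumptions for illustration rather than examples from the paper; this is only a sketch of the kind of Python check that SPADE prompts an LLM to generate.

```python
import json

# Hypothetical prompt delta (Response Format Instruction category):
#   + "Return your answer as JSON with a 'summary' key."
# The candidate assertion below checks exactly that added instruction.

def assert_json_with_summary(response: str) -> bool:
    """Return True if the LLM output satisfies the added format instruction."""
    try:
        payload = json.loads(response)
    except json.JSONDecodeError:
        return False
    return isinstance(payload, dict) and "summary" in payload
```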
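
For the second component, the following sketch conveys the flavor of the selection ILP, assuming a binary record of which candidate assertions flag which labeled examples. The data, variable names, and the exact coverage and false-failure constraints are simplified stand-ins for the paper's formulation; the sketch uses PuLP with its bundled CBC solver.

```python
import pulp

# Hypothetical inputs (illustrative, not from the paper):
#   results[i][j] is True if candidate assertion i flags labeled example j,
#   bad[j] is True if a human labeled example j as a genuinely bad output.
results = [
    [True,  False, True,  False],
    [True,  False, False, False],
    [False, False, True,  True],
]
bad = [True, False, True, True]
n_assert, n_ex = len(results), len(bad)
good = [j for j in range(n_ex) if not bad[j]]
tau_ffr = 0.25  # maximum tolerated false-failure rate (assumed threshold)

prob = pulp.LpProblem("select_minimal_assertions", pulp.LpMinimize)
x = pulp.LpVariable.dicts("use", range(n_assert), cat="Binary")
y = pulp.LpVariable.dicts("false_fail", good, cat="Binary")

# Objective: keep as few assertions as possible.
prob += pulp.lpSum(x[i] for i in range(n_assert))

# Coverage: every genuinely bad example must be flagged by a selected assertion.
for j in range(n_ex):
    if bad[j]:
        prob += pulp.lpSum(x[i] for i in range(n_assert) if results[i][j]) >= 1

# False failures: a good example is falsely failed if any selected assertion
# flags it; bound the fraction of such examples by tau_ffr.
for j in good:
    for i in range(n_assert):
        if results[i][j]:
            prob += y[j] >= x[i]
prob += pulp.lpSum(y[j] for j in good) <= tau_ffr * len(good)

prob.solve(pulp.PULP_CBC_CMD(msg=False))
selected = [i for i in range(n_assert) if pulp.value(x[i]) > 0.5]
print("selected assertions:", selected)
```

Minimizing the number of selected assertions under these constraints mirrors the coverage and accuracy trade-off described above; SPADE additionally exploits subsumption relations between assertions, which this sketch omits.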

Empirical Evaluation

The paper details an extensive evaluation over nine diverse LLM pipelines, spanning tasks that range from generating negotiation strategies to summarizing lectures. This evaluation demonstrates:

  • Reduction in Assertion Count: SPADE reduces the number of assertions by 14% on average when leveraging subsumption, compared to a simpler baseline that does not filter comprehensively.
  • Decrease in False Failure Rates: A reduction of 21% in false failure rates is achieved, underscoring the effectiveness of the filtering mechanism in maintaining assertion accuracy while avoiding unnecessary duplication.
  • Practical Deployment and User Adoption: The prototype SPADE system, deployed via a web interface, garnered significant user engagement with over 1300 runs, highlighting its utility in real-world applications.

Theoretical Implications and Future Directions

The paper contributes substantially to the field by showing the assertion selection problem to be NP-hard and proposing solutions that balance computational feasibility with comprehensive coverage. The use of LLMs both as a source of assertions and as evaluators within the pipeline reflects an innovative pattern of using AI to check AI, indicating potential for further automation in LLM deployment tasks.

Future developments might explore expanding SPADE's applicability to more complex, multi-step LLM pipelines, or enhancing robustness using advanced prompt engineering strategies. Integrating passive data collection methods for labeled examples and developing methodologies to continuously refine assertions in line with evolving LLM capabilities would be valuable extensions.

In summary, with SPADE the authors advance automated methods for improving the reliability of LLM-driven tasks, providing an essential tool for researchers and practitioners aiming to use LLMs more confidently in production environments. This alignment of assertion synthesis with the dynamics of LLM pipelines is a practical step toward operationalizing AI amid the inherent unpredictability of model outputs.