
DSPy Assertions: Computational Constraints for Self-Refining Language Model Pipelines (2312.13382v2)

Published 20 Dec 2023 in cs.CL, cs.AI, and cs.PL

Abstract: Chaining language model (LM) calls as composable modules is fueling a new way of programming, but ensuring LMs adhere to important constraints requires heuristic "prompt engineering". We introduce LM Assertions, a programming construct for expressing computational constraints that LMs should satisfy. We integrate our constructs into the recent DSPy programming model for LMs, and present new strategies that allow DSPy to compile programs with LM Assertions into more reliable and accurate systems. We also propose strategies to use assertions at inference time for automatic self-refinement with LMs. We report on four diverse case studies for text generation and find that LM Assertions improve not only compliance with imposed rules but also downstream task performance, passing constraints up to 164% more often and generating up to 37% more higher-quality responses. Our reference implementation of LM Assertions is integrated into DSPy at https://github.com/stanfordnlp/dspy


Summary

  • The paper presents LM Assertions, a novel construct to enforce computational constraints in language model pipelines for improved accuracy.
  • It integrates these assertions into DSPy using techniques like assertion-driven backtracking and counterexample bootstrapping for dynamic self-refinement.
  • Experimental results show significant performance boosts, including a 7.9% improvement in retrieval recall and near-complete validity in generated quiz formats.

DSPy Assertions: Computational Constraints for Self-Refining Language Model Pipelines

The paper "DSPy Assertions: Computational Constraints for Self-Refining Language Model Pipelines" introduces a construct termed LM Assertions, which enhances the reliability and accuracy of LLMs when used as components of more complex computational pipelines. Specifically, the paper integrates these constructs into the recent DSPy programming model.

Abstract

The abstract highlights the need to ensure that LMs adhere to specific constraints, which traditionally requires heuristic "prompt engineering." LM Assertions offer a declarative approach for expressing computational constraints that LMs should satisfy. By integrating LM Assertions into DSPy, the paper presents new strategies to compile programs with these assertions into more reliable and accurate systems. A core contribution is the use of assertions at inference time for automatic self-refinement, which yields significant improvements across multiple tasks.

Introduction

LLMs are increasingly central to various applications, yet their probabilistic nature can lead to outputs that fall outside the desired domain constraints. Existing techniques like constrained decoding and heuristic prompt engineering are labor-intensive and often brittle. The introduction of LM Assertions provides a more systematic and extensible way to ensure LMs adhere to necessary computational constraints.

Contributions

  1. LM Assertions: A programming construct to enforce constraints on LM outputs within a pipeline (see the sketch following this list).
  2. Assertion-Driven Backtracking: Use of LM Assertions during inference to retry and refine outputs dynamically.
  3. Assertion-Driven Example Bootstrapping: Enhanced prompt optimization by incorporating assertions into the example selection process, creating more robust few-shot examples.
  4. Counterexample Bootstrapping: Development of demonstrations containing failed examples and their corrections to improve the LM’s reliability and adherence to constraints.
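
To make the first two contributions concrete, here is a minimal sketch of the constructs as exposed by the paper's reference implementation in DSPy (this reflects the DSPy 2.x assertions API; module paths and retry semantics may differ in later versions):

```python
import dspy

class SimpleQA(dspy.Module):
    def __init__(self):
        super().__init__()
        self.qa = dspy.ChainOfThought("question -> answer")

    def forward(self, question):
        answer = self.qa(question=question).answer

        # Hard constraint: on failure, DSPy backtracks and re-invokes the
        # predictor with the failing output and this message injected into
        # the prompt; if the retry budget is exhausted, an error is raised.
        dspy.Assert(len(answer) > 0, "The answer must not be empty.")

        # Soft constraint: same retry loop, but the pipeline proceeds with
        # the best available output if the constraint still fails.
        dspy.Suggest(len(answer) <= 280, "Keep the answer under 280 characters.")

        return dspy.Prediction(answer=answer)

# Wrapping the program activates assertion-driven backtracking at inference time.
program = SimpleQA().activate_assertions()
```

The split between `dspy.Assert` (halt after repeated failures) and `dspy.Suggest` (log and continue) mirrors the paper's distinction between strict invariants and soft guidelines.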

Motivating Example and Case Studies

The paper details a motivating example involving multi-hop question answering. By incorporating simple assertions—such as query length restrictions and ensuring distinct queries per retrieval hop—the pipeline demonstrates significant improvements in performance metrics like retrieval recall and answer accuracy.
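
As a rough sketch of how those two constraints attach to a multi-hop pipeline, the following closely follows the SimplifiedBaleen-style example from the DSPy documentation; the exact-match distinctness test simplifies the paper's similarity-based check, and the retriever is assumed to be configured elsewhere via `dspy.settings`:

```python
import dspy

class MultiHopQA(dspy.Module):
    def __init__(self, max_hops=2, passages_per_hop=3):
        super().__init__()
        self.generate_query = [dspy.ChainOfThought("context, question -> search_query")
                               for _ in range(max_hops)]
        self.retrieve = dspy.Retrieve(k=passages_per_hop)
        self.generate_answer = dspy.ChainOfThought("context, question -> answer")
        self.max_hops = max_hops

    def forward(self, question):
        context, prior_queries = [], []
        for hop in range(self.max_hops):
            query = self.generate_query[hop](context=context,
                                             question=question).search_query
            # The two constraints from the motivating example:
            dspy.Suggest(len(query) <= 100,
                         "Query should be short and less than 100 characters.")
            dspy.Suggest(query not in prior_queries,
                         "Query should be distinct from the queries of prior hops.")
            prior_queries.append(query)
            context += self.retrieve(query).passages
        return self.generate_answer(context=context, question=question)
```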

LongFormQA

In a LongFormQA task, the incorporation of LM Assertions ensures that long-form answers include citations and are faithful to their retrieved context. Metrics include citation faithfulness, recall, precision, and answer correctness.
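
One way to express the citation constraint is a regex check over the generated paragraph, in the spirit of the paper's implementation; the field names below are illustrative, and the paper's LM-judged faithfulness check is omitted:

```python
import re
import dspy

def is_cited(paragraph: str) -> bool:
    # Require a [n]-style citation within every window of two sentences.
    sentences = re.split(r"(?<=[.!?])\s+", paragraph.strip())
    return all(re.search(r"\[\d+\]", " ".join(sentences[i:i + 2]))
               for i in range(0, len(sentences), 2))

class LongFormQA(dspy.Module):
    def __init__(self):
        super().__init__()
        self.generate = dspy.ChainOfThought("context, question -> cited_paragraph")

    def forward(self, context, question):
        paragraph = self.generate(context=context, question=question).cited_paragraph
        dspy.Suggest(is_cited(paragraph),
                     "Every 1-2 sentences should include a citation in [n] format.")
        return dspy.Prediction(paragraph=paragraph)
```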

QuizGen

For generating quiz questions in JSON format, LM Assertions ensure correct formatting, inclusion of the correct answer, and the validity of distractor choices. The program significantly improved the consistency and quality of generated quizzes, as shown by a rise in valid JSON format completion from 37.6% to 98.8%.
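
A sketch of the format constraints follows; the plausibility-of-distractors check, which the paper delegates to a separate LM call, is omitted, and the signature field names are illustrative:

```python
import json
import dspy

def is_json_object(text: str) -> bool:
    # The quiz choices must parse as a JSON object, e.g. {"A": "...", "B": "..."}.
    try:
        return isinstance(json.loads(text), dict)
    except json.JSONDecodeError:
        return False

class QuizChoiceGen(dspy.Module):
    def __init__(self):
        super().__init__()
        self.gen = dspy.ChainOfThought("question, answer -> answer_choices")

    def forward(self, question, answer):
        choices = self.gen(question=question, answer=answer).answer_choices
        # Hard constraint: downstream consumers need parseable JSON.
        dspy.Assert(is_json_object(choices),
                    "Answer choices must be formatted as a valid JSON object.")
        # Soft constraint: the ground-truth answer should appear among the options.
        dspy.Suggest(answer in choices,
                     "The correct answer must be included among the answer choices.")
        return dspy.Prediction(answer_choices=choices)
```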

TweetGen

This task generates tweets that answer given questions. LM Assertions check characteristics such as length, engagement, faithfulness, and inclusion of the correct answer. Beyond these intrinsic quality checks, the evaluation also tracks attributes such as hashtag usage and an overall quality score.
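
The mechanically checkable subset of these constraints is straightforward to encode. A sketch under the same assumptions as above (engagement and faithfulness, which the paper evaluates with LM judges, are omitted, and the no-hashtag rule follows the reference implementation of this task):

```python
import dspy

class TweetGen(dspy.Module):
    def __init__(self):
        super().__init__()
        self.gen = dspy.ChainOfThought("question, context -> tweet")

    def forward(self, question, context):
        tweet = self.gen(question=question, context=context).tweet
        # Soft constraints on surface properties of the generated tweet.
        dspy.Suggest(len(tweet) <= 280, "The tweet must fit in 280 characters.")
        dspy.Suggest("#" not in tweet, "The tweet should not contain hashtags.")
        return dspy.Prediction(tweet=tweet)
```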

Evaluation

The evaluation spans four diverse case studies, reporting both intrinsic (constraint-compliance) and extrinsic (downstream-quality) metrics. LM Assertions facilitated substantial improvements in passing constraints and in downstream task performance. For instance, MultiHopQA's retrieval recall improved by up to 7.9%, and the validity of quiz questions in QuizGen surged from 30.5% to 87.2%.

Implications and Future Directions

The practical implications of integrating LM Assertions into LM pipelines are considerable. Beyond enhancing reliability and accuracy, the constructs simplify debugging and understanding LM behaviors in complex applications. The integration with DSPy also opens up avenues for more robust and automated prompt optimization techniques.

Conclusion

LM Assertions offer a structured and extensible way to enforce constraints and improve the overall reliability of LLM pipelines. By enabling dynamic self-refinement and robust prompt optimization, the research takes a foundational step toward making LLM pipelines more controllable and predictable.

Speculation on Future Developments

This research invites further exploration into combining LM Assertions with fine-grained control mechanisms and into integrating them with new LM frameworks. Future developments could focus on automating the generation of LM Assertions and on applying them to more diverse and complex applications, potentially broadening the scope of AI systems capable of self-governance and higher-level abstract reasoning.

By introducing LM Assertions and integrating them into DSPy, the paper provides a promising framework for advancing the accuracy and reliability of LLM pipelines. The practical and theoretical implications alike suggest new avenues for developing more sophisticated, self-regulating AI systems.
