Guiding LLMs The Right Way: Fast, Non-Invasive Constrained Generation (2403.06988v1)

Published 7 Feb 2024 in cs.LG and cs.CL

Abstract: To ensure that text generated by LLMs is in an expected format, constrained decoding proposes to enforce strict formal language constraints during generation. However, as we show in this work, not only do such methods incur performance overhead during generation, but many of them also significantly impair task accuracy, if they do not correctly align the underlying LLM sub-word vocabularies with external constraints. To address this, we present a novel decoding algorithm, DOMINO, that can enforce constraints in a fully subword-aligned fashion, while leveraging pre-computation and speculative decoding to achieve virtually no overhead and in some cases even almost 2$\times$ speedup over unconstrained decoding -- thereby outperforming existing approaches by a wide margin.

Authors (3)
  1. Luca Beurer-Kellner (8 papers)
  2. Marc Fischer (30 papers)
  3. Martin Vechev (103 papers)
Citations (13)

Summary

  • The paper introduces DOMINO, a novel algorithm that addresses token misalignment in LLMs to enforce syntactic constraints without compromising performance.
  • It employs pre-computation and speculative decoding to achieve minimal intervention while ensuring efficient, low-overhead constrained generation.
  • Experimental results demonstrate that DOMINO outperforms baselines, achieving up to 1.77x throughput improvement on tasks like JSON format generation.

Overview of the Paper: Guiding LLMs The Right Way: Fast, Non-Invasive Constrained Generation

The paper "Guiding LLMs The Right Way: Fast, Non-Invasive Constrained Generation" by Luca Beurer-Kellner, Marc Fischer, and Martin Vechev presents a novel approach to constraining the output of LLMs to adhere to specified syntactic requirements without causing significant performance degradation or accuracy loss. Traditional methods of constrained decoding often incur performance overhead and can misalign with the sub-word vocabularies of LLMs, leading to decreased task accuracy. The paper introduces a new constrained decoding algorithm called DOMINO, which aims to address these challenges through a minimally-invasive approach.

Background and Motivation

The success of LLMs has led to a rising interest in constrained generation techniques, which are crucial for tasks requiring outputs to follow strict syntactic structures such as JSON, code, or specific grammatical constructs. Existing constrained decoding techniques often either compromise task accuracy or introduce overheads that can be impractical for real-time or high-throughput applications. This paper identifies token misalignment—a key challenge where the LLM's vocabulary does not align with the syntactic constraints—as a significant factor leading to reduced performance.
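
To see the misalignment concretely, consider the following sketch with a toy subword vocabulary (hypothetical, for illustration only; not from the paper). A constrainer that enforces grammar terminals one at a time forbids merged tokens that straddle a terminal boundary, even though a well-trained model strongly prefers them:

```python
# Toy vocabulary: ' true' is the merged token a model would naturally emit
# after '": ' in JSON, while 'true' is the bare literal.
vocab = ['{"', 'answer', '":', ' ', ' true', 'true', '}']

# A terminal-at-a-time constrainer has just consumed the terminals '":'
# and ' ' separately, and now admits only tokens that are prefixes of the
# next terminal, 'true':
remaining_terminal = 'true'
allowed = [t for t in vocab if remaining_terminal.startswith(t)]
print(allowed)  # ['true'] -- the natural merged token ' true' is masked out
```

Forcing the model onto token splits it rarely saw during training is precisely what degrades accuracy; a subword-aligned approach instead tracks terminal boundaries inside vocabulary tokens rather than only between them.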

Key Contributions and Methodology

The authors propose the DOMINO algorithm, which keeps constrained generation minimally invasive: any token an unconstrained model could produce under the same prompt remains available, so long as it is consistent with the constraints. To achieve this, DOMINO operates in a fully subword-aligned fashion:

  1. Token Misalignment Solution: DOMINO resolves the token misalignment issue by aligning LLM subword tokens with grammar terminals, yielding faithful, low-perplexity outputs.
  2. Pre-computation and Speculative Decoding: The algorithm uses pre-computation and speculative decoding to keep generation overhead low. Pre-computation enables efficient traversal of vocabulary-aligned subterminal trees, while speculative sampling accelerates token prediction without a full vocabulary scan.
  3. Opportunistic Masking: DOMINO first checks the LLM's proposed token against the constraints; only when that proposal violates them are further checks performed. This keeps intervention fast and minimal, engaging only when necessary (see the sketch after this list).
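
To make the opportunistic-masking idea concrete, here is a minimal greedy-decoding sketch. It illustrates the control flow only and is not the paper's implementation: the `opportunistic_decode_step` name and `is_valid` callback are hypothetical, and in DOMINO the validity check would be backed by the pre-computed, vocabulary-aligned subterminal trees described above.

```python
import torch

def opportunistic_decode_step(logits: torch.Tensor, is_valid) -> int:
    """One greedy decoding step with opportunistic masking (illustrative)."""
    # Fast path: check only the model's own top proposal against the
    # constraint. If it is already valid, skip the vocabulary-wide mask.
    proposed = int(torch.argmax(logits))
    if is_valid(proposed):
        return proposed

    # Slow path: the proposal violates the constraint, so build the full
    # validity mask and pick the best-scoring token among the valid ones.
    valid = torch.tensor([is_valid(t) for t in range(logits.shape[-1])])
    masked = logits.masked_fill(~valid, float("-inf"))
    return int(torch.argmax(masked))

# Toy usage: vocabulary of size 5, constraint accepts only token ids {2, 4}.
logits = torch.tensor([0.1, 3.0, 0.5, 0.2, 1.5])
print(opportunistic_decode_step(logits, lambda t: t in {2, 4}))  # prints 4
```

The design point is that the fast path costs a single membership check whenever the model already agrees with the constraint, which is the common case for a well-prompted model; the full vocabulary-sized mask is only paid when an intervention is actually needed.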

Evaluation and Results

In experiments on models such as Mistral 7B and Llama-2 13B, DOMINO demonstrates significantly lower overhead than other methods while maintaining or improving task accuracy. For instance, when generating JSON-formatted answers for the GSM8K dataset, DOMINO reached up to 1.77x the throughput of unconstrained generation while also slightly increasing accuracy over it.

The method consistently outperforms existing baseline approaches like llama.cpp and guidance-based templates, which often suffer from slower throughput and reduced accuracy due to invasive token constraining. DOMINO's approach of minimal intervention allows generated content to remain as natural and low-perplexity as possible.

Implications and Future Directions

The introduction of DOMINO holds considerable implications for machine learning practice, particularly in real-time settings where speed and accuracy are both critical. Practical applications extend to code generation, structured data synthesis, and any domain where output must adhere to strict syntactic rules. The paper's analysis also lays the groundwork for further optimization of constrained LLM decoding, and integration with other techniques such as fine-tuning or model compression could be explored.

Overall, DOMINO provides a significant step toward efficient and accurate constrained text generation, addressing critical challenges in the utilization of LLMs across diverse applications. Future research may focus on extending this approach to adaptive, context-sensitive decoding strategies and exploring applications in multilingual or multi-domain settings.
