Guiding LLMs The Right Way: Fast, Non-Invasive Constrained Generation (2403.06988v1)
Abstract: To ensure that text generated by LLMs conforms to an expected format, constrained decoding enforces strict formal-language constraints during generation. However, as we show in this work, such methods not only incur performance overhead during generation; many of them also significantly impair task accuracy if they do not correctly align external constraints with the underlying LLM's subword vocabulary. To address this, we present a novel decoding algorithm, DOMINO, that enforces constraints in a fully subword-aligned fashion while leveraging pre-computation and speculative decoding to achieve virtually no overhead -- and in some cases almost a 2$\times$ speedup -- over unconstrained decoding, thereby outperforming existing approaches by a wide margin.
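To make the subword-alignment problem concrete, here is a minimal sketch (not the DOMINO algorithm itself, and using a hypothetical toy vocabulary and hand-built DFA) of token-level constrained decoding: a character-level automaton defines the allowed output format, and at each step every vocabulary token is masked out unless all of its characters advance the automaton. Because real subword tokens span multiple characters, a token is valid only if the whole piece stays inside the constraint, which is exactly the alignment issue the abstract describes.

```python
# Character-level DFA for the toy format {"a": <digits>} .
# States: 0 start, ..., 8 accept. Real constraints would be compiled
# from a regex or grammar; this one is written out by hand.
DFA = {
    0: {"{": 1},
    1: {'"': 2},
    2: {"a": 3},
    3: {'"': 4},
    4: {":": 5},
    5: {" ": 6},
    6: {d: 7 for d in "0123456789"},
    7: {**{d: 7 for d in "0123456789"}, "}": 8},
    8: {},
}
ACCEPT = {8}

def advance(state, token):
    """Run a (possibly multi-character) subword token through the DFA.
    Returns the new state, or None if any character is rejected."""
    for ch in token:
        state = DFA.get(state, {}).get(ch)
        if state is None:
            return None
    return state

# Toy "logits": fixed scores standing in for the model's next-token
# distribution. Note that multi-character pieces span several DFA steps.
SCORES = {
    '{"': 1.0,
    "hello": 0.95,  # high-scoring but never grammatical here -> always masked
    'a": ': 0.9,
    "}": 0.85,
    "42": 0.8,
    "4": 0.2,
    "2": 0.2,
}

def constrained_greedy_decode(scores):
    """Greedy decoding under the DFA: mask tokens that violate the
    constraint, then pick the highest-scoring survivor each step."""
    state, text = 0, ""
    while state not in ACCEPT:
        allowed = [(t, advance(state, t)) for t in scores]
        allowed = [(t, s) for t, s in allowed if s is not None]
        token, state = max(allowed, key=lambda ts: scores[ts[0]])
        text += token
    return text

print(constrained_greedy_decode(SCORES))  # -> {"a": 42}
```

Note that this naive version re-checks every vocabulary token character by character at every step; the pre-computation mentioned in the abstract amortizes exactly this cost by building the token-level transition table ahead of time.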
- Luca Beurer-Kellner
- Marc Fischer
- Martin Vechev