Grammar Prompting for Domain-Specific Language Generation with Large Language Models
(2305.19234v3)
Published 30 May 2023 in cs.CL and cs.AI
Abstract: LLMs can learn to perform a wide range of natural language tasks from just a handful of in-context examples. However, for generating strings from highly structured languages (e.g., semantic parsing to complex domain-specific languages), it is challenging for the LLM to generalize from just a few exemplars. We propose \emph{grammar prompting}, a simple approach to enable LLMs to use external knowledge and domain-specific constraints, expressed through a grammar in Backus--Naur Form (BNF), during in-context learning. Grammar prompting augments each demonstration example with a specialized grammar that is minimally sufficient for generating the particular output example, where the specialized grammar is a subset of the full DSL grammar. For inference, the LLM first predicts a BNF grammar given a test input, and then generates the output according to the rules of the grammar. Experiments demonstrate that grammar prompting can enable LLMs to perform competitively on a diverse set of DSL generation tasks, including semantic parsing (SMCalFlow, Overnight, GeoQuery), PDDL planning, and SMILES-based molecule generation.
LLMs have demonstrated impressive few-shot learning capabilities, but generating strings that conform to the strict syntax of domain-specific languages (DSLs) remains challenging. This difficulty arises because DSLs often have complex, domain-specific structures that are unlikely to be fully learned during pretraining and are hard to convey with just a few examples. The paper "Grammar Prompting for Domain-Specific Language Generation with Large Language Models" (Wang et al., 2023) proposes grammar prompting, a method that leverages the ability of LLMs to interpret formal grammars in order to improve few-shot DSL generation.
The core idea is to augment the standard few-shot prompting approach with explicit grammar information. The method assumes access to a full Backus–Naur Form (BNF) grammar G defining the DSL's syntax. Instead of just providing input-output pairs (x(i), y(i)), grammar prompting includes a specialized grammar G[y(i)] for each example. This specialized grammar is defined as a minimal subset of the full grammar G that is sufficient to generate the specific output string y(i). It can be derived automatically by parsing y(i) with the full grammar G and collecting the rules used in the derivation.
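To make the derivation of specialized grammars concrete, here is a minimal sketch (not the authors' code) that parses an output string with a full grammar and collects the rules used in its derivation. The toy calendar-style grammar and the use of the `lark` parsing library are assumptions made purely for illustration.

```python
from lark import Lark

# Hypothetical toy DSL grammar (Lark syntax), standing in for a full BNF grammar G.
FULL_GRAMMAR = r"""
    start: call
    call: NAME "(" [args] ")"
    args: value ("," value)*
    value: call | STRING | NUMBER
    NAME: /[A-Za-z_]+/
    STRING: /"[^"]*"/
    NUMBER: /[0-9]+/
    %import common.WS
    %ignore WS
"""

parser = Lark(FULL_GRAMMAR)

def specialized_rules(y: str) -> set:
    """Parse output y with the full grammar and collect the rules used in its derivation."""
    tree = parser.parse(y)
    return {node.data for node in tree.iter_subtrees()}

# Only the rules needed for this particular output survive -- a subset of the full grammar.
print(specialized_rules('CreateEvent("standup", 30)'))  # e.g. {'start', 'call', 'args', 'value'}
```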
Inference then proceeds in two steps:
Given a new test input x, the LLM is first prompted to predict a specialized grammar Ĝ that is likely to be sufficient for generating the target output. The prompt for this step consists of N demonstration triples (x(i), G[y(i)], y(i)), followed by x and a request for a grammar; the LLM outputs its prediction Ĝ.
Once Ĝ is predicted, the LLM is prompted a second time, now conditioned on both x and Ĝ, to generate the final output program y. The prompt has the same structure, but the request is for the output string y given the predicted grammar Ĝ.
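A minimal sketch of this two-step procedure is shown below, assuming a generic `llm(prompt)` completion function (a hypothetical helper standing in for any chat or completions API) and demonstrations stored as (input, specialized grammar, output) triples; it mirrors the prompt layout described above but is not the authors' implementation.

```python
def format_demo(x, g_spec, y):
    # One demonstration: input, its minimally sufficient BNF grammar, and the output program.
    return f"Input: {x}\nBNF grammar:\n{g_spec}\nOutput: {y}\n\n"

def grammar_prompting(demos, x_test, llm):
    prefix = "".join(format_demo(x, g, y) for x, g, y in demos)

    # Step 1: predict a specialized BNF grammar for the test input.
    g_hat = llm(prefix + f"Input: {x_test}\nBNF grammar:\n")

    # Step 2: generate the program conditioned on the predicted grammar.
    y_hat = llm(prefix + f"Input: {x_test}\nBNF grammar:\n{g_hat}\nOutput:")
    return g_hat, y_hat
```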
This two-step approach can be viewed as a form of Chain-of-Thought prompting where the intermediate "thought" is a formal grammar rather than natural language. Predicting the specialized grammar first forces the LLM to reason about the necessary structural components of the output before generating the specific tokens.
A key practical challenge when generating structured output with LLMs, especially API-based ones, is ensuring the output is syntactically valid according to the grammar. The paper proposes an Earley-based constrained decoding algorithm to address this. Standard grammar-constrained decoding requires intervening at every token step to restrict generation to the currently valid token set, which is computationally expensive, or outright infeasible, when the model is only accessible through an API. The proposed method instead uses a speculative decoding approach (a code sketch follows the steps below):
At each step, the LLM speculatively decodes a sequence of tokens (a potential continuation).
An Earley parser checks if the concatenated prefix and speculative continuation form a valid prefix according to G.
If the full speculative continuation leads to a syntactically complete and valid program, it's returned.
If the full continuation is invalid, the Earley parser finds the longest valid prefix within the predicted sequence and identifies the set of valid next terminals according to G.
The LLM's probabilities (or an alternative scoring mechanism like Sentence-BERT similarity if direct logprobs are unavailable or too expensive) are used to select the most likely valid terminal from this set.
The selected valid terminal replaces the incorrect tokens, forming a new, corrected prefix, and the process repeats.
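The correction loop can be sketched roughly as follows. Here `llm_continue`, `is_complete`, `longest_valid_prefix`, `next_terminals`, and `score_terminal` are hypothetical stand-ins for the LLM call, the Earley parser queries, and the terminal-scoring step (logprobs or Sentence-BERT similarity); they are not a real API.

```python
def constrained_decode(grammar, prompt, llm_continue, is_complete,
                       longest_valid_prefix, next_terminals, score_terminal,
                       max_rounds=50):
    prefix = ""
    for _ in range(max_rounds):
        # Speculatively decode a full continuation in a single LLM call.
        candidate = prefix + llm_continue(prompt, prefix)

        # Accept it if it is a syntactically complete, valid program.
        if is_complete(grammar, candidate):
            return candidate

        # Otherwise keep the longest grammatically valid prefix of the candidate ...
        prefix = longest_valid_prefix(grammar, candidate)

        # ... and extend it with the highest-scoring valid next terminal.
        terminals = next_terminals(grammar, prefix)
        prefix += max(terminals, key=lambda t: score_terminal(prefix, t))
    return prefix  # best-effort result if the round budget is exhausted
```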
This constrained decoding ensures syntactic correctness, although the paper notes that it significantly increases the number of API calls compared to unconstrained decoding. They found that while useful for guaranteeing validity, the constraints were not always beneficial for metrics beyond correctness, like diversity in molecule generation or object selection in PDDL planning.
The paper evaluates grammar prompting across several diverse DSL domains:
Semantic Parsing: Translating natural language queries into DSL programs for calendar management (SMCalFlow), geography queries (GeoQuery), and a synthetic blocks world (Overnight-Blocks). Experiments in a true few-shot setting (16-32 examples) with Codex (code-davinci-002) show significant improvements over standard prompting and a derivation tree baseline. For instance, on GeoQuery (32-shot), grammar prompting with constrained decoding achieved 69.6% program accuracy and 88.9% execution accuracy, compared to standard prompting's 60.7% and 81.5%. The results indicate that predicting the specialized grammar acts as an effective "planning" step. The method also improved results in retrieval-based settings and, notably, on GeoQuery's out-of-distribution generalization splits, particularly those requiring generalization to new, unseen functions, suggesting that reasoning at the grammar level helps the LLM capture the structure of the DSL. Experiments with GPT-3.5, GPT-4, and PaLM 2-L also generally favored grammar prompting.
Molecule Generation: Generating class-specific molecules in SMILES format from a small set of examples of that class. Here, the "input" is empty, and the task is to sample novel, valid, synthesizable molecules of a specific type. Using GPT-3.5, grammar prompting improved metrics like Validity (V), Diversity (D), Retrosynthesis score (R), and Membership (M) for Acrylates and Chain Extenders compared to standard prompting and a graph grammar baseline. For Acrylates, grammar prompting achieved V=98.0, D=0.74, R=91.0, M=93.3 compared to standard prompting's V=87.7, D=0.73, R=80.0, M=76.7. The approach demonstrates LLMs' potential for generating chemical structures when guided by grammar constraints implicitly learned from class-specific examples.
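As an illustration of the validity metric alone (not the paper's evaluation code), a generated SMILES string can be checked by asking RDKit to parse it into a molecule:

```python
from rdkit import Chem

def validity(smiles_list):
    """Percentage of generated SMILES strings that RDKit can parse into a molecule."""
    valid = sum(Chem.MolFromSmiles(s) is not None for s in smiles_list)
    return 100.0 * valid / len(smiles_list)

# "C=CC(=O)OC" is methyl acrylate (a valid acrylate); the second string is not valid SMILES.
print(validity(["C=CC(=O)OC", "not_a_molecule"]))  # -> 50.0
```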
PDDL Planning: Guiding a classical planner based on greedy best-first search (GBFS) by predicting a specialized action DSL grammar containing the actions relevant to a given planning task (initial state plus goal state). The specialized grammar helps restrict the search space. Using GPT-3.5, grammar prompting improved the efficiency of the GBFS planner, reducing the number of created and expanded search nodes while maintaining or improving the success rate in the Blocks, Depot, and Satellite domains. For example, in the Blocks domain with macro actions, grammar prompting reduced the number of expanded nodes from 16 (Standard + Macro) to 9 while maintaining 100% success. This application highlights how LLMs can assist symbolic AI systems by performing high-level reasoning tasks such as action subset selection.
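Schematically, the predicted action DSL grammar names the subset of (macro-)actions worth considering for the instance, and the planner's search is restricted to that subset. A hypothetical sketch of such a restriction is shown below; the planner interface (`gbfs`, `all_ground_actions`) is an assumption for illustration, not a real library API.

```python
def restrict_to_predicted_actions(ground_actions, predicted_action_names):
    """Keep only grounded actions whose operator name appears in the predicted action DSL."""
    return [a for a in ground_actions if a.name in predicted_action_names]

# Hypothetical usage with a GBFS planner:
# plan = gbfs(initial_state, goal_state,
#             actions=restrict_to_predicted_actions(all_ground_actions, {"stack", "unstack"}))
```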
The paper notes limitations: grammar prompting did not improve performance on DSLs likely to be heavily represented in pretraining data (e.g., SQL, regular expressions). Constrained decoding, while ensuring validity, can increase cost significantly and sometimes negatively impact sample diversity or object selection effectiveness.
Overall, grammar prompting provides a simple yet effective technique for enhancing the few-shot generation capabilities of LLMs on complex DSLs by explicitly incorporating formal grammar knowledge into the prompt. The results suggest that LLMs possess an implicit understanding of metalanguages such as BNF and can leverage this understanding when prompted appropriately, opening avenues for applying LLMs in domains that rely heavily on structured languages and formal constraints.