ProADA: A Program Synthesis Framework
- ProADA is a program synthesis–driven framework for interactive dialog agents that converts natural-language eligibility criteria into verifiable executable decision logic.
- It integrates a code-generation module with an error-driven dialog loop to detect missing features and target user queries, ensuring accurate input capture.
- Validated on the BeNYfits benchmark, ProADA achieves a 19.9-point F1 improvement over traditional methods while maintaining similar dialog turn lengths.
ProADA is a program synthesis–driven framework for interactive dialog agents that address decision-making tasks where eligibility for one or more programs is determined by soliciting user-specific features. ProADA operationalizes dialog planning as an explicit code generation and execution problem, leveraging LLMs to synthesize programmatic “decision checkers” from natural-language eligibility requirements. This approach enables the agent to identify missing information at runtime by intercepting execution errors and to elicit only the necessary user inputs, reducing hallucinations common in free-form LLM prompting. ProADA was introduced and validated in the context of multi-program social benefits eligibility screening, setting new accuracy standards on the BeNYfits benchmark (Toles et al., 26 Feb 2025).
1. Problem Scope and Formalization
ProADA addresses dialog-based eligibility classification over a set of binary-outcome programs $P = \{p_1, \dots, p_n\}$ (e.g., government benefits, access controls). At each dialog turn, the agent must either:
- Query the user for a missing feature relevant to eligibility, or
- Issue a set of decisions $\{d_i\}$ for all programs $p_i \in P$.
Let the state at time $t$ be $hh_t$, a partial assignment mapping feature keys to values drawn from a pre-specified set of types and domains.
The action space is defined as $A = \{\mathrm{ask}(k) : k \in K\} \cup \{\mathrm{predict}\}$, where $\mathrm{ask}(k)$ elicits feature $k$ and $\mathrm{predict}$ triggers prediction.
The optimization objective is a cost-sensitive tradeoff between accuracy and dialog length, commonly cast as minimizing $(1 - F_1) + \lambda T$ or, equivalently, maximizing $J = F_1 - \lambda T$, where $T$ is the total number of questions asked and $\lambda > 0$ weights dialog cost.
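The tradeoff above can be made concrete with a small sketch; the weight `lam` and the scores below are illustrative values, not figures from the paper:

```python
# Illustrative accuracy/length tradeoff J = F1 - lambda * T.
# lam is a hypothetical cost weight; the framework leaves it task-dependent.

def objective(f1: float, turns: float, lam: float = 0.5) -> float:
    """Score a dialog policy: reward F1 (in percent), penalize questions asked."""
    return f1 - lam * turns

# A longer dialog must buy enough extra F1 to pay for its added turns.
shorter = objective(f1=50.0, turns=10)  # 50 - 0.5*10 = 45.0
longer = objective(f1=56.0, turns=16)   # 56 - 0.5*16 = 48.0
assert longer > shorter
```

Under this objective, asking more questions is worthwhile only when the accuracy gain exceeds the per-question cost.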
2. System Architecture and Execution Loop
ProADA is characterized by two tightly integrated modules:
- Code-Generation Module:
- Consumes natural-language eligibility requirements.
- Synthesizes a deterministic Python function (“decide”) that computes decision outcomes over a feature dictionary `hh`, e.g.,

```python
def decide(hh: dict) -> Union[bool, List[bool]]:
    # Logic using hh[k] for all required keys k;
    # any missing key raises KeyError, exposing a feature to elicit.
    ...
```
- All decision branching resides in the generated code; no inference is performed at runtime beyond code execution.
- Dialog Loop (Error-Gap Driven):
- Initializes with an empty feature dict `hh = {}`.
- Executes “decide(hh)” inside a try/except block:
- If a decision is returned, the agent halts and outputs the result.
- If a KeyError for feature $k$ is raised, the dialog module generates a targeted natural-language question for $k$, parses the user’s answer, type-checks and normalizes it to a value $v$ of the expected type, and sets $hh[k] \gets v$.
- This process repeats until no further features are missing.
LaTeX-format pseudocode sketch:

\begin{algorithm}[h]
\caption{ProADA Main Loop}
\begin{algorithmic}[1]
\State $hh \gets \{\}$
\State $\mathrm{Decide} \gets \Call{CodeGen}{\text{requirements}}$
\Loop
  \Try
    \State $d \gets \mathrm{Decide}(hh)$
    \State \Return $d$
  \Except{\textsc{KeyError} $k$}
    \State $q \gets \Call{MakeQuestion}{k, \mathrm{Decide}, \mathrm{history}}$
    \State $a \gets \Call{UserAnswer}{q}$
    \State $v \gets \Call{ExtractValue}{k, a}$
    \State $hh[k] \gets v$
  \EndTry
\EndLoop
\end{algorithmic}
\end{algorithm}
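The loop above can be sketched as runnable Python; the toy `decide` body, feature names, and scripted answers here are illustrative stand-ins for the LLM-generated code and live user interaction:

```python
from typing import Union, List

def decide(hh: dict) -> Union[bool, List[bool]]:
    # Stand-in for LLM-synthesized eligibility logic; hh[k] raises
    # KeyError for any feature not yet collected, which drives the dialog.
    return hh["income"] < 30000 and hh["household_size"] >= 2

def run_dialog(answer_fn) -> Union[bool, List[bool]]:
    hh = {}                        # empty feature dict; filled one key per turn
    while True:
        try:
            return decide(hh)      # halt as soon as a decision is computable
        except KeyError as e:
            k = e.args[0]          # the single missing feature exposed by execution
            hh[k] = answer_fn(k)   # targeted question -> parsed, typed value

# Scripted "user" mapping feature keys to typed answers (hypothetical values).
answers = {"income": 25000, "household_size": 3}
print(run_dialog(answers.get))     # True: both eligibility conditions hold
```

Because all branching lives in `decide`, the sequence of questions is fully determined by the code path and the user's prior answers.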
3. Structured Data Extraction and Gap Management
Eligibility logic is defined over a set of feature keys $K$, each associated with a type constraint and, for categorical types, an explicit domain $D_k$.
At each execution of decide(hh), the missing feature set is $M_t = K_{\mathrm{req}} \setminus \mathrm{dom}(hh_t)$, the required keys not yet assigned. A KeyError on a key $k \in M_t$ acts as gap detection, identifying a single unfilled prerequisite at each step and prompting a precisely targeted question to fill it. This iterative process continues until all eligibility constraints are satisfied and a decision can be returned.
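A minimal sketch of the gap computation, assuming the required key set is known (e.g., the literal string keys referenced by the generated decide code); the key names are hypothetical:

```python
# Gap management: M_t = K_req \ dom(hh_t), the required keys not yet filled.
K_req = {"income", "household_size", "has_dependents"}  # illustrative keys

def missing_features(hh: dict) -> set:
    """Features still required before decide(hh) can run to completion."""
    return K_req - hh.keys()

hh_t = {"income": 25000}
print(sorted(missing_features(hh_t)))  # ['has_dependents', 'household_size']
```

In ProADA itself this set is never materialized up front; the KeyError mechanism surfaces one element of it per turn, in the order the generated code happens to need them.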
4. Implementation Details and Hallucination Control
- Code Synthesis is performed by an LLM (GPT-4o), using a prompt schema that enforces literal string keys, prohibits accessor patterns that mask missing keys (e.g., `dict.get()`), and yields decision logic that reliably exposes missing information via exceptions.
- Dialog Modeling is handled by Llama 3.1 70B Instruct (quantized to 4-bit), which generates questions and parses answers. Decoding is constrained to each feature's type and domain $D_k$, ensuring semantic validity.
- Hallucination Mitigation is achieved by (1) centralizing decision logic in verifiable code, (2) restricting user value parsing via constrained decoding, and (3) verifying correctness and type-consistency at every step through code execution.
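The constrained parsing step can be sketched as follows; the schema, key names, and normalization rules are illustrative assumptions, not the paper's exact implementation:

```python
# Sketch of constrained value parsing: a raw answer is accepted only if it
# normalizes to the feature's declared type and domain (names illustrative).
SCHEMA = {
    "household_size": (int, range(1, 21)),   # (type, allowed domain)
    "has_dependents": (bool, {True, False}),
}

def extract_value(key: str, raw: str):
    typ, domain = SCHEMA[key]
    if typ is bool:
        v = raw.strip().lower() in {"yes", "true", "y"}
    else:
        v = typ(raw.strip())                 # may raise ValueError -> re-ask
    if v not in domain:
        raise ValueError(f"{v!r} outside domain for {key}")
    return v

print(extract_value("household_size", " 3 "))  # 3
print(extract_value("has_dependents", "yes"))  # True
```

Rejecting out-of-domain values at parse time means the generated code only ever executes over well-typed inputs, which is what makes its traces verifiable.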
This architecture ensures that all execution traces are explainable, auditable, and deterministic, as the sequence of queries and predictions is a direct consequence of the synthesized code logic and user-supplied feature values.
5. Benchmark Evaluation and Quantitative Results
ProADA’s evaluation was conducted on the BeNYfits benchmark, which comprises 82 public-benefit programs sourced from NYC Open Data, encapsulating complex eligibility logic (1–18 rules per program, mean 4.66). Two test splits were constructed:
- Diverse Dataset: 305 user–program pairs across 56 synthetic households, covering all code traces.
- Representative Dataset: 246 pairs from 25 households sampled to match demographic distributions.
Dialog protocol allowed up to 20 questions per program (100 total). Key metrics included micro-averaged $F_1$, average dialog turns, and a turn-weighted variant of $F_1$.
Core results:
| Method | $F_1$ (%) | Turns |
|---|---|---|
| GPT-4o + ReAct | 35.7 | 15.8 |
| ProADA (GPT-4o) | 55.6 | 16.5 |
ProADA delivered a +19.9 percentage point improvement at effectively the same average dialog length (Toles et al., 26 Feb 2025).
6. Error Modes, Trade-Offs, and Future Work
Performance analysis attributes ProADA’s gains to offloading logical dependencies into explicit code, thereby avoiding lost-in-the-middle errors and LLM hallucinations. Constrained value parsing further improves data precision and recall.
Principal failure modes include:
- Code-generation faults due to ambiguous or edge-case eligibility language, resulting in incomplete or incorrect branching.
- Dialog module misalignment when feature keys do not map cleanly to user-understandable questions, especially for indexed attributes (e.g., per-household-member fields).
- Strict adherence to the generated code path, which limits fallback behaviors (e.g., handling “I don’t know” responses), since no recovery routines exist for underspecified inputs.
Trade-offs include modest additional latency for code generation and higher up-front engineering but yield stronger interpretability and auditability than black-box LLM approaches.
Planned enhancements target hierarchical questioning, uncertainty integration, partial credit for ambiguous specification mining, and live user studies on fairness metrics.
ProADA establishes a foundational paradigm wherein program synthesis and dialog modeling are fused to enable accurate, efficient, and auditable interactive decision agents, setting a new empirical benchmark for complex eligibility tasks in multi-program screening applications (Toles et al., 26 Feb 2025).