
BeNYfits Benchmark for Eligibility Dialog Agents

Updated 18 December 2025
  • BeNYfits is a benchmark that defines interactive dialogs for evaluating eligibility across 82 NYC public-benefit programs.
  • It replaces static form-based screening with a dynamic, multi-turn conversational task and scores agents on both prediction accuracy and dialog length.
  • Methods such as ProADA use program synthesis to improve performance, while error analysis highlights remaining challenges such as logical errors and dialog inefficiencies.

BeNYfits is a large-scale benchmark for evaluating interactive decision-making dialog agents in the domain of user eligibility determination for multiple overlapping social benefits programs. Developed to address the limitations of static form-based methods and the complexity inherent to legal and public benefits regulations, BeNYfits operationalizes eligibility as a dynamic, multi-program conversational task, setting precise standards for accuracy and efficiency in real-world agent deployments (Toles et al., 26 Feb 2025).

1. Task Formulation and Motivational Context

BeNYfits targets eligibility problems where definitive answers (typically binary choices) depend on user-specific attributes and where such rules are documented in natural language. The benchmark is motivated by the challenges of scaling hard-coded systems (e.g., static dialog trees or forms) to domains characterized by frequent policy changes and overlapping requirements. Within BeNYfits, agents interactively query users regarding eligibility criteria for a given subset of 82 New York City public-benefit programs, which span tax credits, housing, nutrition, healthcare, and related categories.

At each dialog turn, the agent may pose a natural-language question or choose to terminate and make predictions (the “Ready?” protocol). The primary objectives are twofold:

  1. Maximize micro-F1 on multi-program eligibility prediction.
  2. Minimize the number of dialog turns required to reach decision confidence.

2. Dataset Specification and Construction

BeNYfits comprises two rigorously defined splits, both constructed from NYC Open Data sources:

  • Diverse Dataset: Covers all unique code traces by sampling 56 synthetic households, yielding 305 user-opportunity instances. Households query between 1 and 10 programs each, selected through input fuzzing and greedy coverage techniques (a sketch of the coverage step follows this list).
  • Representative Dataset: Reflects realistic demographics, sampling 25 households from population statistics (NYC/U.S.), each querying 6 to 19 programs (mean: 9.8).
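
The greedy coverage step can be pictured with the minimal sketch below. This is not the paper's implementation; the function and variable names (candidate_households, trace_fn, target_traces) are illustrative assumptions.

```python
# Illustrative sketch (not the paper's code) of greedy coverage selection:
# from a pool of fuzzed candidate households, repeatedly pick the one whose
# eligibility-checker code trace adds the most uncovered branches.
def greedy_coverage_select(candidate_households, trace_fn, target_traces):
    """candidate_households: list of structured household profiles
    trace_fn(hh): set of code-trace identifiers exercised by household hh
    target_traces: set of all code-trace identifiers to cover"""
    selected, covered = [], set()
    pool = list(candidate_households)
    while pool and covered < target_traces:
        # Pick the household that adds the most new trace coverage.
        best = max(pool, key=lambda hh: len(trace_fn(hh) - covered))
        gain = trace_fn(best) - covered
        if not gain:  # no remaining candidate adds coverage
            break
        selected.append(best)
        covered |= gain
        pool.remove(best)
    return selected
```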

Each program (“opportunity”) is characterized by 1 to 18 user/household features (mean: 4.66, std: 3.56). Feature prevalence is long-tailed; for example, “age” appears in 53 programs, and per-feature reuse spans 1 to 52 opportunities (mean: 3.25, std: 6.66). Households consist of up to 6 members, sampled over variables such as age, income, and foster care status, and are subjected to logical constraints (e.g., excluding impossible familial relations).

User simulation is performed with Llama 3.1 70B, which generates natural-language responses from ground-truth structured profiles. Eligibility labels are determined by hand-written Python “eligibility checker” scripts, one per program's requirements, which map structured inputs to deterministic Boolean labels and whose code-trace coverage is tracked during dataset construction.
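
As an illustration of what such a checker looks like, consider the hypothetical example below. The rules, thresholds, and field names are invented; each real checker encodes one program's documented requirements.

```python
# Hypothetical eligibility checker in the style described above: a pure,
# deterministic function from a structured household profile to a Boolean
# label. All rules, thresholds, and field names here are invented.
def check_example_program(household: dict) -> bool:
    members = household["members"]              # list of person dicts
    annual_income = household["annual_income"]  # total household income, USD
    # Invented rule 1: at least one member aged 60 or older.
    has_senior = any(person["age"] >= 60 for person in members)
    # Invented rule 2: income below a household-size-dependent limit.
    income_limit = 20000 + 8000 * len(members)
    return has_senior and annual_income < income_limit
```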

3. Interactive Agent Protocol

At initialization, agents are privy only to plain-English eligibility descriptions for the queried programs. The dialog loop proceeds as follows:

  • The agent issues a single question per turn.
  • Following the simulated user’s answer, the agent is prompted “Ready? True/False.”
  • If False, the process repeats (up to 20 questions per program, 100 total). If True, the agent outputs an 82-dimensional Boolean eligibility vector.
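
A schematic of this loop is sketched below. The `agent` and `user_sim` objects and their method names are placeholder assumptions standing in for a dialog agent and the Llama 3.1 70B user simulator, not the benchmark's actual API.

```python
# Schematic of the interaction protocol above; interfaces are assumptions.
MAX_TOTAL_TURNS = 100

def run_dialog(agent, user_sim, program_descriptions):
    history = []
    for _ in range(MAX_TOTAL_TURNS):
        question = agent.ask(program_descriptions, history)  # one question per turn
        answer = user_sim.respond(question)                  # natural-language reply
        history.append((question, answer))
        if agent.ready(program_descriptions, history):       # "Ready? True/False"
            break
    # 82-dimensional Boolean eligibility prediction plus the turn count
    return agent.predict(program_descriptions, history), len(history)
```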

A distinctive agent design built on this protocol is ProADA. ProADA generates, via code synthesis, a Python Decide function for each opportunity and operates iteratively (see the sketch after this list):

  • Invokes Decide with an initially empty feature dictionary (“hh”).
  • Handles missing feature exceptions (KeyError), prompting targeted questions.
  • Parses the user response, updates “hh,” and re-invokes Decide, repeating until a decision is reached.
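
A minimal sketch of this loop is given below, assuming the synthesized Decide function raises KeyError when a needed feature is absent; ask_about and parse_answer are hypothetical stand-ins for the LLM-backed dialog module.

```python
# Minimal sketch of ProADA's outer loop around a synthesized Decide function.
# Assumes Decide raises KeyError when a needed feature is missing from `hh`;
# `ask_about` and `parse_answer` are hypothetical dialog-module wrappers.
def proada_decide_loop(decide_fn, ask_about, parse_answer, max_turns=20):
    hh = {}  # feature dictionary, initially empty
    for _ in range(max_turns):
        try:
            return decide_fn(hh)  # Boolean eligibility once features suffice
        except KeyError as missing:
            feature = missing.args[0]                    # name of the missing feature
            answer = ask_about(feature)                  # targeted natural-language question
            hh[feature] = parse_answer(feature, answer)  # map reply to a structured value
    return False  # turn budget exhausted; fall back to a default prediction
```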

4. Evaluation Metrics

BeNYfits employs class-imbalance-aware micro-metrics:

  • Precision: $\mathrm{Precision} = \mathrm{TP}/(\mathrm{TP}+\mathrm{FP})$
  • Recall: $\mathrm{Recall} = \mathrm{TP}/(\mathrm{TP}+\mathrm{FN})$
  • Micro-F1: $\mathrm{F1} = 2 \times \frac{\mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$

To jointly penalize excessive dialog length, a turn-weighted F1 metric is defined as $\text{Turn-Weighted F1} = \frac{100 \cdot \mathrm{F1}}{T/100 + 1}$, where $T$ denotes total dialog turns (capped at 100).
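
These definitions translate directly into code. The sketch below assumes F1 is computed as a fraction in [0, 1] before the factor-of-100 scaling, which reproduces the reported numbers (e.g., ProADA with GPT-4o: F1 = 0.556 at 16.5 turns gives roughly 47.7).

```python
# Direct transcription of the metrics above. F1 is a fraction in [0, 1];
# turn-weighted F1 is on a 0-100 scale, with total turns T capped at 100.
def micro_f1(tp: int, fp: int, fn: int) -> float:
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def turn_weighted_f1(f1: float, total_turns: float) -> float:
    t = min(total_turns, 100)
    return 100 * f1 / (t / 100 + 1)

# Sanity check against the ProADA + GPT-4o row (F1 = 0.556, 16.5 mean turns):
print(round(turn_weighted_f1(0.556, 16.5), 1))  # ≈ 47.7
```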

5. Methodological Baselines and Empirical Results

The benchmark compares three agent methodologies:

  • Direct Prompting: Standard LLM prompting, no explicit reasoning-chain.
  • ReAct Agent: ReAct-style chain-of-thought (CoT) with action annotation.
  • ProADA: Program synthesis for eligibility function generation, coupled with dialog module for feature extraction.

Tested LLMs include Llama 3.1 Instruct 8B, Llama 3.1 70B, and GPT-4o (all for dialog; ProADA’s synthesis consistently uses GPT-4o). Performance across both datasets is summarized below:

Method                      Model           F1    Turns  Turn-Weighted F1
Direct Prompting            Llama 3.1 8B    33.6   6.0   31.7
Direct Prompting            Llama 3.1 70B   40.8  61.7   25.2
Direct Prompting            GPT-4o          38.1  27.2   29.9
ReAct Agent                 Llama 3.1 8B    33.2  10.2   30.2
ReAct Agent                 Llama 3.1 70B   42.2  23.7   34.1
ReAct Agent                 GPT-4o          35.7  15.8   30.8
ProADA (program synthesis)  Llama 3.1 8B    46.2  20.1   38.5
ProADA (program synthesis)  Llama 3.1 70B   51.8  19.0   43.5
ProADA (program synthesis)  GPT-4o          55.6  16.5   47.7

ProADA with GPT-4o delivers an absolute F1 of 55.6 and turn-weighted F1 of 47.7, representing over 30% relative improvement against the strongest baseline (Llama 70B+ReAct at 42.2 F1), while realizing greater dialog efficiency.

6. Identified Failure Modes and Agent Limitations

Baseline agents exhibit several deficiencies:

  • Hallucinations, yielding premature “Ready=True” decisions or fabricating contextually inappropriate facts (e.g., inferring children where none exist).
  • Over-specific inquiries, resulting in suboptimal questioning (e.g., unnecessary numeric thresholds).
  • Repetitive or looping dialog behavior with minimal semantic variation.
  • Faulty household modeling, neglecting household member coverage or conflating individual attributes.

ProADA introduces new error sources:

  • Logical errors in code synthesis propagate to dialog inefficiencies or incorrect eligibility computation.
  • Challenges in translating variable references (e.g., hh[2]) to precise, comprehensible questions.
  • Lack of inherent recovery mechanisms for unanticipated or incomplete user responses (e.g., “I don’t know”).

7. Benchmark Significance and Research Impact

BeNYfits is the first large-scale benchmark for multi-opportunity public-benefit eligibility framed explicitly as an interactive dialog task. It captures the complexity of overlapping, long-tailed eligibility requirements and evaluates agents on both predictive accuracy and user-experience cost (turn count), operationalized jointly through the novel turn-weighted F1 metric.

By introducing the ProADA framework—mapping dialog planning and data acquisition onto program synthesis—BeNYfits demonstrates an effective approach for structuring robust, inspectable dialog agent architectures. The benchmark establishes foundational standards for tool-augmented LLM agents in high-stakes, user-facing decision settings and invites future work on scalable, transparent eligibility assessment paradigms (Toles et al., 26 Feb 2025).
