Yes-No Bundle Outcomes
- Yes-No Bundle outcomes are systematic aggregations of binary yes/no responses that resolve ambiguity and support nuanced semantic labeling.
- They are applied in crowdsourcing, reward modeling, and multimodal evaluation, leveraging specialized annotation schemas like five-class typologies.
- Hierarchical Bayesian models and adaptive querying enable efficient aggregation with minimal queries while addressing noise and model laziness.
A Yes-No Bundle (YNB) outcome refers to the systematic interpretation, aggregation, or modeling of responses to yes-no style queries. YNBs arise in domains including crowdsourcing for multi-class labeling, reward modeling in reinforcement learning, social media question answering, and robust evaluation of multimodal LLMs (MLLMs). The YNB paradigm leverages the efficient binary nature of yes/no decisions but addresses the semantic, statistical, and operational challenges posed by indirect, noisy, or bundled yes/no responses.
1. Core Definitions and Typologies of Yes-No Bundle Outcomes
YNB outcomes encode the result of evaluating a collection (bundle) of yes-no queries, typically to resolve ambiguity, improve aggregation fidelity, or support nuanced downstream decisions. In social/user-generated text, bundles categorize answers as one of several semantic classes—for example, the five-class schema used by Mathur et al. (Mathur et al., 2023): yes (y), no (n), probably yes (py), probably no (pn), and unknown (uk). In crowdsourcing (Cabrera et al. (Saldias et al., 2019)), YNB outcomes refer to the probabilistic inference of an underlying multiclass label via a pattern of binary responses over candidate classes. In reward modeling (ZYN (Gallego, 2023)), YNBs are obtained by querying an LLM critic with multiple yes-no questions and aggregating the scalar outputs. In MLLM evaluation (LazyBench (Zhao et al., 15 Oct 2024)), a bundle tracks model consistency across yes/no, open-ended, and multiple-choice queries on the same visual content.
The diversity of YNB contexts necessitates outcome definitions that are semantically rich, robust to indirectness, and compatible with statistical modeling frameworks.
2. Annotation Schemas and Linguistic Phenomena
The integrity of YNB outcomes in user-generated and conversational text heavily depends on annotation schema. Mathur et al. (Mathur et al., 2023) establish a five-way typology:
| Label | Description | Example |
|---|---|---|
| yes (y) | unreserved affirmative | Q: Do you like X? A: X is delicious. |
| no (n) | unreserved negative | Q: Do you read in bars? A: Nobody reads there. |
| probably yes (py) | hedged/conditional yes | Q: Do you bring your phone? A: Depends how long... |
| probably no (pn) | hedged/conditional no | Q: Is it bad to brake left? A: Only if not a racer. |
| unknown (uk) | unresponsive/irrelevant | Q: Do you turn on the oven? A: Haven't bought one yet. |
Annotator agreement (Cohen’s κ = 0.68) reflects the nontriviality of labeling, particularly due to sarcasm, indirection, and lexical distractors (e.g., negative words in affirming answers).
Linguistic analysis shows only 13.4% of yes-answers contain explicit positive keywords, while 11.98% paradoxically contain negative keywords. Yes answers tend to be longer and richer in numbers; no answers have more negation and negative sentiment. Unknowns are shorter, more varied in verb class, and less likely to contain negation. Emojis correlate strongly with yes.
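A minimal sketch of how the five-way schema might be encoded for downstream aggregation follows; the collapse to a binary answer plus a hedging flag is an illustrative assumption, not part of the original annotation guidelines.

```python
from enum import Enum

class YNBLabel(str, Enum):
    YES = "y"
    NO = "n"
    PROBABLY_YES = "py"
    PROBABLY_NO = "pn"
    UNKNOWN = "uk"

def to_binary(label: YNBLabel):
    """Collapse a fine-grained label to (answer, hedged); UNKNOWN maps to (None, True)."""
    return {
        YNBLabel.YES: (True, False),
        YNBLabel.NO: (False, False),
        YNBLabel.PROBABLY_YES: (True, True),
        YNBLabel.PROBABLY_NO: (False, True),
        YNBLabel.UNKNOWN: (None, True),
    }[label]

print(to_binary(YNBLabel.PROBABLY_NO))  # (False, True)
```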
3. Statistical and Probabilistic Modeling Approaches
YNB aggregation in multi-class crowdsourcing scenarios is formalized via hierarchical Bayesian models (Saldias et al., 2019). Let $i$ index objects, $k \in \{1,\dots,K\}$ classes, and $a$ labelers. Each labeler responds to a subset of (object, class) pairs via yes/no. The model posits:
- $z_i$: latent class of object $i$
- $\pi$: class prior over the $K$ classes, with $z_i \sim \mathrm{Categorical}(\pi)$
- $\theta_{a,k,k'}$: labeler $a$'s accuracy on the query "Is $i$ of class $k$? YES/NO" when the gold class is $k'$
- $r_{i,a,k} \in \{\text{yes},\text{no}\}$: observed response
The resulting posterior over $(z, \pi, \theta)$ is not available in closed form due to the coupling of $z$ and $\theta$, necessitating approximate inference (MCMC with NUTS, or Black Box Variational Inference). Posterior class probabilities are then aggregated per object, with empirical evidence showing that only 2–4 yes/no questions per object per labeler suffice to match traditional full-class queries (>90% accuracy) (Saldias et al., 2019). This formulation supports uncertainty estimation and explicit modeling of annotator confusion matrices.
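As a concrete, simplified illustration—assuming a single scalar accuracy per labeler rather than a full confusion matrix, and treating $\pi$ and $\theta$ as known rather than inferred—the per-object posterior over classes can be computed by direct enumeration:

```python
import numpy as np

def class_posterior(responses, pi, theta):
    """Posterior over an object's latent class given bundled yes/no answers.

    responses: list of (labeler a, queried class c, answer) with answer 1=yes, 0=no.
    pi:        class prior, shape (K,).
    theta:     per-labeler probability of answering correctly, shape (A,).
    """
    K = len(pi)
    log_post = np.log(np.asarray(pi, dtype=float))
    for a, c, ans in responses:
        for z in range(K):
            gold_yes = (z == c)                        # gold answer to "Is it class c?"
            p_yes = theta[a] if gold_yes else 1.0 - theta[a]
            log_post[z] += np.log(p_yes if ans else 1.0 - p_yes)
    post = np.exp(log_post - log_post.max())
    return post / post.sum()

# Two labelers, four classes, three yes/no queries about one object.
pi = np.full(4, 0.25)
theta = np.array([0.9, 0.7])                           # labeler accuracies
responses = [(0, 2, 1), (1, 2, 1), (0, 0, 0)]          # two "yes" for class 2, one "no" for class 0
print(class_posterior(responses, pi, theta))           # mass concentrates on class 2
```

In the full hierarchical model, $\pi$ and $\theta$ carry priors and are inferred jointly with $z$, which is why NUTS or black-box variational inference is required.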
4. Bundled Reward Modeling and Zero-Shot Preference Alignment
YNB rewards can be constructed systematically in RLHF and related regimes by bundling multiple yes-no queries. The ZYN framework (Gallego, 2023) evaluates an output $o$ with yes/no questions $q_1,\dots,q_B$:

$$r(o) = \sum_{b=1}^{B} w_b \, r_{q_b}(o),$$

with $r_{q_b}(o)$ the reward from critic LM logits, and $w_b \ge 0$, $\sum_b w_b = 1$.
Each yes/no prompt takes the form:
```
Text: {o}
Question: {q}
Response:
```
The critic's next-token probability of "yes" (versus "no") at the Response slot then supplies the scalar $r_{q}(o)$.
YNB-rewarded RL can optimize for bundles of attributes—e.g., to steer sentiment, reduce toxicity, adjust theme (text-to-image prompt generation), or bias model opinion—all via modular yes/no scoring. The method performs well across use cases: sentiment reward scores shift from 0.34 (original) to 3.20 (ZYN-PPO) in GPT-2, and toxicity as measured by downstream hate-speech classifiers is reduced to 0.05–0.03 (Gallego, 2023).
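A rough sketch of such a bundled reward, assuming a Hugging Face causal LM as the critic and the prompt template above; the model name, questions, and weights are placeholders, and the actual ZYN prompts and critic may differ.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"                                    # stand-in critic LM
tok = AutoTokenizer.from_pretrained(model_name)
critic = AutoModelForCausalLM.from_pretrained(model_name).eval()

YES_ID = tok.encode(" yes")[0]
NO_ID = tok.encode(" no")[0]

@torch.no_grad()
def yes_prob(text: str, question: str) -> float:
    """Probability of 'yes' vs 'no' as the next token after the yes/no prompt."""
    prompt = f"Text: {text}\nQuestion: {question}\nResponse:"
    ids = tok(prompt, return_tensors="pt").input_ids
    logits = critic(ids).logits[0, -1]
    pair = torch.softmax(logits[[YES_ID, NO_ID]], dim=-1)
    return pair[0].item()

def bundle_reward(text: str, questions, weights) -> float:
    assert abs(sum(weights) - 1.0) < 1e-6              # convex combination of critics
    return sum(w * yes_prob(text, q) for w, q in zip(weights, questions))

questions = ["Is the text positive in sentiment?", "Is the text free of toxic language?"]
print(bundle_reward("What a wonderful day!", questions, [0.5, 0.5]))
```

Each question acts as an independent critic, so attributes can be added or re-weighted without retraining a reward model.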
5. Robustness and Pitfalls: Error Patterns and Model Laziness
YNB outcomes, especially in user text and MLLMs, display failure modes not evident from classical evaluation. Error analysis of social media answer bundles (Mathur et al., 2023) shows one-third of mistakes are due to lexical distractors, 21% stem from social media slang, and 13% from humor/irony.
In vision-language foundation models (LazyBench (Zhao et al., 15 Oct 2024)), “model laziness” is the dominant failure mode: models that correctly generate a description frequently err on the corresponding yes/no query. The lazy rate is formally

$$\text{LazyRate} = \frac{\#\{\text{samples with wrong yes/no answer but correct description}\}}{\#\{\text{samples with wrong yes/no answer}\}},$$

with rates of 62–75% on flagship MLLMs (GPT-4o, Gemini-1.5-pro, Claude 3, LLaVA-1.5-13B). The paired-task performance gap averages 22.01%. In VQA-v2, 41% of yes/no errors by LLaVA-1.5-13B are due to laziness, not visual ambiguity.
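A toy illustration of this rate, assuming paired per-sample records of yes/no correctness and description correctness (the record format is hypothetical):

```python
records = [
    {"yesno_correct": False, "description_correct": True},   # lazy error
    {"yesno_correct": False, "description_correct": False},  # genuine failure
    {"yesno_correct": True,  "description_correct": True},
]
yesno_errors = [r for r in records if not r["yesno_correct"]]
lazy_errors = [r for r in yesno_errors if r["description_correct"]]
lazy_rate = len(lazy_errors) / len(yesno_errors)
print(f"lazy rate: {lazy_rate:.2f}")                          # 0.50 on this toy sample
```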
6. Best Practices and Design Implications for YNB Systems
Empirical findings across domains suggest stable design principles for dependable YNB outcomes:
- Annotation: Employ multi-way outcome schemas for human/colloquial data, with explicit tags for hedging and unknown/irrelevant replies. Train annotators on sarcasm, indirectness, and keyword distractors (Mathur et al., 2023).
- Model Inputs: Always bundle the original question and candidate answer as input for text models. For reward modeling, use bundles of critics/questions and aggregate via convex combinations (Gallego, 2023).
- Blending and Transfer: Blend target YNB data with large, out-of-domain corpora; use schedule-based down-weighting of noisy data during training. Cross-domain and cross-lingual transfer benefit from explicit blending (Mathur et al., 2023, Wang et al., 2023).
- Explanation/CoT Prompting: For MLLMs, chain-of-thought (CoT) prompting (“describe, then answer yes/no”) remedies up to 24.8 point deficits in yes/no task accuracy and corrects 36–44% of lazy cases (Zhao et al., 15 Oct 2024).
- Consensus Checking: Validate YNB with consistent, multi-format bundles—freeform, yes/no, and short-answer responses on the same subject (see the sketch after this list).
- Adaptive Querying: Allocate yes/no queries adaptively based on inferred annotator reliabilities, minimizing effort in crowdsourcing or RL feedback (Saldias et al., 2019).
- Commonsense and Multimodal Signals: Incorporate knowledge bases and external modalities to cope with implicit, indirect, or visual reasoning (Mathur et al., 2023, Wang et al., 2023).
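A minimal sketch of the consensus-checking idea, assuming each item has already been scored in three formats and reduced to a binary verdict (the field names are illustrative):

```python
def consistent(item: dict) -> bool:
    """Flag items whose yes/no, freeform, and short-answer verdicts disagree."""
    verdicts = {item["yesno"], item["freeform"], item["short_answer"]}
    return len(verdicts) == 1

bundle = [
    {"id": 1, "yesno": "yes", "freeform": "yes", "short_answer": "yes"},
    {"id": 2, "yesno": "no",  "freeform": "yes", "short_answer": "yes"},  # suspected laziness
]
flagged = [item["id"] for item in bundle if not consistent(item)]
print(flagged)   # [2] -> route to re-annotation or a CoT re-query
```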
The YNB paradigm enables data- and label-efficient inference, robust aggregation in the face of annotation variance, interpretable reward construction, and richer evaluation of system reliability.
7. Extensions and Open Issues
YNB methods are amenable to future work in domain adaptation (continual blending), richer multimodal integration, and reinforcement of cross-modal outcome consistency as an explicit training objective (Mathur et al., 2023, Zhao et al., 15 Oct 2024). Challenges persist in edge-case disambiguation (e.g., indirect/sarcastic user answers), robust recovery of multiclass labels from minimal binary data, and mitigation of model laziness. For non-English and low-resource settings, rule-driven distant supervision and truly multilingual architectural blending remain key (Wang et al., 2023).
A plausible implication is that Yes-No Bundles, when paired with principled annotation, model architectures, and consistency checks, furnish a compact yet high-recall approach for semantic labeling, reward modeling, and model evaluation in settings where ambiguity, indirectness, and label sparsity are prevalent.