Learning to Answer from Correct Demonstrations (2510.15464v1)
Abstract: We study the problem of learning to generate an answer (or completion) to a question (or prompt), where there could be multiple correct answers, any one of which is acceptable at test time. Learning is based on demonstrations of some correct answer to each training question, as in Supervised Fine Tuning (SFT). We formalize the problem as offline imitation learning in contextual bandits, with demonstrations from some optimal policy, without explicitly observed rewards. Prior work assumes that the demonstrator belongs to a low-complexity policy class, which motivates maximum likelihood estimation (i.e., log-loss minimization). In contrast, we propose relying only on the reward model (specifying which answers are correct) being in a low-cardinality class, which we argue is a weaker assumption. We show that likelihood maximization methods can fail in this case, and instead devise an alternative novel approach that learns with sample complexity logarithmic in the cardinality of the reward class. Our work motivates looking beyond likelihood maximization when learning from correct demonstrations.
Explain it Like I'm 14
Overview
This paper asks a simple question: if many different answers to a question can be equally correct, how should we train a model so that it gives any one correct answer, not necessarily the same answer or style we saw in training? The authors show that the usual way people train models (by copying the exact style or distribution of answers they see) can fail in this setting. They propose a new way to learn that focuses on what makes an answer correct, rather than how a particular demonstrator writes it, and they prove it needs surprisingly few examples.
What questions does the paper try to answer?
- When there are lots of acceptable answers, is it better to learn “how the teacher writes answers” or “what makes an answer correct”?
- Does the standard training method (maximum likelihood/log-loss, often used in supervised fine-tuning for LLMs) work well when there are many correct answers?
- Can we build a learner that focuses on correctness and needs only a small number of training examples?
- How does this change if we allow the model to propose several answers (like the popular pass@k metric), or if the reward isn’t just yes/no but a score between 0 and 1?
- What if the demonstrator who produced the training answers is not perfectly optimal?
How did they approach the problem?
Think about Q&A as a game:
- The “context” is the question.
- An “action” is the answer the model gives.
- The “reward” is 1 if the answer is acceptable and 0 if not.
In training, we are shown questions paired with one correct answer (a demonstration). We are not told the reward rules; we only see examples of correct answers.
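In symbols, using the notation that appears later on this page (D for the question distribution, σ(x) for the set of acceptable answers, L(π) for the error of a policy π), the setup can be written compactly as below. This is a paraphrase of the formulation, not the paper's exact notation.

```latex
% Offline imitation learning in a contextual bandit with correct demonstrations:
% questions x_i are drawn i.i.d. from an unknown distribution D, each paired with
% some acceptable answer y_i^*; the reward is 1 exactly when the answer is acceptable.
x_1, \dots, x_m \sim \mathcal{D}, \qquad y_i^\star \in \sigma(x_i), \qquad
r^\star(x, y) = \mathbf{1}\{\, y \in \sigma(x) \,\}, \qquad
L(\pi) = \Pr_{x \sim \mathcal{D},\; y \sim \pi(\cdot \mid x)}\big[\, y \notin \sigma(x) \,\big].
```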
Two ways to model what’s going on:
- Policy-first (traditional): Assume the demonstrator (the person or model giving training answers) follows a simple, low-complexity rule for how to write answers. This motivates maximum likelihood (MLE), which tries to match the demonstrator’s answer distribution exactly. In practice, this is log-loss minimization used in supervised fine-tuning.
- Reward-first (this paper): Assume the underlying rules that decide which answers are correct come from a small, manageable set of possibilities (a “low-cardinality reward model class”). We don’t assume anything about how the demonstrator chooses among all correct answers.
The authors show that focusing on reward rules is weaker (and more realistic) than focusing on the demonstrator’s style. When creating high-quality datasets, we ensure answers are correct, but we don’t try to cover every possible correct answer or enforce any particular writing style.
They do three main things:
- Prove that standard MLE can fail in this “many correct answers” scenario, even with very simple reward classes.
- Design a new algorithm that uses a “weighted majority over rules” idea:
- Imagine a small set of candidate “correctness rules” (each rule says, for each question, which answers would be acceptable).
- Keep a weight for each rule.
- For a new question, pick the answer that is supported by the highest total weight of rules.
- After seeing the demonstration (a correct answer), increase the weight of rules that were not satisfied by your prediction but do include the demonstrated answer, and set to zero any rules that contradict the demonstration.
- Convert this online “weight-updating” process into a statistical learner you can train on a dataset (an online-to-batch conversion), and prove strong guarantees; a minimal code sketch of these two steps follows this list.
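To make the recipe above concrete, here is a minimal Python sketch of the weight-updating process and a simple stand-in for the online-to-batch step. The names (WeightedMajorityLearner, ALPHA, answer_space) and the enumeration over a finite candidate-answer set are illustrative assumptions, not the paper's notation, and the exact update constants and conversion procedure in the paper may differ.

```python
# Minimal sketch of the "weighted majority over rules" idea described above.
# All names and constants here are illustrative assumptions, not the paper's.

ALPHA = 2.0  # boost for rules that accept the demonstration but rejected our prediction

class WeightedMajorityLearner:
    def __init__(self, rules, answer_space):
        # rules: candidate correctness rules; each maps a question x to the set of acceptable answers
        # answer_space: callable returning a finite list of candidate answers for a question
        self.rules = list(rules)
        self.answer_space = answer_space
        self.weights = [1.0] * len(self.rules)

    def predict(self, x):
        # Pick the answer supported by the largest total weight of surviving rules.
        def support(y):
            return sum(w for w, rule in zip(self.weights, self.rules) if y in rule(x))
        return max(self.answer_space(x), key=support)

    def update(self, x, y_pred, y_demo):
        # After observing a correct demonstration y_demo for question x:
        for i, rule in enumerate(self.rules):
            acceptable = rule(x)
            if y_demo not in acceptable:
                self.weights[i] = 0.0      # rule contradicts the demonstration: eliminate it
            elif y_pred not in acceptable:
                self.weights[i] *= ALPHA   # rule accepts the demo but rejected our prediction: boost it

def online_to_batch(rules, answer_space, dataset):
    # Run the online learner over the demonstration dataset and return a predictor.
    # Returning the final learner is a simplification; the paper's online-to-batch
    # conversion may select among intermediate predictors instead.
    learner = WeightedMajorityLearner(rules, answer_space)
    for x, y_demo in dataset:
        y_pred = learner.predict(x)
        learner.update(x, y_pred, y_demo)
    return learner
```

For example, rules could be a small catalog of verifiers ("passes the unit tests for template A", "validates against schema B"), and answer_space(x) a list of model-generated candidate answers for question x.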
They also extend the approach to:
- Bounded (non-binary) rewards between 0 and 1.
- pass@k, where the model can propose k answers and it’s “correct” if any one is good.
- Cases where the demonstrator is not necessarily optimal (they show how to still compete with the demonstrator’s error up to a small constant factor).
What did they find, and why does it matter?
Here are the key results, explained simply:
- MLE can fail: If you try to copy the demonstrator’s distribution, you can end up making bad choices on new questions, because copying style doesn’t guarantee correctness when the set of correct answers is huge. They provide concrete failure examples where MLE keeps a high error, even with plenty of training data.
- Their learner succeeds with very few examples: The number of examples needed grows only like the logarithm of the number of possible reward-rule models. “Logarithm” means it grows very slowly; for instance, doubling the number of candidate rules adds only a small constant amount of extra data (this scaling is restated as a formula after this list). Importantly, this does not depend on:
- How many possible answers there are.
- How large the set of correct answers is for each question.
- pass@k improves sample efficiency: If the model can suggest k answers per question, the number of examples needed improves to scale like the logarithm base (k+1) of the number of possible reward-rule models. So letting the model give multiple attempts helps a lot, which matches real-world practice where pass@k is a common metric.
- Works for graded rewards too: If rewards are not just yes/no but any number between 0 and 1, the same idea still works. If you learn to avoid wrong answers (low reward) and aim for the set of best answers (high reward), your expected reward is close to the best possible.
- Handles imperfect demonstrators: If the training answers come from a suboptimal demonstrator, the algorithm can still compete with the demonstrator’s loss within a small constant factor. This is helpful when your dataset isn’t perfect.
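For reference, the scaling described in the bullets above can be written roughly as follows. This is a paraphrase up to constants, and the exact form of the pass@k bound (including where the confidence term enters) is spelled out in the paper.

```latex
% m = number of demonstrations, |S| = number of candidate reward-rule models,
% epsilon = target error, delta = failure probability, k = number of attempts.
m \;=\; O\!\left(\frac{\log|S| + \log(1/\delta)}{\varepsilon}\right)
\qquad \text{and, for pass@}k, \qquad
m \;=\; O\!\left(\frac{\log_{k+1}|S| + \log(1/\delta)}{\varepsilon}\right).
```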
Why it matters: In many real tasks (math solutions, coding, recommendations, writing), there are countless acceptable answers. Training to mimic one person’s style can be misleading. This paper provides a principled way to train models to be correct rather than “to copy,” with strong sample-efficiency guarantees.
What are the broader implications?
- For training LLMs, the paper suggests looking beyond log-loss and distribution matching when the goal is “give any correct answer.” If you care about utility (getting a correct or high-reward answer), focus on correctness rules, not stylistic imitation.
- Dataset design: When curating data, it’s often realistic to ensure answers are correct without trying to cover all possible correct phrasings. This approach leverages that reality and still learns efficiently.
- Practical metrics: Since pass@k is widely used, the paper’s theory explains why allowing multiple attempts can reduce data needs and improve reliability.
- Robustness: The method is more robust to variation in how demonstrators write answers and to imperfect demonstrators.
- Future directions: Combining this correctness-focused learning with other objectives (like safety, diversity, or fairness), and extending value guarantees when demonstrators are suboptimal, are natural next steps.
In short, the paper argues: when there are many ways to be right, train models to be right, not to mimic. They back this up with clear failure cases for traditional training and a new algorithm with tight, provably efficient learning guarantees.
Knowledge Gaps
Knowledge gaps, limitations, and open questions
Below is a concise list of unresolved issues and limitations that emerge from the paper, framed to guide concrete follow-up research.
- Computational tractability of the proposed learner: The algorithm requires maintaining and updating weights over every hypothesis in a finite reward-model class S and computing weighted majorities per context. There is no analysis of runtime, memory, or scalable implementations when |S| or |Y| is large, nor for structured or parametric representations of S.
- Instantiation of low-cardinality reward model classes in realistic LLM settings: The paper assumes a small finite class S of “correctness support” functions, but does not specify how to construct or encode such classes for complex tasks (e.g., code synthesis, math proofs) where correctness is defined via dynamic checks, program execution, or formal verification. Concrete constructions of S for practical domains are missing.
- Infinite or large reward model classes: Results hinge on finite cardinality of S (or R). It remains open how to extend guarantees to infinite classes under complexity measures (e.g., VC dimension, covering numbers, Littlestone dimension), or to derive rates in terms of such measures rather than |S|.
- Lower bounds in the main (single-answer) setting: While the paper proves optimality for the pass@k setting and a lower bound for the majority rule, there is no general minimax lower bound matching O((log|S| + log(1/δ))/ε) for the primary setting beyond heuristics. A sharp lower bound for arbitrary learners under the model-class assumption is missing.
- Conditions under which MLE succeeds with low-cardinality reward classes: The paper shows failure modes for MLE, but does not characterize positive regimes (e.g., structural constraints, coverage conditions, priors, regularization) where likelihood-based training could succeed even when only S is small.
- Bridging to parameterized models and gradient-based training: The proposed learner is not expressed as a differentiable objective amenable to standard training of LLMs (e.g., cross-entropy optimization). A concrete surrogate loss or training procedure that operationalizes “learning from correct demonstrations” without distribution matching is absent.
- Extension to infinite action spaces Y (e.g., token sequences): The analysis relies on existence of optimal policies and finite Y for some guarantees; the paper explicitly notes vacuity in some infinite-action cases. Methods for approximate optimality, discretization schemes, or continuous-action adaptations are not provided.
- Robustness to misspecification of the reward model class: Guarantees assume σ* ∈ S (or r* ∈ R). There is no analysis for the agnostic case when the true support/reward function lies outside the hypothesized class, nor a procedure to adaptively expand or refine S.
- Suboptimal demonstrator and value guarantees: In the agnostic demonstrator setting, the paper only provides loss-competitiveness (with constant blowups) and explicitly notes that reward-value suboptimality (especially for non-binary rewards) is not guaranteed. Achieving value suboptimality under low-cardinality reward classes with suboptimal demonstrations remains open.
- Noisy or incorrect demonstrations: Beyond “suboptimal demonstrator,” the paper does not analyze robustness to label noise (e.g., randomly incorrect demonstrations), adversarial corruption, or partial credit signals, nor provide noise-tolerant weight updates or bounds.
- Multiple demonstrations per context: Training uses single correct demonstrations per context. How sample complexity and constants improve with multiple demonstrations (and how to aggregate them effectively in the learner) is not explored.
- Active or interactive data acquisition: The learner is purely passive (offline). There is no investigation of active learning or querying strategies (e.g., selecting contexts, asking for additional demonstrations) that could reduce sample complexity given access to a small S.
- Distribution shift and domain adaptation: All guarantees assume test contexts drawn from the same distribution as training. The paper does not address robustness or adaptation under covariate shift, which is common in deployed QA systems.
- Generalization of pass@k beyond “max” aggregation: The pass@k extension uses maximal reward among k suggestions. Other aggregation functions (e.g., average, weighted verifiers, coverage/diversity constraints, or verifier-dependent trust) are not analyzed.
- Choosing k and verifier integration: The paper gives theoretical gains with larger k but does not study how to select k under verification costs, how to integrate external verifiers (e.g., unit tests, proof checkers, LLM judges), or how verification accuracy/noise impacts guarantees.
- Structural priors on σ(x): The learner treats each context independently via S. Exploiting structure in σ(x) (e.g., compositionality, grammar constraints, program semantics) to shrink effective hypothesis complexity and accelerate learning is not addressed.
- Practical evaluation and empirical validation: The work is purely theoretical; there are no experiments demonstrating the failure of MLE in realistic settings, the effectiveness of the proposed learner, or performance on standard LLM benchmarks (including pass@k).
- Adaptive tuning of hyperparameters (α, β): Choices like α=2 (realizable) and α=4/3, β=2/3 (agnostic) are given without data-driven or theoretically optimal tuning procedures. Whether adaptive or instance-dependent tuning can improve constants or high-probability bounds is unresolved.
- Sequential/token-level decision processes: Although the paper motivates treating full responses as single actions, many practical tasks are inherently sequential. Extending the learner to MDPs with partial rewards, credit assignment across tokens, and non-deterministic transitions remains open.
Practical Applications
Immediate Applications
The following applications can be deployed now, assuming access to correct demonstrations and basic verifiers that define what “correct” means for each task.
- Supervised fine-tuning of LLMs using reward-focused training instead of log-loss
- Sector: software, education, general AI products
- What to do: Replace likelihood/log-loss minimization with the paper’s weighted-majority learner (via the provided online-to-batch conversion) that optimizes correctness (utility) rather than cloning the demonstrator’s distribution.
- Example workflows:
- Code generation: train on programs that pass unit tests; the reward class is “passes the test suite.”
- Math problem solving: train on solutions verified by a symbolic/automated checker (e.g., Lean, SymPy).
- Structured output tasks: train on answers that validate against a schema/JSON/XML validator.
- Tools/products: “Verifier-first SFT toolkit” that:
- Defines a small set of reward model templates (e.g., test harnesses, validators).
- Implements the online-to-batch weighted-majority learner to produce policies that output any correct answer.
- Assumptions/dependencies: availability of correct demonstrations; small, well-specified reward model classes or verifiers; contexts sampled i.i.d.; ability to check membership in the support set σ(x) for each reward model.
- Pass@k training and inference for tasks with easy verification
- Sector: software, education, writing assistance
- What to do: Use the paper’s pass@k extension to select k diverse candidate answers that maximize coverage of correctness across reward models; verify any one candidate to accept.
- Example workflows:
- Code assistants proposing k implementations; accept if any passes unit tests.
- Proof assistants proposing multiple proof sketches; accept if any verifies in a proof checker.
- Writing assistants offering k drafts that meet compliance/citation checks.
- Tools/products: “Pass@k training harness” that greedily selects k responses by covering the remaining weighted hypotheses; integrates with verifiers/test suites (a Python sketch of this greedy selection appears at the end of this list).
- Assumptions/dependencies: ability to surface multiple outputs; verifiers available; compute budget to generate k candidates; small reward model class.
- Dataset curation focused on correctness instead of stylistic diversity
- Sector: academia, industry LLM teams
- What to do: Curate datasets by ensuring correctness via verifiers/rubrics; stop trying to sample representative “styles” of correct answers.
- Example workflows:
- Label examples with pass/fail against domain-specific validators (tests, schemas, checkers) that define correctness; track which reward template applied.
- Tools/products: “Reward-template annotation schema” for datasets that records which σ(x)/reward model was used to validate each demonstration.
- Assumptions/dependencies: reliable validators; agreement on what constitutes correctness per task; known low-cardinality set of reward templates.
- Evaluation pipelines that prioritize utility (correctness) rather than distribution matching
- Sector: industry benchmarking, academia
- What to do: Evaluate models by loss/value relative to support sets (σ(x)) or bounded rewards r, not by log-likelihood.
- Example workflows:
- Report L(π) = fraction of outputs not in σ(x) over test contexts; or value V(π, r) for bounded rewards.
- Use pass@k metrics with verifiers as primary KPIs on tasks where verification is feasible.
- Tools/products: “Utility-first evaluation harness” compatible with unit tests, formal checkers, schema validators (see the sketch at the end of this list).
- Assumptions/dependencies: verifiers; curated test sets; acceptance that stylistic matching is not required when correctness is the goal.
- Recommendation and retrieval tasks with many acceptable outcomes
- Sector: recommender systems, search
- What to do: Treat tasks as contextual bandits where many items are correct/satisfactory; apply the weighted-majority learner to produce any acceptable item.
- Example workflows:
- Personalized item suggestions where any item in a validated subset (e.g., availability, user constraints) is “correct.”
- Tools/products: policy that selects items from validated support sets rather than matching historical click distributions.
- Assumptions/dependencies: a small set of reward models that define acceptability; ability to verify acceptance (constraints, policies).
- Few-shot/domain adaptation where data is scarce but correctness is verifiable
- Sector: enterprise AI, on-device assistants
- What to do: Exploit O(log|S|) sample complexity by training with small datasets if you can define a small catalog of reward models; optimize for correctness only.
- Example workflows:
- Private on-device fine-tuning for structured tasks (form filling, code fixes) using validators.
- Tools/products: lightweight fine-tuning modules compatible with validators.
- Assumptions/dependencies: small reward model class; reliable verifiers; correct demonstrations.
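As a concrete illustration of the “Pass@k training harness” and the utility-first metrics mentioned above, here is a small Python sketch. The function names (greedy_passk, utility_loss) and the verifier interface are hypothetical; they show one plausible way to implement greedy coverage of weighted hypotheses and verifier-based acceptance, not the paper's prescribed implementation.

```python
# Illustrative sketches for the pass@k harness and utility-first evaluation above.
# Function names and the verifier interface are hypothetical.

def greedy_passk(x, rules, weights, candidates, k):
    """Greedily pick up to k candidate answers so that, together, they are acceptable
    under as much total hypothesis weight as possible (a rule is 'covered' once at
    least one picked answer lies in its acceptable set)."""
    covered = [False] * len(rules)
    picks = []
    remaining = list(candidates)
    for _ in range(k):
        if not remaining:
            break
        def gain(y):
            # Weight of still-uncovered rules that would accept candidate y.
            return sum(w for w, rule, done in zip(weights, rules, covered)
                       if not done and y in rule(x))
        best = max(remaining, key=gain)
        picks.append(best)
        remaining.remove(best)
        for i, rule in enumerate(rules):
            if best in rule(x):
                covered[i] = True
    return picks

def utility_loss(policy, verifier, test_contexts, k=1):
    """Utility-first evaluation: the fraction of test contexts where none of the
    k proposed answers is accepted by the verifier (i.e., 1 - pass@k)."""
    failures = 0
    for x in test_contexts:
        answers = policy(x, k)  # policy proposes up to k candidate answers for context x
        if not any(verifier(x, y) for y in answers):
            failures += 1
    return failures / len(test_contexts)
```

In the code-generation example above, verifier(x, y) would simply run the unit tests for prompt x on candidate program y.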
Long-Term Applications
These applications require further research, scaling, and development of tooling, verifiers, or theory.
- Broad catalogs of domain-specific reward models and verifiers
- Sector: healthcare, finance, legal, robotics, education
- What to do: Build small cardinality libraries of correctness templates and robust verifiers that define acceptable outputs for common tasks (e.g., clinical coding checks, regulatory compliance tests, contract clause validation, simulator-based task completion).
- Potential tools/products:
- Domain “reward packs” (rule engines, formal methods, test harnesses).
- Cross-domain verifier registries and APIs.
- Assumptions/dependencies: agreement on correctness criteria; reliable automated verification; standardization across organizations.
- Integrating verifier-first training into RLHF/RLAIF pipelines
- Sector: AI training infrastructure
- What to do: Combine automated verifiers with human preferences to move beyond distribution matching; prioritize correctness utility, use pass@k where verification is efficient.
- Potential tools/products:
- Hybrid pipelines that use reward-template learners for correctness, and preferences for quality trade-offs (style, helpfulness).
- Assumptions/dependencies: scalable verifiers; careful aggregation of human preferences without reverting to pure log-loss cloning.
- Value-optimal learning from suboptimal demonstrators (general bounded rewards)
- Sector: academia (learning theory), core ML research
- What to do: Address the open problem highlighted by the paper: achieve value suboptimality guarantees (not just loss vs σ) when demonstrations aren’t optimal.
- Potential workflows:
- New algorithms that bridge reward-based learning and off-policy evaluation in contextual bandits with multiple optimal actions.
- Assumptions/dependencies: theoretical advances; possibly richer feedback signals or confidence-calibrated verifiers.
- Extending the approach to token-level/sequential decisions in generation
- Sector: LLM research, robotics planning
- What to do: Generalize from “response as single action” to multi-step settings; ensure correctness reward at sequence end without exploding |Y| dependence.
- Potential tools/products:
- Sequence-level verifiers (e.g., compilers, end-to-end simulators) with segment-level hints for training.
- Assumptions/dependencies: scalable sparse-reward learners; reliable end-of-sequence verification; stability of online-to-batch methods in sequential settings.
- Regulatory and policy frameworks prioritizing utility-first evaluation
- Sector: policy, compliance, public-sector AI procurement
- What to do: Standardize on correctness-first evaluation (utility) with documented reward models; de-emphasize training that clones demonstrator distribution.
- Potential tools/products:
- Audit standards that certify verifiers and reward model catalogs; procurement checklists that require utility-based KPIs (e.g., pass@k under certified verifiers).
- Assumptions/dependencies: consensus on auditing methods; transparency in reward model definitions; verifiers that meet regulatory robustness standards.
- Fairness and bias mitigation via “correctness without style imitation”
- Sector: policy, education, content moderation
- What to do: Reduce bias induced by mimicking particular annotator styles by focusing on correctness sets; develop auditing tools that measure the shift from distribution matching to reward-based utility.
- Potential tools/products:
- Bias audits that compare performance under utility-based training vs log-loss cloning on diverse populations or writing styles.
- Assumptions/dependencies: fair and inclusive correctness definitions; robust verifiers that do not encode stylistic bias.
- Multi-modal and embodied AI: correctness-first training with simulators
- Sector: robotics, autonomous systems, graphics
- What to do: Define reward templates where “correct” means successful task completion in simulation; adopt pass@k trajectories/plans with verification.
- Potential tools/products:
- Simulator-integrated training harnesses; trajectory verifiers; diversity-promoting k-selection strategies.
- Assumptions/dependencies: high-fidelity simulators; scalable verification; small catalog of task-specific reward templates.
- Enterprise on-device/private training using verifiers for sample efficiency
- Sector: enterprise software, privacy-focused AI
- What to do: Leverage O(log|S|) sample complexity to fine-tune models with limited private data where verifiers exist (e.g., form validation, policy compliance).
- Potential tools/products:
- On-device verification libraries; local pass@k suggestion panels with immediate validation.
- Assumptions/dependencies: robust local verifiers; hardware resources for k candidates; small reward model classes defined per enterprise workflow.
- Theory and algorithms for infinite or large reward model classes
- Sector: academia (learning theory)
- What to do: Develop approximations and structural assumptions that extend the paper’s guarantees beyond finite, low-cardinality classes (e.g., covering numbers, margin conditions).
- Potential tools/products:
- Practical learners that operate with parametric reward families; adaptive hypothesis pruning informed by verifier signals.
- Assumptions/dependencies: new generalization bounds; scalable hypothesis management; partial/inexact verification signals.
Glossary
- Consistency: Agreement of a hypothesis with all observed training examples. "consistent with the data"
- Context distribution: The probability distribution over contexts (inputs) from which instances are drawn. "context distribution"
- Contextual bandits: A bandit learning framework where the learner observes a context and chooses an action to maximize reward. "formalized as contextual bandits."
- Demonstrator: The policy that provides example actions (answers) in the training data. "assume that the demonstrator belongs to a low-capacity policy class."
- Freedman’s inequality: A concentration inequality for martingales used to bound deviations of sums of dependent random variables. "with an additional application of Freedman's inequality"
- Hellinger distance (squared Hellinger distance): A divergence measure between probability distributions; its square is often used in analysis. "is the squared Hellinger distance"
- Hypothesis class: A set of candidate functions/models among which learning is performed. "a known hypothesis class of support functions"
- Log-loss minimization: Training objective equivalent to maximizing likelihood, penalizing the negative log-probability of observed data. "or equivalently log-loss minimization"
- Markov Decision Process (MDP): A sequential decision-making model with states, actions, transitions, and rewards. "Response generation by LLMs can be seen as a Markov Decision Process (MDP),"
- Martingale difference sequences: Sequences with zero conditional expectation used in concentration analyses for dependent data. "concentration for martingale difference sequences"
- Maximum Likelihood Estimation (MLE): A method for fitting models by maximizing the likelihood of observed data. "maximum likelihood estimation (MLE)"
- Minimax optimal: Achieving the best worst-case performance among all methods under given assumptions. "MLE is minimax optimal with respect to $|\Pi|$"
- Missing mass: The probability of unobserved contexts (or events) given limited samples. "the missing mass (i.e., unobserved contexts) is arbitrarily close to 1"
- Mistake bound: An upper bound on the number of prediction errors an online algorithm makes. "makes at most … mistakes."
- Online Mistake-Unaware Contextual Bandits: An online learning setting where the learner does not observe whether its action was correct but receives a separate correct demonstration. "Online Mistake-Unaware Contextual Bandits with Correct Demonstrations:"
- Online-to-batch conversion: A technique to convert online learning guarantees into statistical (batch) learning guarantees. "We now use the online-to-batch conversion in \Cref{alg:batch-from-online} to obtain a statistical estimator."
- Pass@k: An evaluation objective where success is credited if any of k proposed answers is correct. "we discuss the pass@k objective"
- Policy class: The set of possible policies (mappings from contexts to action distributions) considered by the learner. "policy class"
- Product distribution: A joint distribution over multiple variables that factors into independent marginals. "This need not be a product distribution"
- Realizable case: Setting where the ground-truth model lies within the considered hypothesis class. "In the realizable case (when σ* ∈ S)"
- Reward model class: The set of possible reward functions or supports defining which answers are correct. "reward model class having low cardinality"
- Sample complexity: The number of training examples needed to achieve a desired performance level. "learns with sample complexity logarithmic in the cardinality of the reward class"
- Support function: A mapping from each context to the set of optimal (correct) actions for that context. "unknown support function σ*"
- Value suboptimality: The gap between the achieved expected reward of a learned policy and the optimal expected reward. "we would like to find an ε-value-suboptimal policy"
- Weighted majority rule: A prediction strategy that selects actions supported by the largest total weight of consistent hypotheses. "Predictions are based on a weighted majority rule"