
Language Models Can Learn from Verbal Feedback Without Scalar Rewards (2509.22638v1)

Published 26 Sep 2025 in cs.CL, cs.AI, and cs.LG

Abstract: LLMs are often trained with RL from human or AI feedback, yet such methods typically compress nuanced feedback into scalar rewards, discarding much of their richness and inducing scale imbalance. We propose treating verbal feedback as a conditioning signal. Inspired by language priors in text-to-image generation, which enable novel outputs from unseen prompts, we introduce the feedback-conditional policy (FCP). FCP learns directly from response-feedback pairs, approximating the feedback-conditional posterior through maximum likelihood training on offline data. We further develop an online bootstrapping stage where the policy generates under positive conditions and receives fresh feedback to refine itself. This reframes feedback-driven learning as conditional generation rather than reward optimization, offering a more expressive way for LLMs to directly learn from verbal feedback. Our code is available at https://github.com/sail-sg/feedback-conditional-policy.

Summary

  • The paper introduces a feedback-conditional policy (FCP) that leverages rich verbal feedback without reducing it to scalar rewards.
  • It details an offline maximum likelihood training and an online bootstrapping method to align model outputs with desired feedback conditions.
  • Empirical results demonstrate that FCP rivals or outperforms scalar RL baselines in controllability, data efficiency, and robustness.

Feedback-Conditional Policy: Learning from Verbal Feedback Without Scalar Rewards

Motivation and Theoretical Framework

The paper challenges the reward hypothesis, which posits that all goals and purposes in RL can be reduced to maximization of expected scalar rewards. In the context of LLMs, feedback is often verbal, nuanced, and mixed, rather than strictly scalar. Scalarization of such feedback leads to information loss, ambiguity, and reward scale imbalance across tasks. The authors propose a paradigm shift: treat verbal feedback as a conditioning signal, leveraging the strong language priors of LLMs to learn directly from response-feedback pairs.

The central construct is the feedback-conditional policy (FCP), $\pi_\theta(y|x, c)$, which models the conditional distribution of responses $y$ given instruction $x$ and feedback $c$. Offline, FCP is trained via maximum likelihood on data collected from a reference policy and a feedback environment, approximating the feedback-conditional posterior. Online, FCP is bootstrapped by conditioning on positive feedback and iteratively refining the policy with fresh critiques.
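Concretely, using the definitions quoted later in the glossary, the offline joint over response-feedback pairs and the posterior that FCP approximates can be written as follows (the normalized form of the posterior is a straightforward Bayes-rule restatement, not a verbatim equation from the paper):

```latex
% Offline joint over response-feedback pairs (as defined in the paper)
P_{\mathrm{off}}(y, c \mid x) \;\triangleq\; \pi_{\mathrm{ref}}(y \mid x)\, p_{\mathrm{env}}(c \mid x, y)

% Feedback-conditional posterior that offline MLE approximates (Bayes' rule)
P_{\mathrm{off}}(y \mid x, c) \;=\;
  \frac{\pi_{\mathrm{ref}}(y \mid x)\, p_{\mathrm{env}}(c \mid x, y)}
       {\sum_{y'} \pi_{\mathrm{ref}}(y' \mid x)\, p_{\mathrm{env}}(c \mid x, y')}
```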

Figure 1: Learning from mixed verbal feedback, illustrating how FCP can leverage diverse feedback signals to generate improved responses.

The analogy to text-to-image generation is instructive: just as models can recombine captions to generate novel images, FCP can recombine verbal feedback to generate responses aligned with user-specified conditions.

Implementation Details

Offline Training

Offline FCP training proceeds as follows:

  1. Reference Policy Generation: For each instruction $x$, sample a response $y \sim \pi_{\textrm{ref}}(y|x)$.
  2. Feedback Collection: Obtain feedback $c \sim p_{\textrm{env}}(c|x, y)$ from the environment (human or generative model).
  3. Maximum Likelihood Objective: Train $\pi_\theta(y|x, c)$ to maximize $\mathbb{E}_{(x, y, c) \sim \mathcal{D}_{\textrm{off}}}[\log \pi_\theta(y|x, c)]$.

The feedback $c$ is prepended to the instruction using special tokens (<EF>, </EF>), enabling the model to condition on arbitrary feedback at inference.
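A minimal sketch of the offline stage, assuming a Hugging Face-style causal LM. The <EF>/</EF> wrapper and the Qwen2.5-7B base model come from the paper; the dataset fields, hyperparameters, and exact prompt layout are illustrative assumptions, not the authors' code:

```python
# Minimal sketch of offline FCP training: maximum likelihood on (x, y, c) triples.
# The <EF> markers are kept as plain text here; in practice they would be
# registered as special tokens in the tokenizer.
import torch
from torch.utils.data import DataLoader
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "Qwen/Qwen2.5-7B"  # base model family used in the paper
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.bfloat16)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

def build_example(instruction: str, response: str, feedback: str):
    """Condition on feedback by wrapping it in <EF> tokens before the instruction."""
    prompt = f"<EF>{feedback}</EF>\n{instruction}\n"
    prompt_ids = tokenizer(prompt, add_special_tokens=False).input_ids
    response_ids = tokenizer(response + tokenizer.eos_token,
                             add_special_tokens=False).input_ids
    input_ids = prompt_ids + response_ids
    # Only response tokens contribute to the loss; prompt tokens are masked out.
    labels = [-100] * len(prompt_ids) + response_ids
    return {"input_ids": input_ids, "labels": labels}

def collate(batch):
    max_len = max(len(ex["input_ids"]) for ex in batch)
    pad_id = tokenizer.pad_token_id or tokenizer.eos_token_id
    input_ids, labels, attn = [], [], []
    for ex in batch:
        pad = max_len - len(ex["input_ids"])
        input_ids.append(ex["input_ids"] + [pad_id] * pad)
        labels.append(ex["labels"] + [-100] * pad)
        attn.append([1] * (max_len - pad) + [0] * pad)
    return {k: torch.tensor(v) for k, v in
            [("input_ids", input_ids), ("labels", labels), ("attention_mask", attn)]}

def train_offline(offline_triples, epochs=1, batch_size=8):
    """offline_triples: dicts with "instruction", "response", "feedback" keys,
    collected from the reference policy and the feedback environment."""
    data = [build_example(t["instruction"], t["response"], t["feedback"])
            for t in offline_triples]
    loader = DataLoader(data, batch_size=batch_size, shuffle=True, collate_fn=collate)
    model.train()
    for _ in range(epochs):
        for batch in loader:
            loss = model(**batch).loss  # token-level cross-entropy = MLE objective
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
```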

Online Bootstrapping

Online bootstrapping refines FCP by conditioning on positive feedback $c^+$:

  1. Condition Sampling: For each instruction, sample a desired feedback condition $c^+$ from a pool of positive feedback.
  2. Rollout: Generate candidate responses $y \sim \pi_\theta(y|x, c^+)$.
  3. Re-annotation: Obtain fresh feedback $c$ for each response.
  4. Policy Update: Train on $(x, y, c)$ tuples using cross-entropy loss.

This iterative process strengthens alignment with user-specified feedback, allowing the model to internalize the conditioning signal.
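A schematic of one bootstrapping round, reusing train_offline from the offline sketch above. The generate_fn and feedback_fn callables stand in for the policy's decoder and the feedback environment; their interfaces are assumptions of this sketch, not the authors' implementation:

```python
import random

def bootstrap_round(instructions, positive_pool, generate_fn, feedback_fn,
                    num_rollouts=4):
    """One round of online bootstrapping (schematic).

    positive_pool: list of positive feedback strings used as conditioning goals.
    generate_fn(feedback, instruction) -> response sampled from the current FCP
        conditioned on the positive feedback.
    feedback_fn(instruction, response) -> fresh verbal feedback from the
        environment (e.g., a feedback-generating LLM).
    """
    fresh_triples = []
    for x in instructions:
        c_plus = random.choice(positive_pool)          # 1. condition sampling
        for _ in range(num_rollouts):
            y = generate_fn(c_plus, x)                 # 2. rollout under c+
            c_new = feedback_fn(x, y)                  # 3. re-annotation
            fresh_triples.append(
                {"instruction": x, "response": y, "feedback": c_new})
    # 4. policy update: same cross-entropy / MLE objective as the offline stage
    train_offline(fresh_triples, epochs=1)
    return fresh_triples
```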

Practical Considerations

  • Feedback Environment: The authors use GPT-5-nano to simulate both real-world user-style and professional reviewer-style feedback, ensuring consistency across FCP and scalar RL baselines.
  • Batching and Loss Aggregation: Larger prompt batch sizes and debiased loss aggregation (sequence-level mean, token-level sum) improve optimization efficiency and accuracy; a minimal sketch of one such aggregation follows this list.
  • Length-Related Conditions: Conditioning on conciseness or verbosity can destabilize training, leading to collapse or instability in response length. Filtering such conditions yields more robust learning dynamics.
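One common reading of "debiased loss aggregation (sequence-level mean, token-level sum)" is to sum the token losses within each sequence and then average across sequences in the batch. The sketch below, with assumed tensor shapes, illustrates that reading rather than the authors' exact implementation:

```python
import torch

def debiased_loss(token_losses: torch.Tensor, response_mask: torch.Tensor) -> torch.Tensor:
    """token_losses: [batch, seq_len] per-token cross-entropy values.
    response_mask: [batch, seq_len], 1 on response tokens, 0 on prompt/padding.
    Sum losses over the tokens of each sequence, then average over sequences,
    so a token's weight does not depend on the length of its own sequence."""
    per_sequence = (token_losses * response_mask).sum(dim=-1)  # token-level sum
    return per_sequence.mean()                                 # sequence-level mean
```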

Figure 2: Average accuracy over five math benchmarks measured at intermediate checkpoints, showing FCP+Bootstrap matches scalar RL baselines.

Figure 3: Average response length over training steps, illustrating the effect of length-related feedback conditions on output stability.

Figure 4: Prompt template used to elicit feedback from GPT-5-nano, supporting both verbal and scalar reward pipelines.

Empirical Results

FCP is evaluated on mathematical reasoning (Big-Math, AIME, MATH500, Minerva, OlympiadBench) and general reasoning (WebInstruct, GPQA-Diamond, MMLU-Pro, TheoremQA). Key findings:

  • Offline FCP achieves accuracy comparable to rejection sampling finetuning (RFT), despite absorbing noisier supervision.
  • FCP+Bootstrap matches or slightly surpasses strong scalar RL baselines (GRPO, RFT+GRPO) in both in-domain and out-of-distribution settings.
  • Controllability: FCP internalizes the feedback condition, enabling controllable behavior (e.g., generating incoherent responses under negative feedback, increasing code ratio under stylistic conditions).
  • Robustness: FCP avoids reward hacking and over-optimization of scalar scores, sustaining high accuracy even when scalar rewards lag behind.

    Figure 5: Learning from mixed captions in text-to-image generation, motivating the feedback-conditional paradigm for LLMs.

Theoretical and Practical Implications

The FCP framework reframes feedback-driven learning as conditional generation rather than reward optimization. This has several implications:

  • Bypassing Scalarization: FCP preserves the richness of verbal feedback, avoiding the information loss and bias introduced by scalar conversion.
  • Data Efficiency: All response-feedback pairs are utilized, including mixed and uncertain feedback, improving sample efficiency over filtering-based methods.
  • Controllability: At inference, users can specify arbitrary feedback conditions, enabling fine-grained control over model behavior (see the example after this list).
  • Scalability: FCP is compatible with lightweight feedback sources, such as real-world user critiques, facilitating large-scale deployment.
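Specifying a condition at inference amounts to filling the <EF> slot of the prompt format described above. The condition and instruction below are hypothetical, and generate_fn refers to the placeholder decoder from the bootstrapping sketch:

```python
# Hypothetical inference-time conditioning: the user states the feedback they
# want the response to earn, and it fills the <EF> slot of the training format.
condition = "The response is correct, clearly reasoned, and concise."
instruction = "Prove that the sum of two even integers is even."
prompt = f"<EF>{condition}</EF>\n{instruction}\n"
# Any decoding utility for the trained FCP can sample from this prompt, e.g.:
# response = generate_fn(condition, instruction)
```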

The approach also aligns with inverse dynamics modeling, complementing supervised finetuning (SFT) as behavior cloning and critique finetuning (CFT) as forward dynamics in the LLM finetuning landscape.

Figure 6: SFT (behavior cloning) graphical model, contrasting with FCP's inverse dynamics structure.

Future Directions

Potential extensions include:

  • Hybrid Supervision: Integrating verifiable scalar rewards as neutral conditions to complement verbal feedback.
  • Multi-Turn Interaction: Extending FCP to iterative, teacher-forcing style interactions for closer alignment with human guidance.
  • Test-Time Adaptation: Conditioning on user-provided examples for rapid personalization.
  • Multimodal Feedback: Incorporating non-textual feedback signals for broader applicability.

Conclusion

The feedback-conditional policy framework demonstrates that LLMs can learn directly from verbal feedback without scalar rewards, matching the performance of scalar RL pipelines while retaining the expressiveness and controllability of natural language supervision. This paradigm offers a principled alternative to the reward hypothesis for language-centric systems, with significant implications for alignment, data efficiency, and real-world deployment of LLMs. Future work should explore hybrid and multimodal extensions, multi-turn adaptation, and broader integration of natural feedback into LLM training.


Explain it Like I'm 14

What is this paper about?

This paper asks a simple question: Can we teach AI language models using plain words (like “Nice job, but be more concise”) instead of turning those words into a single number score (like 0.8 out of 1)? The authors say yes. They introduce a new way to train models called a “feedback‑conditional policy” (FCP), which treats verbal feedback as a direct instruction for how the model should respond, rather than squeezing it into a reward number.

What are the main goals or questions?

The paper focuses on three easy‑to‑grasp goals:

  • Use real, detailed feedback (the kind a person might write) to teach a model, without converting it into one scalar reward number.
  • Let the model learn how to respond differently depending on the kind of feedback it’s given (for example, “be correct and clear” versus “has code and is concise”).
  • Show that this approach can work as well as, or better than, popular methods that rely on numeric rewards.

How does their method work?

Think of it like a student learning to write essays:

  • Traditional grading gives a single score (e.g., 85/100). That’s simple, but it throws away useful details.
  • Real comments, like “Your idea is great, but your paragraphs are too long,” contain richer guidance.

The authors build a system that learns from those richer comments in two stages:

Offline stage: Learn from past pairs of responses and feedback

  • The model (the “reference policy”) writes responses to lots of questions or tasks.
  • An “environment” (could be a person or another model) gives verbal feedback on those responses.
  • The new model, FCP, is trained on the pairs: instruction + response + feedback.
  • The key idea: treat the feedback itself as a condition. In everyday terms, the model learns “If the feedback says ‘correct and clear,’ what kind of response would produce that feedback?” It’s like learning the inverse relationship between responses and the comments they get.

Analogy: In text‑to‑image tools, you can combine ideas from different captions (“a banana” + “surfing on the ocean”) to make a new image (“a banana surfing on the ocean”). Similarly, FCP uses mixed, varied feedback to assemble better responses that fit the desired feedback style.

Online stage: Bootstrap with “positive feedback” goals

  • After the offline training, the model tries generating new responses while aiming for a chosen “positive” feedback condition (like “correct, efficient, and concise”).
  • It then receives fresh feedback on these new attempts.
  • The model updates itself using this new feedback, getting better over time at producing responses that match the chosen positive condition.

In short: instead of chasing higher reward numbers, the model learns how to produce answers that would draw the kind of feedback you want.

What did they find, and why does it matter?

Here are the main results and why they’re important:

  • FCP is competitive with strong baselines that use numeric rewards. On math and general reasoning tests, the FCP method (with the online bootstrapping step) matched or slightly beat top reward‑based methods like GRPO and RFT+GRPO.
  • It works without special “verifiers.” Many reward‑based methods depend on reliable automatic graders or strict rules. FCP learns directly from feedback text, so it can be used in messy, real‑world tasks where grading isn’t clear.
  • It preserves rich feedback. Instead of compressing mixed comments (“correct but too wordy”) into a single score, FCP uses the full detail to shape the model’s behavior.
  • It supports controllable behavior. If you tell the model to aim for feedback like “has code and is concise,” it shifts its style accordingly. This shows it truly understands the feedback condition.
  • It seems less prone to “reward hacking.” Reward‑based methods can over‑optimize for the number a scorer likes, sometimes at the cost of actual usefulness. FCP reaches high accuracy without chasing the reward model’s favorite patterns.

A caution they discovered: feedback conditions about length (like “be concise”) can make the model’s answers get shorter and shorter over time, which hurt math reasoning. Filtering out length‑related goals made training more stable.

What’s the bigger impact?

This work suggests a shift in how we train LLMs:

  • We don’t always need to force feedback into numbers. Treating verbal feedback as a first‑class training signal can make models learn more naturally.
  • It could reduce the need for expensive, carefully designed scoring systems and allow training from everyday user comments.
  • It opens the door to personalization: at test time, you could set the feedback condition to match your preferences (“explain step‑by‑step,” “use code,” “keep it simple”), and the model adapts.
  • Future ideas include combining verbal feedback with verifiable rewards when available, learning from multi‑turn conversations, adapting to a user’s style from a few examples, and even using multimodal feedback (like images or audio).

Bottom line: Instead of training models to chase a number, we can train them to listen to our words—and that might make them both smarter and more aligned with what we actually want.


Knowledge Gaps

Knowledge gaps, limitations, and open questions

Below is a concise list of what remains missing, uncertain, or unexplored, framed to guide follow-up research:

  • Human feedback realism and variance: All feedback is simulated by an LLM (“GPT-5-nano”); no experiments with real human feedback, inter-annotator disagreement, or heterogeneous user styles, leaving real-world robustness untested.
  • Generalization to unseen or compositional feedback conditions: The model is conditioned on a pool of “positive” feedback seen during training. It is unclear how well FCP adheres to novel, compositional, or out-of-distribution conditions at test time.
  • How to choose the conditioning “goal” at inference: There is no principled method to select or compose the expected positive feedback c+ per task/domain; sensitivity to the phrasing, specificity, and granularity of c+ is unquantified.
  • Residual reliance on scalar signals: The construction of the “positive feedback” pool uses scalar scores, partially undermining the “no scalar rewards” claim; scalar-free ways to identify or generate positive conditions are not provided.
  • Robustness to noisy, contradictory, deceptive, or adversarial feedback: The method assumes “non-deceptive” feedback. Failure modes and defenses under malicious or systematically biased feedback are not studied.
  • Support coverage and off-policy limits: Offline MLE only learns on the support of the reference policy. How to mitigate missing modes (e.g., rare high-quality responses) and support mismatch (importance weighting, exploration, conservative objectives) is left open.
  • Theoretical guarantees and identifiability: There are no convergence, consistency, or error-bound guarantees when optimizing forward KL to an intractable posterior with unknown and misspecified p_env; the causal implications of conditioning on a post-hoc signal (feedback) are not analyzed.
  • Token-level guidance vs sequence-level conditioning: Since p_env is defined after full responses, decoding is only guided via prepended feedback text. Whether more granular token-level credit assignment (e.g., latent alignment, token-level critics) improves learning remains open.
  • Multi-turn feedback and credit assignment: The framework is single-turn; extending to multi-turn, iterative critiques with memory and inter-turn credit assignment is only suggested as future work, with no algorithmic or empirical validation.
  • Hybridization with verifiable rewards: Although proposed in discussion, there are no experiments assessing how to combine verifiable signals with verbal conditioning (e.g., weighting schemes, curriculum, or switching policies).
  • Style-content disentanglement: Results suggest conditioning can sway style (e.g., “has_code”) at the expense of accuracy. Methods to disentangle stylistic preferences from correctness to avoid “style over substance” are missing.
  • Safety and misuse: Conditioning on harmful or negative feedback can drive unsafe or low-quality behavior (demonstrated by “fully_negative” conditions). Guardrails, constraints, and safe conditioning policies are not developed.
  • Stability under condition selection: Length-related conditions destabilize training (collapse to shorter outputs). There is no general method to detect and mitigate destabilizing conditions (e.g., condition classifiers, regularization, constraint-based objectives).
  • Scaling behavior: Only Qwen2.5-7B is evaluated. Scaling laws w.r.t. model size, data volume, feedback quality/quantity, and compute are not characterized.
  • Breadth of baselines: Comparisons omit strong scalar-free or preference-learning baselines (e.g., DPO/IPO, KTO, RLAIF with generative preference models, pairwise/ranking methods), limiting the scope of empirical claims.
  • Domain coverage: Evaluations are on math and general reasoning; no tests on coding, tool use, agentic tasks, or multi-modal settings where verbal tool feedback is most relevant.
  • Condition adherence metrics: Aside from a small probe, there is no systematic measurement of how faithfully outputs satisfy specified conditions (e.g., adherence precision/recall, calibration of control strength).
  • Reward hacking claims: The paper infers less reward hacking because scalar reward scores lag while accuracy is high, but does not include adversarial or off-distribution reward-model tests to substantiate robustness to reward gaming.
  • Distribution shift in online bootstrapping: The environment (feedback generator) is fixed and synthetic. How FCP behaves when feedback style drifts (new users, tools, domains) is untested.
  • Self-reinforcement and forgetting: Online bootstrapping samples from the current policy without off-policy correction or replay; risks of compounding bias, mode collapse, or forgetting are not quantified; no ablation on replay/regularization.
  • Data selection bias: Offline data discards prompts where all responses are correct/incorrect, potentially biasing training; the impact of this filtering on generalization has not been analyzed.
  • Sensitivity to prompting and architecture choices: Only simple concatenation with <EF> tokens is tested; effects of feedback placement, adapters, conditioning layers, or other architectures on control and stability are unexplored.
  • Inference UX and latency: Conditioning requires users (or systems) to craft c+; the usability burden, default templates, and latency overhead are not assessed.
  • Decoding strategy effects: All inference uses greedy decoding; the impact of sampling/temperature, beam search, or guidance methods on condition adherence and accuracy is unknown.
  • Data efficiency and cost: Claims of simplicity and data efficiency are not supported by ablations across feedback/data budgets or cost analyses (LLM feedback generation vs scalar pipelines).
  • Combining positive and negative feedback: While negative conditioning is shown to degrade performance when requested, principled ways to leverage negative feedback for improved learning (e.g., contrastive or adversarial training) are not explored.
  • Alternatives to forward KL: The choice of forward KL (mode-covering) is not compared to reverse KL or other f-divergences/contrastive objectives; importance weighting or variational estimators for p_env are not examined.
  • External, independent evaluation: Apart from accuracy benchmarks, there is no human evaluation or independent preference model to validate perceived quality gains and condition satisfaction beyond the training-time feedback style.

Practical Applications

Overview

This paper introduces Feedback-Conditional Policy (FCP), a training paradigm that treats verbal feedback as a conditioning signal rather than compressing it into scalar rewards. FCP is trained offline on response–feedback pairs via maximum likelihood and then bootstrapped online by generating under positive conditions and collecting fresh feedback. Empirically, FCP matches or surpasses scalar-reward baselines (RFT, GRPO) on math and general reasoning tasks, while preserving the richness of feedback and reducing reward-hacking risk.

Below are practical applications derived from the paper’s findings, organized as Immediate and Long-Term. Each item highlights concrete use cases, relevant sectors, enabling tools/workflows, and assumptions/dependencies affecting feasibility.

Immediate Applications

These applications can be deployed now with current LLMs and standard training infrastructure, leveraging response–feedback datasets (human or model-generated) and the FCP training loop.

  • Feedback-driven customer support assistants that learn from user critiques
    • Sector: software, retail/e-commerce, telecom
    • Use cases: Chatbots that adapt to mixed user feedback (“too long,” “not sure,” “try a shorter answer”) without reward engineering; support agents tuned to sentiment and style.
    • Tools/workflows: FCP fine-tuning pipeline; feedback collection UI; feedback wrapper tokens (e.g., <EF>…</EF>); bootstrapping loop that samples “positive” feedback conditions; A/B testing on contact-center metrics.
    • Assumptions/dependencies: Availability of non-deceptive textual feedback; privacy safeguards; stable reference model with strong language priors.
  • Developer copilot and code-review assistants conditioned on textual reviews
    • Sector: software/DevTools
    • Use cases: Code assistants that adjust to reviewers’ written comments (e.g., “prefer functional style,” “optimize complexity,” “avoid global state”) and produce responses likely to elicit positive reviews.
    • Tools/workflows: IDE extension capturing PR comments; feedback-pool builder; offline dataset generation from existing repos/issues; online bootstrapping on internal code tasks.
    • Assumptions/dependencies: Access to structured review text; internal policy around code privacy; careful handling of length-related conditions to avoid unstable brevity.
  • Reasoning tutors and graders that learn from teacher comments without detailed rubrics
    • Sector: education
    • Use cases: Math and general reasoning tutors trained on teacher feedback (“rigorous logic,” “too verbose,” “steps unclear”); auto-grading systems that incorporate teacher comments rather than binary pass/fail.
    • Tools/workflows: LMS integration to collect student-response + teacher-feedback pairs; FCP training with reviewer-style feedback; inference with preferred positive conditions (e.g., “clear reasoning, correct answer”).
    • Assumptions/dependencies: Sufficient quantity/quality of teacher feedback; consent and data governance; avoidance of scalar-reward dependence when rubrics are unavailable.
  • Editorial and marketing content generation tuned via free-form stakeholder feedback
    • Sector: media/marketing
    • Use cases: Copy, blog posts, and summaries that adapt to editorial notes (“more concise,” “more data-backed,” “tone: confident”); campaign assets conditioned on desired feedback (clarity, coherence, persuasion).
    • Tools/workflows: Editorial feedback capture tool; condition library (fully_positive, neutral, style tags); FCP fine-tuning and online bootstrapping cycle.
    • Assumptions/dependencies: Availability of consistent feedback; style taxonomies; safeguards against collapse from aggressive conciseness conditions.
  • Enterprise knowledge-base Q&A systems incorporating employee critiques
    • Sector: enterprise software, HR/IT
    • Use cases: Internal assistants that improve based on mixed feedback (“partly correct but outdated,” “good start, but too technical”), reducing reliance on hard verifiers.
    • Tools/workflows: Feedback ingestion from helpdesk and wiki comments; maximum likelihood offline training on response–feedback pairs; online bootstrapping targeting “clear, accurate, current” conditions.
    • Assumptions/dependencies: Reliable feedback sources across departments; data hygiene; role-based access control.
  • Agentic LLMs that leverage tool outputs as verbal feedback
    • Sector: software/AI agents
    • Use cases: Agents that integrate critiques from tools (linters, test runners, search results) expressed verbally (e.g., “unit tests failing in function X,” “retrieved doc contradicts answer”), improving conditional generation without reward hacking.
    • Tools/workflows: Tool-critique adapters that convert logs/results into natural-language feedback; FCP training and bootstrapping conditioned on “passes tests, consistent with sources.”
    • Assumptions/dependencies: High-quality tool-to-feedback translation; stable reference policies; careful handling of ambiguous or mixed tool feedback.
  • Compliance and documentation assistants that respond to reviewer-style feedback
    • Sector: finance, healthcare, legal
    • Use cases: Drafts adjusted to compliance comments (e.g., “missing disclosure,” “unclear rationale”), improving alignment with policy constraints without handcrafted reward functions.
    • Tools/workflows: Reviewer feedback templates; feedback pools for “complete, compliant, transparent”; FCP inference with target positive feedback conditions.
    • Assumptions/dependencies: Domain-expert feedback availability; regulatory constraints on data use; audit trails for updates.
  • Evaluation teams adopting richer feedback signals to reduce reward hacking
    • Sector: AI governance, ML operations
    • Use cases: Model training/evaluation workflows replacing scalar-only scoring with verbal critiques, mitigating reward gaming and preserving nuanced preferences.
    • Tools/workflows: Unified feedback prompt template; side-by-side comparison dashboards (accuracy vs. reward scores); bootstrapping schedule tuned to avoid over-optimization of scalar signals.
    • Assumptions/dependencies: Organizational buy-in to move beyond scalar metrics; calibration practices for feedback quality; monitoring of length/verbosity dynamics.
  • Personal digital assistants learning from everyday user comments
    • Sector: consumer software
    • Use cases: Assistants that quickly adapt to preferences (“short summaries,” “more examples,” “avoid jargon”) via verbal conditioning rather than parameterized settings only.
    • Tools/workflows: In-app feedback capture; light-weight offline FCP fine-tuning; on-device or privacy-preserving bootstrapping.
    • Assumptions/dependencies: Consent and privacy; sufficient interactions per user; careful guardrails to avoid amplifying negative conditions.

Long-Term Applications

These rely on further research, scaling, or development—e.g., multi-turn integration, personalization, multimodal feedback, standardized practices, and sector-specific validation.

  • Multi-turn feedback integration for iterative guidance
    • Sector: software, education, healthcare
    • Description: Extend FCP to incorporate feedback at each turn (teacher-forcing style), enabling richer iterative instruction and correction.
    • Potential products: “Multi-Turn FCP Trainer” with conversation-level datasets; adaptive tutoring agents; clinical documentation assistants that refine notes through clinician rounds.
    • Dependencies: Datasets with turn-level critiques; sequence-level stability; safety review for sensitive domains.
  • Hybrid verifiable + verbal feedback training
    • Sector: software engineering, math/science reasoning
    • Description: Combine verifiable signals (tests, proofs) with the “null feedback” condition to complement verbal critiques where available, improving robustness.
    • Potential products: “Verifier-Aware FCP SDK”; pipelines that treat missing feedback as neutral and plug in verifiable rewards where applicable.
    • Dependencies: Reliable verifiers; routing between verifiable and verbal pathways; calibration to avoid domain imbalance.
  • Personalization via few-shot feedback conditioning
    • Sector: consumer apps, enterprise tools
    • Description: Test-time adaptation using a small set of user feedback exemplars, yielding fast alignment to style and preferences.
    • Potential products: “Feedback Persona Cards”; per-user condition libraries; cross-session memory integration.
    • Dependencies: On-device storage or privacy-preserving personalization; handling preference drift; opt-in frameworks.
  • Multimodal feedback conditioning (text, audio, images, UI signals)
    • Sector: robotics, accessibility, design tools
    • Description: Fuse non-text feedback (e.g., gestures, audio comments, marked-up documents) into the condition to guide generation across modalities.
    • Potential products: “Multimodal Feedback Encoder”; design co-pilots that learn from annotated mockups; robots that adapt to operator utterances and demonstrations.
    • Dependencies: Multimodal models; robust alignment across modalities; domain-specific safety validations.
  • Standardized feedback taxonomies and collection protocols
    • Sector: AI governance, enterprise ML
    • Description: Establish shared schemas (e.g., clarity, correctness, coherence, conciseness, compliance) for feedback capture, improving model conditioning and comparability.
    • Potential products: “Feedback Schema Registry”; SDKs for CRM/LMS/IDE integration; analytics for feedback quality and coverage.
    • Dependencies: Cross-organizational consensus; training data pipelines; UX that elicits useful critiques without high burden.
  • Sector-specific validated FCP deployments (healthcare, finance, legal)
    • Sector: healthcare, finance, legal
    • Description: Rigorous pilots where FCP is tuned with domain-expert feedback, measuring error reduction and compliance improvements while avoiding reward hacking.
    • Potential products: Clinical note assistants; regulatory report generators; case summarizers conditioned on reviewer feedback.
    • Dependencies: Expert-labeled datasets; auditability; bias/fairness assessments; regulatory approvals.
  • Agentic systems robust to deceptive or mixed feedback
    • Sector: autonomous agents, search/retrieval systems
    • Description: Research on detecting and mitigating deceptive/low-quality feedback, learning robust conditioning from noisy tool outputs and human critiques.
    • Potential products: “Feedback Robustifier” (quality filters, uncertainty models); trust-aware bootstrapping schedulers; safety monitors for agent loops.
    • Dependencies: Feedback quality estimation; adversarial testing; organizational safety policies.
  • Curriculum design for feedback conditioning and stability
    • Sector: ML research, MLOps
    • Description: Study curriculum schedules (e.g., avoid early over-emphasis on conciseness) to prevent length collapse and stabilize bootstrapping.
    • Potential products: “Condition Curriculum Planner”; length/verbosity controllers; training diagnostics that track accuracy vs. scalar scores vs. response length.
    • Dependencies: Longitudinal experimentation; task-dependent heuristics; deployment monitoring.
  • Benchmarks and metrics beyond scalar rewards
    • Sector: ML research, standardization bodies
    • Description: New evaluation sets capturing nuanced verbal feedback efficacy (e.g., accuracy under different conditions, controllability, robustness to mixed feedback).
    • Potential products: “Feedback-Conditioned Reasoning Benchmarks”; multi-dimensional metrics dashboards; shared leaderboards.
    • Dependencies: Community adoption; data curation; reproducible FCP baselines.
  • Privacy-preserving, on-device FCP
    • Sector: consumer devices, enterprise privacy
    • Description: Localized training from user feedback, enabling personalization without transferring raw critiques off-device.
    • Potential products: Edge FCP runtimes; federated bootstrapping; secure condition libraries.
    • Dependencies: Efficient fine-tuning (LoRA/PEFT), device compute; federated learning infrastructure; robust consent mechanisms.
  • Marketplace and tooling for feedback-as-a-signal
    • Sector: developer platforms, marketplaces
    • Description: Ecosystem around collecting, sharing, and leveraging feedback conditions (e.g., style packs, reviewer presets) for domain-specific models.
    • Potential products: “Condition Store” (e.g., compliance, editorial, tutoring styles); feedback aggregation services; API for feedback-conditioned inference.
    • Dependencies: IP/licensing norms; quality control; interoperability across model providers.

Cross-cutting assumptions and dependencies

  • Non-deceptive, sufficiently informative feedback: FCP depends on feedback that reflects actual quality or preferences; low-quality conditions degrade training.
  • Strong language priors: Base models must already interpret and compose feedback; larger, instruction-tuned LLMs help.
  • Data pipelines: Need robust collection of response–feedback pairs; attention to privacy, consent, and governance.
  • Stability under condition choice: Length-related conditions can destabilize bootstrapping (e.g., pushing excessive conciseness); curriculum and filtering may be required.
  • Reference policy and inference controllability: FCP relies on a reference model and user-specified “positive” conditions at inference; tooling should expose condition selection safely.
  • Domain validation: Safety-critical domains (healthcare, finance, legal) require rigorous evaluation, auditing, and regulatory compliance.
  • Avoiding scalar over-optimization: Benefits include reduced reward hacking, but organizations must adopt richer evaluation practices to replace scalar-only metrics.

Glossary

  • Advantage estimation: Computing advantage values to guide policy updates in reinforcement learning. "GRPO instead uses group-normalized scalar rewards to estimate advantages and has become one of the strongest online methods, especially in math reasoning where answers can usually be verified automatically."
  • Behavior cloning: Supervised learning that imitates expert or reference behavior by directly learning from demonstrations. "We observe that our FCP learning in Eq.~(\ref{eq2}) aligns with modeling inverse dynamics~\citep{brandfonbrener2023inverse}, complementing supervised finetuning (SFT) as behavior cloning, and critique finetuning (CFT)~\citep{wang2025critique} as forward dynamics."
  • Behavior policy: The policy used to generate data (rollouts) during training. "Concretely, we conduct online training by sampling rollouts from the behavior policy $\pi_{\theta}(\cdot|x, c^{+})$ (goal-conditioned on positive feedback), and re-annotating them with fresh feedback to refine itself."
  • Bootstrapping (online): Iteratively improving a model by generating data with its current parameters and retraining on newly obtained feedback. "We further develop an online bootstrapping stage where the policy generates under positive conditions and receives fresh feedback to refine itself."
  • Critique Finetuning (CFT): Finetuning a model using detailed critiques of its outputs to improve reasoning quality. "CFT can perform well with high-quality and detailed critiques~\citep{wang2025critique}, but applying it to the same coarse and lightweight feedback used for FCP leads to severe degradation—worse than the base model (Table~\ref{tab:main_result_math})."
  • Feedback-conditional policy (FCP): A policy that conditions generation on verbal feedback signals rather than scalar rewards. "Inspired by language priors in text-to-image generation, which enable novel outputs from unseen prompts, we introduce the feedback-conditional policy (FCP)."
  • Feedback-conditional posterior: The distribution over responses given both the instruction and observed feedback. "In the offline setting, where responses are collected from $\pi_{\textrm{ref}}$, we define the joint distribution of response-feedback pairs as $P_{\textrm{off}}(y,c|x)\triangleq \pi_{\textrm{ref}}(y|x)\cdot p_{\textrm{env}}(c|x,y)$, from which we derive the feedback-conditional posterior distribution:"
  • Forward dynamics: Modeling how actions/responses lead to outcomes or feedback, complementing inverse dynamics. "We observe that our FCP learning in Eq.~(\ref{eq2}) aligns with modeling inverse dynamics~\citep{brandfonbrener2023inverse}, complementing supervised finetuning (SFT) as behavior cloning, and critique finetuning (CFT)~\citep{wang2025critique} as forward dynamics."
  • Forward KL divergence: KL divergence computed as KL(P||Q), often used to match a target distribution with maximum likelihood. "To avoid intractability of computing $\log p_{\textrm{env}}(c^{+}|x,y)$ in the reverse KL divergence, we instead propose to minimize the forward KL divergence between $\pi(y|x,c^{+})$ and $P_{\textrm{off}}(y|x,c^{+})$."
  • Generative reward models: Models that produce verbalized critiques and/or scalar scores rather than binary correctness signals. "Such feedback may come from human users~\citep{stephan2024rlvf}, generative reward models~\citep{zhang2024generative,mahan2024generative}, or tool outputs in agentic scenarios~\citep{wang2025ragen,jin2025search}."
  • GRPO: An online RL algorithm that uses group-normalized scalar rewards to stabilize training and estimate advantages. "GRPO instead uses group-normalized scalar rewards to estimate advantages and has become one of the strongest online methods, especially in math reasoning where answers can usually be verified automatically."
  • Group-normalized rewards: Rewards normalized within groups to reduce scale imbalance and stabilize policy optimization. "GRPO instead uses group-normalized scalar rewards to estimate advantages and has become one of the strongest online methods, especially in math reasoning where answers can usually be verified automatically."
  • Instruction-tuned model: A model finetuned on instruction-following datasets to better respond to user prompts. "The reference policy $\pi_{\textrm{ref}}$ may represent a base model, an instruction-tuned model, or a reasoning model."
  • Inverse dynamics: Learning to infer actions/responses from states and outcomes (feedback), the inverse of forward dynamics. "We observe that our FCP learning in Eq.~(\ref{eq2}) aligns with modeling inverse dynamics~\citep{brandfonbrener2023inverse}."
  • KL-constrained reward maximization: An RL objective maximizing expected reward while regularizing deviation from a reference policy via KL. "Following \citet{rafailov2023direct}, we show that $P_{\textrm{off}}(y|x,c^{+})$ is the optimal solution to a KL-constrained reward maximization problem with reward function $\log p_{\textrm{env}}(c^{+}|x,y)$:"
  • KL regularization: Penalizing divergence from a reference policy using KL to prevent overfitting or large policy shifts. "In the special case where the environment provides verifiable rewards... we can show that $P_{\textrm{off}}(y|x, c^{+})$ reduces to the optimal solution of a 0-1 binary reward maximization problem without KL regularization."
  • Language priors: Pretrained linguistic knowledge that enables models to generalize and recombine conditions or prompts. "Our approach is inspired by language priors in text-to-image generation, where models compose unseen prompts from mixed captions (Figure~\ref{demo2})."
  • Length bias: A training bias where longer or shorter sequences are favored due to aggregation in loss or reward. "In Algorithm~\ref{alg:online}, cross-entropy on self-sampled responses reduces to policy gradient with unit advantages, which suffers from length bias~\citep{liu2025understanding}."
  • Maximum likelihood training: Optimizing parameters to maximize the likelihood of observed data under the model. "This objective in Eq.~(\ref{eq2}) reduces to maximum likelihood training, which is straightforward to implement and optimize with data collected from $\pi_{\textrm{ref}}(y|x)$ and $p_{\textrm{env}}(c|x,y)$."
  • Policy gradient: A class of RL methods that update policies via gradients of expected returns with respect to parameters. "In Algorithm~\ref{alg:online}, cross-entropy on self-sampled responses reduces to policy gradient with unit advantages."
  • Reference policy: A fixed or pretrained policy used as a prior or baseline for training and regularization. "We begin with a reference policy model $\pi_{\textrm{ref}}$ that takes an input instruction $x$ and generates a response $y \sim \pi_{\textrm{ref}}(\cdot|x)$."
  • Rejection Sampling Finetuning (RFT): Offline finetuning by filtering and training only on high-quality or correct samples. "RFT filters responses by correctness and finetunes only on the correct ones, which in the offline case reduces to training on a binary scalar score (correct/incorrect)."
  • Reward hacking: Exploiting the reward model or metric to gain high scores without genuinely improving task performance. "This demonstrates a simple and scalable framework that preserves the richness of verbal feedback while avoiding the scarcity of rule-based verifiers and the risk of reward hacking."
  • Reward hypothesis: The RL assumption that goals can be captured by maximizing expected cumulative scalar rewards. "“That all of what we mean by goals and purposes can be well thought of as maximization of the expected value of the cumulative sum of a received scalar signal (reward).”"
  • Reverse KL divergence: KL divergence computed as KL(Q||P), often encouraging mode-seeking behavior in policy learning. "Note that the objective in Eq.~(\ref{eqRLKL}) is equivalent to minimizing the reverse KL divergence between $\pi(y|x,c^{+})$ and $P_{\textrm{off}}(y|x,c^{+})$:"
  • Scalarization: Converting rich feedback into single scalar reward values for RL, often causing information loss. "Scalarization has long been seen as unavoidable, bridging verbal feedback and the numerical signals required by RL."
  • Verifiable rewards: Rewards that can be deterministically checked (e.g., correctness) rather than inferred or subjective. "In the special case where the environment provides verifiable rewards, that is, $p_{\textrm{env}}(c^{+}|y^{+})=1$ for correct responses $y^{+}$ and $p_{\textrm{env}}(c^{+}|y^{-})=0$ for incorrect responses $y^{-}$..."
  • vLLM: An efficient LLM inference engine enabling fast decoding and serving at scale. "Inference uses vLLM~\citep{kwon2023efficient} with greedy decoding and a maximum generation length of 8192 tokens."