Programming by Rewards (PBR)
- Programming by Rewards (PBR) is a framework that uses scalar reward signals instead of explicit labels to synthesize programs and policies, integrating reinforcement learning and program induction.
- It employs methodologies such as programmatic reward scripting, preference-based models, and example-driven surrogates to optimize complex decision processes.
- PBR addresses challenges like reward misspecification and exploration stability, offering actionable insights for efficient program synthesis and robust alignment.
Programming by Rewards (PBR) is a paradigm for program synthesis and sequential decision-making in which the specification, optimization, or induction of programs is governed not by explicit supervised labels but by scalar reward signals or reward-based feedback. Modern PBR encompasses a variety of methodologies, including programmatic reward scripting, optimization over black-box reward functions, preference-based and outcome/process-based evaluation, and program induction via success examples. It bridges classical supervised program synthesis, reinforcement learning (RL), and inverse RL while facilitating learning, alignment, or strategy extraction in domains where explicit ground-truth demonstrations or direct symbolic supervision are infeasible, undesirable, or ill-defined.
1. Formal Definitions and Core Principles
PBR is characterized by the abstraction of rewards as the driving force for induction, optimization, or search over programs, policies, or strategies:
- General specification: Given input features $x \in \mathcal{X}$, the goal is to synthesize a program $P$ (or policy $\pi$) that maximizes the expected reward when composed with a black-box reward function $r$, i.e., $\max_P \mathbb{E}_x[r(x, P(x))]$ or, in the sequential RL case, maximizing $\mathbb{E}_\pi\left[\sum_t \gamma^t r(s_t, a_t)\right]$, where $r$ may be explicit, programmatic, implicit, or defined by preferences, examples, or logical specification (Natarajan et al., 2020, Zhou et al., 2021, Abolafia et al., 2018, Donnelly et al., 17 Oct 2025).
- Reward specification: PBR accepts a wide range of reward sources:
- Black-box scalar rewards returned by executing decisions or program outputs (Natarajan et al., 2020, Abolafia et al., 2018).
- Preference signals over entire or partial trajectories (e.g., pairwise comparisons, process-based scoring) (Zhan et al., 2023, Groeneveld et al., 27 Oct 2025, Fan et al., 7 Aug 2025).
- Explicit code or logical monitors (as in DSLs or formal language grammars) (Zhou et al., 2021, Donnelly et al., 17 Oct 2025).
- User-provided success examples, used as data-driven reward surrogates (Eysenbach et al., 2021).
- Program space: May be defined by loop-free if-then-else trees (Natarajan et al., 2020), general program sketches or domain-specific languages (DSLs) (Zhou et al., 2021), neural sequence models (Abolafia et al., 2018), or context-free grammars for strategies (Correa et al., 2024).
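The general specification above can be made concrete with a toy sketch: exhaustive search over a tiny finite program space for the candidate maximizing expected black-box reward. All names and the threshold-predicate program space below are illustrative, not taken from any cited system.

```python
# Toy PBR objective: choose the program P maximizing E_x[r(x, P(x))],
# where r is a black box that scores outputs without revealing the
# target behavior. Exhaustive search is feasible only for tiny spaces,
# but it makes the objective concrete.

def expected_reward(prog, reward_fn, inputs):
    """Average black-box reward of a program's outputs over inputs."""
    return sum(reward_fn(x, prog(x)) for x in inputs) / len(inputs)

def synthesize_by_reward(program_space, reward_fn, inputs):
    """Return the program with maximal expected reward."""
    return max(program_space,
               key=lambda p: expected_reward(p, reward_fn, inputs))

# Program space: threshold predicates x > t. The reward oracle scores
# agreement with a hidden target rule (x > 6) without exposing it.
programs = [lambda x, t=t: x > t for t in range(10)]
reward_fn = lambda x, out: 1.0 if out == (x > 6) else 0.0
best = synthesize_by_reward(programs, reward_fn, list(range(10)))
```

Here the searcher only ever observes scalar rewards, never the hidden rule itself, which is the defining restriction of the PBR setting.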
2. Programmatic, Preference, and Example-Based Reward Specification
Programmatic reward scripting:
Human engineers specify reward functions as structured programs (possibly in a DSL) that encode subgoals, temporal logic, stateful predicates, or complex task decompositions. These are used both for direct RL and for inferring parameters (so-called "hole-filling") from minimal expert demonstrations (Zhou et al., 2021, Donnelly et al., 17 Oct 2025). DSLs such as "Imp" (Natarajan et al., 2020) and extensions using RML (Donnelly et al., 17 Oct 2025) provide loop-free and logic-rich representations, supporting formal compilation into reward machines or state automata.
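A hedged, much-simplified illustration of sketch-style "hole-filling" (all names hypothetical; real systems use DSLs such as Imp or RML): a reward script with a single unfilled threshold parameter is completed from a handful of expert demonstrations.

```python
# Minimal programmatic-reward sketch with one hole (??), filled from
# expert demonstrations. Purely illustrative of the hole-filling idea.

def make_reward(threshold):
    """'Compile' the reward sketch
           if dist_to_goal < ?? then 1.0 else -0.01
       once the hole ?? is bound to a concrete threshold."""
    def reward(dist_to_goal):
        return 1.0 if dist_to_goal < threshold else -0.01
    return reward

def fill_hole(demo_final_dists, slack=1.05):
    """Bind ?? to the loosest threshold consistent with every expert
    demonstration ending inside the goal region."""
    return max(demo_final_dists) * slack

demo_final_dists = [0.08, 0.12, 0.10]   # final goal distances in demos
reward = make_reward(fill_hole(demo_final_dists))
```

The engineer writes only the structure of the reward (a subgoal bonus plus a per-step penalty); the demonstrations determine the free parameter.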
Preference-based reward models:
PBR extends to preference-based RL (PbRL), which dispenses with explicit reward functions in favor of binary or scalar preferences provided by humans or automated oracles over pairs of trajectories, sub-trajectories, or reasoning traces. This enables efficient alignment in scenarios where scalar rewards are hard to calibrate or susceptible to reward hacking (Zhan et al., 2023, Groeneveld et al., 27 Oct 2025). Modern frameworks incorporate batch or offline preference collection as well as inference of underlying linear or non-linear reward parameterizations with provable sample-complexity improvements.
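The core of most PbRL pipelines is a Bradley–Terry-style model, $P(\tau_a \succ \tau_b) = \sigma(r(\tau_a) - r(\tau_b))$, fit to pairwise labels. A minimal sketch, assuming a linear reward over hand-chosen trajectory features (the feature map and data below are illustrative):

```python
import math

# Fit a linear reward r(tau) = w . phi(tau) from pairwise preferences
# via the Bradley-Terry model, maximizing log-likelihood by gradient
# ascent. Illustrative sketch, not any cited system's implementation.

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_preferences(pairs, dim, lr=0.5, epochs=200):
    """pairs: list of (phi_preferred, phi_other) feature vectors."""
    w = [0.0] * dim
    for _ in range(epochs):
        for fa, fb in pairs:
            diff = [a - b for a, b in zip(fa, fb)]
            p = sigmoid(sum(wi * di for wi, di in zip(w, diff)))
            g = 1.0 - p          # d/dw of log sigmoid(w . diff)
            w = [wi + lr * g * di for wi, di in zip(w, diff)]
    return w

# Trajectories summarized by 2-d features; the preferred one always has
# the larger first coordinate, so weight w[0] should come out positive.
pairs = [((1.0, 0.2), (0.3, 0.2)), ((0.9, 0.5), (0.1, 0.5))]
w = fit_preferences(pairs, dim=2)
```

No scalar reward is ever elicited from the annotator, only comparisons, which is what makes the approach robust when absolute reward scales are hard to calibrate.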
Example-based reward surrogates:
A further relaxation replaces reward code with positive outcome examples: the agent learns to maximize the probability of achieving states matching these examples, via recursive classification or other value-based surrogates (Eysenbach et al., 2021). This removes hand-engineered reward code entirely, using data-driven Bellman equations and avoiding the need for learned reward models.
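A simplified tabular caricature of this idea (coarser than the recursive-classification method itself): treat user-provided success states as absorbing with value 1 and back up the discounted probability of reaching them, so no reward code is ever written.

```python
# Example-based value backup on a small deterministic MDP: success
# states (user-provided examples) get value 1; other states back up
# gamma * max over successor values. Illustrative only; the cited
# method instead trains a recursive classifier from sampled states.

def example_based_values(n_states, transitions, success_states,
                         gamma=0.9, iters=100):
    """transitions[s] -> list of reachable next states from s."""
    v = [0.0] * n_states
    for _ in range(iters):
        v = [1.0 if s in success_states
             else gamma * max(v[t] for t in transitions[s])
             for s in range(n_states)]
    return v

# 4-state chain 0-1-2-3 where state 3 matches a success example.
chain = {0: [0, 1], 1: [0, 2], 2: [1, 3], 3: [3]}
values = example_based_values(4, chain, success_states={3})
```

The resulting values decay geometrically with distance from the example states, acting as a dense surrogate reward that was never hand-specified.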
3. Methodologies: Optimization, Synthesis, and Learning Procedures
PBR admits diverse optimization procedures:
- Continuous optimization with program parameterizations: Approaches such as (Natarajan et al., 2020) embed discrete program structures into continuous parameterizations (e.g., neural networks approximating decision-trees) and use bandit gradient descent or policy gradients for sample-efficient search.
- Priority Queue Training (PQT): Maintains a buffer of the top-K highest-reward programs, trains a neural model to imitate these, and iteratively improves the program population. This regime combines supervised log-likelihood updates on the elite set and policy-gradient exploration, yielding improved likelihood of high-reward synthesis and advantageous trade-offs between exploration and exploitation (Abolafia et al., 2018).
- Bayesian program induction: Defines a posterior over strategies or programs, optimizing for both expected cumulative reward and simplicity, e.g., via Metropolis–Hastings or probabilistic sampling over a context-free grammar (Correa et al., 2024). This admits Pareto frontiers reflecting the trade-off between interpretability and effectiveness.
- Preference-based reward inference: Algorithms such as REGIME (Zhan et al., 2023) separate exploration of the feature/state-action space from preference querying, allowing sample-efficient estimation of underlying reward functions with theoretical guarantees, particularly in linearly parameterized MDPs.
- Logic-based reward synthesis and reward machines: Utilizing RML or similar formal languages, reward functions are compiled to automata, enabling transparent integration with standard RL while supporting rich non-Markovian and parametric temporal specifications (Donnelly et al., 17 Oct 2025).
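The PQT loop above can be sketched in miniature. This toy version replaces the neural imitation model with random mutation of elites (a stand-in for sampling from a model trained on the elite buffer); the program space and reward are invented for illustration.

```python
import heapq
import random

def mutate(prog, rng):
    """Resample one position of the program; a crude stand-in for
    sampling from a model trained to imitate the elite buffer."""
    i = rng.randrange(len(prog))
    out = list(prog)
    out[i] = rng.randrange(8)
    return tuple(out)

def pqt_search(init, reward_fn, k=5, steps=500, seed=0):
    """Keep the top-k highest-reward programs; propose near elites."""
    rng = random.Random(seed)
    queue = [(reward_fn(init), init)]        # min-heap of (reward, program)
    for _ in range(steps):
        _, elite = rng.choice(queue)         # propose near a current elite
        cand = mutate(elite, rng)
        heapq.heappush(queue, (reward_fn(cand), cand))
        if len(queue) > k:
            heapq.heappop(queue)             # evict the lowest-reward entry
    return max(queue)[1]

# Reward: negative Hamming distance to a hidden target program,
# observable only through reward_fn (a black box to the searcher).
target = (3, 1, 4, 1, 5)
reward_fn = lambda p: -sum(a != b for a, b in zip(p, target))
best = pqt_search((0, 0, 0, 0, 0), reward_fn)
```

Even this crude variant exhibits the characteristic PQT dynamic: the elite buffer ratchets upward monotonically while mutations around it provide exploration.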
4. Practical Applications and Domains
PBR methods have been grounded in a variety of empirical settings:
- Heuristic and search function synthesis: PBR has been applied to tune search and ranking heuristics in industrial program synthesis frameworks (e.g., Microsoft PROSE), often outperforming bandit-based or standard RL methods in sample complexity and matching or exceeding hand-tuned accuracy (Natarajan et al., 2020).
- Algorithmic program synthesis: Neural program synthesis in Turing-complete languages with sparse and black-box rewards benefits from PQT-based PBR, yielding superior synthesis rates and human-interpretable solutions relative to genetic algorithms and policy-gradient methods (Abolafia et al., 2018).
- Preference-learning in LLMs and code generation: Fine-tuning LLMs via process- and outcome-based rewards, explicitly separated and estimated in architectures such as Phi-4, results in notable gains in identifying highly correct generations ("Pass@1" and "Pass@3" improvements of up to 22%) and reliable step-wise evaluation for pruning and early stopping (Groeneveld et al., 27 Oct 2025, Fan et al., 7 Aug 2025).
- Strategy induction and cognitive modeling: Bayesian program induction over interpretable strategy grammars efficiently captures both simple and complex exploration heuristics in classical and non-stationary bandit settings, exposing resource-rational trade-offs akin to those observed in animal and human learning (Correa et al., 2024).
- RL from example outcomes: Recursive classification methods surpass adversarial and explicit reward learning approaches in RL domains ranging from low-dimensional state tasks to high-dimensional visual manipulation, leveraging only a small set of success snapshots rather than demonstration trajectories or reward code (Eysenbach et al., 2021).
- Language-based reward specification: Non-Markovian, temporal, and counting behaviors are concisely and scalably encoded in RML, compiled to automata (Reward Machines) and used to specify structured reward signals for RL agents. This method supports richer expressivity and interpretable subgoal tracking versus standard black-box reward functions (Donnelly et al., 17 Oct 2025).
5. Reward Design, Sample Complexity, and Theoretical Foundations
Designing reward functions under PBR is explicitly framed as an optimization problem over learning speed, action gap, and effective planning horizon (Sowerby et al., 2022):
- Action gap maximization: Ensures the value difference between optimal and suboptimal actions is maximized, mitigating overfitting and enabling faster learner convergence.
- Subjective discount minimization: Seeks the minimal effective discount factor at which the target policy remains optimal, reducing the required lookahead and yielding faster Q-learning or R-max convergence.
- LP-based reward selection: Algorithms solve linear programs at each candidate discount, identifying rewards that maximize the minimal action gap and minimize subjective discount, with empirical results confirming substantial improvements over naive or hand-tuned dense rewards.
- General reward-design principles: These include penalizing every step, assigning increasing reward magnitude to subgoals, and optimizing dense rewards carefully by jointly considering action gap and planning horizon.
- Sample complexity and human feedback: Frameworks such as REGIME in preference-based PBR decouple exploration and querying, achieving query complexities that scale with feature dimension rather than state/action cardinality, and offering explicit theoretical guarantees on value-suboptimality and sample requirements (Zhan et al., 2023).
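The action-gap objective can be made concrete with a toy sketch (not the paper's LP formulation, which this grid search merely imitates): on a hypothetical 3-state chain whose target policy always advances toward an absorbing goal, search rewards in [-1, 1] for the assignment maximizing the minimum action gap at a fixed discount.

```python
import itertools

# Reward design as optimization: maximize the minimum action gap of the
# target policy ("advance") over per-state rewards for "advance" and
# "stay". Grid search stands in for the LP; MDP and bounds are invented.

def min_action_gap(r_adv, r_stay, gamma=0.9):
    """r_adv[s], r_stay[s] for s in {0, 1}; state 2 is absorbing."""
    v1 = r_adv[1]                    # value under target policy, V(2) = 0
    v0 = r_adv[0] + gamma * v1
    gap0 = v0 - (r_stay[0] + gamma * v0)   # "stay" loops in place
    gap1 = v1 - (r_stay[1] + gamma * v1)
    return min(gap0, gap1)

grid = [-1.0, 0.0, 1.0]
best = max(
    (min_action_gap((a0, a1), (s0, s1)), (a0, a1), (s0, s1))
    for a0, a1, s0, s1 in itertools.product(grid, repeat=4))
```

The optimizer pushes "advance" rewards to the upper bound and "stay" rewards to the lower bound, exactly the "penalize every step, reward subgoals" shape the general design principles recommend.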
6. Challenges, Limitations, and Practical Guidelines
Despite its flexibility, PBR faces multiple challenges:
- Reward misspecification and hacking: The separation of outcome/process indicators and the use of posterior gating for process rewards, as in Posterior-GRPO (Fan et al., 7 Aug 2025), are critical for avoiding pathological optimizations in LLMs and policy models.
- Exploration stability: Direct RL with programmed rewards on formal language tasks and LLMs can be unstable or inadequate for learning entirely new domains, even where programmatic grading is available. Entropy regularization, pre-training, and hybrid search/RL techniques may be necessary (Padula et al., 2024).
- Computational cost: Reward evaluation may require running full program executions, simulators, or automata, creating a bottleneck for both training speed and practical sample complexity (Natarajan et al., 2020, Donnelly et al., 17 Oct 2025).
- Interpretability-resource trade-off: Bayesian program induction approaches naturally illuminate the cost-accuracy Pareto frontier, but in richer domains, amortization, grammar refinement, or meta-learning are needed for scalability (Correa et al., 2024).
- Consistency and generalization: Transferring learned reward programs or strategies across domain configurations is possible in well-structured PBR, but depends on the sketch or grammatical assumptions (Zhou et al., 2021).
Practical guidelines emphasize the following (Sowerby et al., 2022, Padula et al., 2024):
- Select a minimal feature set or sketch structure for the task.
- Use LP or preference-based reward inference for robust, fast learning.
- Validate reward specification via action gap and effective horizon metrics.
- Incorporate batch or process rewards, entropy regularization, and partial sketches for alignment and exploration stability.
- Prefer interpretable and parametric reward specifications where possible, enabling debugging and formal verification.
7. Future Directions and Extensions
Research continues to expand the expressiveness and robustness of PBR:
- Meta-sketch and automated DSL induction: Automatic generation of reward sketches or DSLs from high-level specifications (e.g., via LTL or natural language) is an ongoing research area (Zhou et al., 2021).
- Reward learning from examples and event-based RL: The generalization of PBR to mixture events, partial/uncertain success labels, and goal-conditioned policies promises broad applicability in offline RL, vision-based control, and across mixed-modality tasks (Eysenbach et al., 2021).
- Hierarchical and multi-level PBR: Maintaining multiple priority queues for sub-programs, integrating hierarchical RL with PBR reward managers, and combining search and RL with process/outcome critics are active lines of work (Abolafia et al., 2018, Groeneveld et al., 27 Oct 2025, Fan et al., 7 Aug 2025).
- Formal synthesis, verification and monitoring: Integration of reward automata with formal verification, as well as scaling RML-based methods to larger and dynamically specified tasks, are being developed (Donnelly et al., 17 Oct 2025).
- Process-aligned LLM fine-tuning: Expanding process- and outcome-based reward assignment to larger-context LLMs, iterative self-improvement, and dynamic task-aware reward discounting are priorities for next-generation LLM alignment (Fan et al., 7 Aug 2025).
PBR thus constitutes a versatile framework—encompassing program synthesis, decision-making, strategy induction, and reward specification—that is grounded in formal optimization and learning theory, robust to diverse reward input modalities, and empirically validated across a spectrum of synthetic and real-world domains.