Bipolar Float Reward (BFR) Mechanism
- Bipolar Float Reward (BFR) is a reward structuring approach that assigns +1 for fully correct answers while mapping partial correctness to negative values.
- It overcomes the non-negative reward trap by providing a clear penalty for logical imperfections, thus enhancing training stability and convergence.
- The Split Q-Learning framework complements BFR by decomposing rewards into distinct gain and loss channels, enabling nuanced policy optimization.
The Bipolar Float Reward (BFR) mechanism is an advanced reward structuring paradigm designed to address critical bottlenecks in reinforcement learning for general-purpose reasoning tasks, particularly within LLMs. Distinct from conventional binary or graded float rewards, BFR deploys a reward mapping that sharply distinguishes fully correct answers from all forms of logical imperfection, overcoming reward sparsity and sub-optimal policy attraction. In parallel, the Split Q-Learning framework implements a related "bipolar float-reward" design by decomposing reward feedback into explicit gain and loss channels with tunable bias and memory parameters, enabling nuanced control over agent reward processing. Both paradigms provide powerful machinery for modeling, optimizing, and analyzing reasoning policies in complex environments (Liu et al., 6 Jan 2026, Lin et al., 2019).
1. Motivation and Background
Traditional reinforcement learning methods typically utilize either binary rewards ($\{0, 1\}$) or non-negative floats. Binary feedback is notoriously sparse, failing to guide policies toward incremental improvements: both a near-perfect solution and a completely incorrect answer receive the same reward of $0$ if not flawless. Graded float rewards improve granularity but suffer from the "Non-negative Reward Trap": within a batch of non-negative scores, partially correct outputs can remain above the batch mean, inducing the policy to plateau at sub-optimal solutions rather than achieving full correctness. These issues are particularly severe in LLM-based reasoning tasks requiring multi-step logic and verification (Liu et al., 6 Jan 2026).
The Split Q-Learning framework addresses similar concerns from a behavioral perspective, proposing dual reward streams for separately tracking gains and losses—modeling neurobiological observations of distinct positive and negative learning pathways (Lin et al., 2019).
2. Formal Definition and Mathematical Framework
UltraLogic BFR Definition
Given a candidate answer $\hat{y}$ and a task-specific correctness score $s(\hat{y}) \in [0, 1]$, the BFR mechanism defines the scalar reward:

$$r(\hat{y}) = \begin{cases} +1, & \text{if } s(\hat{y}) = 1, \\ s(\hat{y}) - 1, & \text{if } s(\hat{y}) < 1. \end{cases}$$

This mapping generates a reward signal in $[-1, +1]$, sharply penalizing any imperfection and distinctly rewarding only fully correct solutions.
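A minimal sketch of this mapping in Python (the function name is illustrative, not taken from the paper's codebase):

```python
def bipolar_float_reward(s: float) -> float:
    """Map a task-specific correctness score s in [0, 1] to a bipolar reward.

    Fully correct answers (s == 1) receive +1; any imperfection is mapped to a
    negative value proportional to the correctness gap (s - 1).
    """
    if not 0.0 <= s <= 1.0:
        raise ValueError("correctness score must lie in [0, 1]")
    return 1.0 if s == 1.0 else s - 1.0


# Example: a partially correct answer is penalized, not mildly rewarded.
print(bipolar_float_reward(1.0))   # +1.0  (fully correct)
print(bipolar_float_reward(0.8))   # -0.2  (small logical imperfection)
print(bipolar_float_reward(0.0))   # -1.0  (completely incorrect)
```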
During Group Relative Policy Optimization (GRPO) updates, rewards propagate into the normalized advantage estimator:

$$\hat{A}_i = \frac{r_i - \bar{r}}{\sqrt{\tfrac{1}{G}\sum_{j=1}^{G}(r_j - \bar{r})^2} + \epsilon}, \qquad \bar{r} = \frac{1}{G}\sum_{j=1}^{G} r_j,$$

where $G$ is the batch (group) size, $\epsilon$ ensures numerical stability, and $\bar{r}$ is the batch mean.
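The NumPy sketch below (names are illustrative) shows why the sign of the reward matters inside this normalization: under a non-negative graded scheme the partially correct rollout with $s = 0.9$ sits above the batch mean and receives a positive advantage, while under BFR the same rollout falls below the mean and is penalized.

```python
import numpy as np

def group_advantages(rewards, eps=1e-8):
    """GRPO-style group-normalized advantages: (r - mean) / (std + eps)."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

scores = np.array([1.0, 0.9, 0.6, 0.2])                # correctness scores of 4 rollouts

graded  = scores                                        # non-negative graded float rewards
bipolar = np.where(scores == 1.0, 1.0, scores - 1.0)    # BFR mapping

print(group_advantages(graded))   # the s=0.9 rollout gets a *positive* advantage
print(group_advantages(bipolar))  # the same rollout lands *below* the batch mean
```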
Split Q-Learning Bipolar Reward Structure
In Split Q-Learning, each reward observation $r_t$ is decomposed into gain and loss components:

$$r_t^+ = \max(r_t, 0), \qquad r_t^- = \min(r_t, 0).$$

Positive and negative returns are tracked in separate Q-tables:
- $Q^+$ for gains,
- $Q^-$ for losses.

The update rules are:

$$Q^+_{t+1}(s_t, a_t) = \lambda^+ Q^+_t(s_t, a_t) + \alpha\big(w^+ r_t^+ + \gamma \max_{a'} Q^+_t(s_{t+1}, a') - Q^+_t(s_t, a_t)\big),$$
$$Q^-_{t+1}(s_t, a_t) = \lambda^- Q^-_t(s_t, a_t) + \alpha\big(w^- r_t^- + \gamma \max_{a'} Q^-_t(s_{t+1}, a') - Q^-_t(s_t, a_t)\big).$$

The action value for decision-making is fused:

$$a_t = \arg\max_{a}\big[Q^+_t(s_t, a) + Q^-_t(s_t, a)\big].$$

Hyperparameters $\lambda^{\pm}$ (memory-retention) and $w^{\pm}$ (reward-sensitivity) enable parametric control over agent behavior (Lin et al., 2019).
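A compact tabular sketch of these updates (class name and default parameter values are illustrative placeholders, not settings from Lin et al.):

```python
import numpy as np

class SplitQLearner:
    """Tabular Split Q-Learning sketch: separate Q-tables for gains and losses."""

    def __init__(self, n_states, n_actions, alpha=0.1, gamma=0.95,
                 lam_pos=1.0, lam_neg=1.0, w_pos=1.0, w_neg=1.0):
        self.q_pos = np.zeros((n_states, n_actions))   # Q+ (gain stream)
        self.q_neg = np.zeros((n_states, n_actions))   # Q- (loss stream)
        self.alpha, self.gamma = alpha, gamma
        self.lam_pos, self.lam_neg = lam_pos, lam_neg  # memory-retention factors
        self.w_pos, self.w_neg = w_pos, w_neg          # reward-sensitivity weights

    def act(self, s):
        # Decisions use the fused value Q+ + Q-.
        return int(np.argmax(self.q_pos[s] + self.q_neg[s]))

    def update(self, s, a, r, s_next):
        r_pos, r_neg = max(r, 0.0), min(r, 0.0)        # bipolar decomposition
        # Gain stream: decayed old estimate plus a TD step driven only by r+.
        target_pos = self.w_pos * r_pos + self.gamma * self.q_pos[s_next].max()
        self.q_pos[s, a] = (self.lam_pos * self.q_pos[s, a]
                            + self.alpha * (target_pos - self.q_pos[s, a]))
        # Loss stream: same form, driven only by the negative component r-.
        target_neg = self.w_neg * r_neg + self.gamma * self.q_neg[s_next].max()
        self.q_neg[s, a] = (self.lam_neg * self.q_neg[s, a]
                            + self.alpha * (target_neg - self.q_neg[s, a]))
```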
3. Graded Penalty and Correctness Schemes
UltraLogic supports four task-specific mappings for the correctness score $s \in [0, 1]$:
- Accuracy: Fraction of exact (sub)items correctly answered.
- F1-Score: For set/list outputs, $s = \mathrm{F1} = \frac{2PR}{P + R}$ with precision $P$ and recall $R$.
- Similarity: Normalized edit-distance or embedding similarity in $[0, 1]$.
- Absolute Difference Rate: For numeric tasks, $s = 1 - |\hat{y} - y| / |y|$ for prediction $\hat{y}$ and ground truth $y$, clamped to $[0, 1]$.
Imperfect answers, such as partial overlap in F1 tasks, receive a continuous penalty: e.g. an F1 score of $s < 1$ yields $r = s - 1$, making the penalization directly proportional to the precision-recall gap. This scheme provides structured negative feedback that reflects gradations of incorrectness (Liu et al., 6 Jan 2026).
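Illustrative implementations of the four mappings (helper names, the zip-based accuracy, and the edit-distance proxy are assumptions for this sketch; the paper's exact scoring code may differ):

```python
from difflib import SequenceMatcher

def accuracy_score(pred_items, gold_items):
    """Fraction of gold (sub)items answered exactly."""
    if not gold_items:
        return 1.0
    return sum(p == g for p, g in zip(pred_items, gold_items)) / len(gold_items)

def f1_score(pred_set, gold_set):
    """F1 = 2PR / (P + R) for set/list outputs."""
    pred_set, gold_set = set(pred_set), set(gold_set)
    tp = len(pred_set & gold_set)
    if tp == 0 or not pred_set or not gold_set:
        return 0.0
    precision, recall = tp / len(pred_set), tp / len(gold_set)
    return 2 * precision * recall / (precision + recall)

def similarity_score(pred_text, gold_text):
    """Normalized string similarity in [0, 1] (edit-distance-style proxy)."""
    return SequenceMatcher(None, pred_text, gold_text).ratio()

def abs_diff_rate(pred_value, gold_value, eps=1e-8):
    """1 minus the relative error, clamped to [0, 1], for numeric answers."""
    rel_err = abs(pred_value - gold_value) / max(abs(gold_value), eps)
    return min(max(1.0 - rel_err, 0.0), 1.0)
```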
4. Integration in Training and Optimization Protocols
BFR integrates directly into GRPO-based policy optimization:
- Learning Rate:
- Rollout Length: 16
- Max Response Tokens: 32,768
- Temperature/Top-$p$: 1.0
- Formatting Bonus: (logical and format rewards kept distinct)
- Difficulty-Matching: Examples are sampled according to capability-aligned difficulty, typically those yielding 40-60% success for the base model.
BFR operates orthogonally to difficulty sampling: after task selection, its reward is computed via the bipolar float mapping. No additional hyperparameters are introduced, and the system replaces standard binary ($\{0, 1\}$) or graded non-negative ($[0, 1]$) reward signals with the bipolar $[-1, +1]$ structure.
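A brief sketch of how the two pieces compose (task names, pass-rate estimates, and function names are hypothetical):

```python
def select_training_pool(tasks, pass_rates, low=0.40, high=0.60):
    """Difficulty matching: keep tasks whose estimated base-model pass rate
    falls in the capability-aligned band [low, high] (e.g. 40-60%)."""
    return [t for t, p in zip(tasks, pass_rates) if low <= p <= high]

def rollout_reward(s: float) -> float:
    """BFR is applied only after a task has been selected; the mapping itself
    introduces no additional hyperparameters."""
    return 1.0 if s == 1.0 else s - 1.0

# Example: three candidate tasks with estimated base-model pass rates.
pool = select_training_pool(["taskA", "taskB", "taskC"], [0.15, 0.45, 0.90])
print(pool)                 # ['taskB'] -- only the capability-matched task survives
print(rollout_reward(0.75)) # -0.25 for a partially correct rollout on that task
```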
In Split Q-Learning, the bias parameters $\lambda^{\pm}$ and $w^{\pm}$ allow modeling of gain/loss sensitivity analogous to psychological phenomena, with direct control over the agent's reward-processing style (Lin et al., 2019).
5. Empirical Results and Analytical Findings
UltraLogic experiments demonstrate:
Accuracy Gains (Qwen3-8B, 2 epochs, easy-only data):
| Scheme | AIME24 | AIME25 | HMMT25 | BBH | BBEH | ARC-AGI |
|---|---|---|---|---|---|---|
| Binary (0/1) | 81.7 | 69.1 | 52.3 | 90.2 | 31.1 | 4.6 |
| Graded Float | 76.9 | 66.3 | 53.0 | 90.4 | 31.0 | 4.3 |
| Bipolar Float | 82.6 | 71.3 | 56.6 | 91.1 | 32.5 | 4.7 |
- BFR delivers marked improvements over baseline and graded methods.
- Training convergence is more stable and accelerates (≈20% faster mean-score ascent).
- Model-size and difficulty ablations show BFR+matching can yield up to +4 percentage points on strong benchmarks (Liu et al., 6 Jan 2026).
A plausible implication is that BFR's sharp negative penalty structure prevents models from stabilizing at partially correct modes and incentivizes exploration toward strict correctness, as supported by reduced training noise and higher optima.
Split Q-Learning analysis indicates that bias tuning in bipolar float channels models a broad range of behavioral reward-processing profiles and preserves convergence guarantees (Lin et al., 2019).
6. Practical Limits, Considerations, and Extensions
- Data Quality Sensitivity: Negative penalties amplify the risk posed by annotation errors or flawed solution logic, which may destabilize policy optimization. UltraLogic therefore enforces complete annotation and validation of training data prior to training.
- Reward Scaling: The linear penalty mapping is heuristic and not universally optimal; future exploration may involve a dynamic penalty slope for imperfect scores ($s < 1$), positive intermediate tiers, or blending with step-level reward models.
- Dynamic Adaptation: Possible extensions include softening the reward cliff near perfect answers, task-adaptive scaling, or using trust-region constraints; a hypothetical softened mapping is sketched below.
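A hypothetical illustration of the "softened cliff" idea (the `knee` parameter and the linear ramp are assumptions, not proposals from the paper):

```python
def softened_bfr(s: float, knee: float = 0.95) -> float:
    """Hypothetical softened bipolar mapping (illustration only, not from the paper).

    Below `knee` the usual s - 1 penalty applies; above it the reward ramps
    linearly from (knee - 1) up to +1, so there is no hard cliff at s == 1.
    """
    if s < knee:
        return s - 1.0                                            # standard BFR penalty region
    return (knee - 1.0) + (s - knee) * (2.0 - knee) / (1.0 - knee)  # smooth ramp to +1
```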
For Split Q-Learning, bias selection enables controlled replication of a spectrum of cognitive reward-processing modes (loss-aversion, reward-seeking, fast-forgetting), with direct correspondence to neurobiological research.
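As a purely illustrative sketch, hypothetical $(\lambda^{\pm}, w^{\pm})$ presets (values not taken from Lin et al.) can encode such profiles and plug into the tabular learner sketched earlier:

```python
# Hypothetical bias presets; the numbers are illustrative only.
REWARD_PROFILES = {
    "balanced":        dict(lam_pos=1.0, lam_neg=1.0, w_pos=1.0, w_neg=1.0),
    "loss_averse":     dict(lam_pos=1.0, lam_neg=1.0, w_pos=1.0, w_neg=2.0),  # losses weighted more
    "reward_seeking":  dict(lam_pos=1.0, lam_neg=1.0, w_pos=2.0, w_neg=1.0),  # gains weighted more
    "fast_forgetting": dict(lam_pos=0.5, lam_neg=0.5, w_pos=1.0, w_neg=1.0),  # rapid memory decay
}

# These dictionaries plug into the SplitQLearner sketch above, e.g.:
# learner = SplitQLearner(n_states, n_actions, **REWARD_PROFILES["loss_averse"])
```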
7. Contextual Significance and Cross-Domain Connections
BFR represents a shift toward highly informative, graded policy feedback that encourages global optimality in multi-step logic tasks. The bipolar decomposition in Split Q-Learning establishes a framework compatible with behavioral and psychiatric modeling, enabling AI agents to adapt to disparate reward environments and real-world bias characteristics.
The convergence of BFR and split reward channels reflects broader trends toward adaptive, interpretable RL reward processing and highlights the role of penalty structure in scaling reasoning performance in large-scale models (Liu et al., 6 Jan 2026, Lin et al., 2019).