Fixed-effect Semantic Reward (FSR)
- Fixed-effect Semantic Reward (FSR) is a reward modulation scheme that integrates noisy market returns with a constant semantic bonus to stabilize reasoning in RL agents.
- It employs a triangular consistency metric to assess factuality, deduction, and consistency, ensuring high semantic alignment during decision-making.
- Empirical studies in the China A-Share market show improved returns and drastically reduced hallucination rates compared to market-only strategies.
Fixed-effect Semantic Reward (FSR) is a reward modulation scheme introduced in the Trade-R1 framework to address the challenge of training reinforcement learning (RL) agents, specifically LLMs, for financial decision-making in stochastic environments. FSR augments raw, noisy market returns with a constant semantic alignment incentive computed via a structured process-level reasoning verification metric. This approach aims to stabilize reasoning quality by decoupling semantic fidelity from return volatility, preserving reasoning alignment even under adverse market conditions (Sun et al., 7 Jan 2026).
1. Formal Definition and Mathematical Formulation
FSR operates in a contextual RL setting, formalized as follows. Let $x$ denote the market context, comprising financial news and data, and let $\pi_\theta$ be a policy that generates a reasoning chain $c$ and a final decision $d$. The observed market reward is given by $r$, typically a multi-day excess return from historical backtesting. A semantic similarity score $s$, computed via the triangular consistency metric, quantifies the alignment between evidence, reasoning, and decision.
FSR is defined by the reward integration function:
$$G(r, s) = r + \lambda s,$$
where $\lambda$ is a fixed hyperparameter controlling the weight of semantic alignment. In Trade-R1, $\lambda$ is set to $2$, yielding
$$G = r + 2s.$$
The RL objective becomes maximizing the expectation:
$$J(\theta) = \mathbb{E}_{x,\; (c, d) \sim \pi_\theta(\cdot \mid x)}\left[\, r + 2s \,\right].$$
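The reward described above adds a fixed multiple of the semantic score (2 in Trade-R1) to the market return. A minimal Python sketch of that rule follows. The function names, the use of fractional returns (0.02 meaning 2%), and the sample-mean estimate of the objective are illustrative assumptions, not details taken from Trade-R1.

```python
from statistics import fmean

# Weight of the semantic term; 2 is the value reported for Trade-R1.
SEMANTIC_WEIGHT = 2.0


def fsr_reward(market_return, semantic_score, weight=SEMANTIC_WEIGHT):
    """Fixed-effect reward: market return plus a fixed multiple of the semantic score."""
    return market_return + weight * semantic_score


def estimate_objective(samples, weight=SEMANTIC_WEIGHT):
    """Sample-mean estimate of the expected reward.

    `samples` is an iterable of (market_return, semantic_score) pairs
    (a hypothetical input format chosen for this sketch).
    """
    return fmean(fsr_reward(r, s, weight) for r, s in samples)
```

For instance, `fsr_reward(-0.03, 0.9)` is about 1.77: a losing trade backed by well-grounded reasoning still earns the full semantic bonus, which is the sense in which the incentive is independent of the return's sign.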
2. Triangular Consistency Metric and Semantic Scoring
FSR leverages a triangular consistency metric to compute the semantic similarity score $s$. For each sample, a concise evidence snippet $E$ relevant to the stock pick is retrieved. An LLM-based judge then scores the following terms between the evidence $E$, the reasoning chain $c$, and the decision $d$, each in $[0, 1]$:
- Factuality: $S_{E \leftrightarrow c}$ (“Does the reasoning $c$ faithfully reflect the facts in $E$?”)
- Deduction: $S_{c \leftrightarrow d}$ (“Does the final decision $d$ logically follow from $c$?”)
- Consistency: $S_{E \leftrightarrow d}$ (“Is $d$ supported by $E$?”)
The overall semantic score is the arithmetic mean:
$$s = \frac{1}{3}\left(S_{E \leftrightarrow c} + S_{c \leftrightarrow d} + S_{E \leftrightarrow d}\right).$$
In FSR, this score is scaled by the fixed coefficient (2 in Trade-R1) and added to the raw market return, providing a constant step-wise semantic incentive regardless of return magnitude or sign.
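The averaging step can be sketched in a few lines of Python. The class name, field names, and the check that each judge score lies in the unit interval are assumptions made for illustration. The retrieval and LLM-judge calls that would produce the three scores are deliberately left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TriangleScores:
    """Three pairwise judge scores (hypothetical container, not from Trade-R1)."""

    factuality: float   # evidence vs. reasoning
    deduction: float    # reasoning vs. decision
    consistency: float  # evidence vs. decision

    def __post_init__(self):
        # Assumption: every judge score is reported on a 0-to-1 scale.
        for name in ("factuality", "deduction", "consistency"):
            value = getattr(self, name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must be in [0, 1], got {value}")

    def semantic_score(self):
        """Arithmetic mean of the three edges of the triangle."""
        return (self.factuality + self.deduction + self.consistency) / 3.0
```

Because the mean weights all three edges equally, a chain that is factual but does not support its own decision is penalized as much as one that misstates the evidence.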
3. Rationale for Fixed-Effect Formulation
FSR's fixed-effect design is motivated by multiple factors:
- Stability under Volatility: Financial returns are noisy, which can obscure semantic alignment signals. The fixed bonus ensures the agent consistently receives semantic grounding encouragement, even when the market return $r$ is small or highly volatile.
- Decoupled Alignment Gradient: The policy gradient separates into a market term $\nabla_\theta \mathbb{E}[r]$ (the traditional policy gradient) and a semantic term $\nabla_\theta \mathbb{E}[s]$ scaled by a constant, preventing market noise from distorting the semantic alignment update.
- Simplicity: A single hyperparameter controls the trade-off between economic returns and reasoning fidelity. There is no conditional scaling or piecewise treatment based on the return $r$ or the semantic score $s$.
A plausible implication is that this simplicity makes the system easier to tune and analyze, but may limit expressiveness in settings where adaptivity is required.
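The decoupling claim can be made concrete with the standard score-function (REINFORCE) identity. The following is a sketch under simplifying assumptions rather than the paper's own derivation: $\lambda$ stands for the fixed semantic coefficient (2 in Trade-R1), $y$ for the full rollout (reasoning chain plus decision), and group normalization and PPO clipping are ignored.

$$\nabla_\theta \, \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\left[\, r + \lambda s \,\right] = \mathbb{E}\left[\, r \, \nabla_\theta \log \pi_\theta(y \mid x) \,\right] + \lambda \, \mathbb{E}\left[\, s \, \nabla_\theta \log \pi_\theta(y \mid x) \,\right]$$

By linearity of expectation, the weight on the semantic term does not depend on the size or sign of $r$. Once advantages are normalized within a group, both terms share a common scale, so the separation is exact only for the unnormalized objective.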
4. Integration with Trade-R1 RL Workflow
FSR is incorporated into Trade-R1 via Group Relative Policy Optimization (GRPO), a grouped-sampling variant of Proximal Policy Optimization (PPO). The training workflow proceeds as follows:
    Initialize policy πθ
    for each training batch do
        sample a batch of contexts {xᵢ}
        for each xᵢ:
            generate K rollouts yᵢ₁,…,yᵢK ∼ πθ_old(·|xᵢ)
            for each sample yᵢⱼ:
                compute market return rᵢⱼ (backtest)
                retrieve evidence Eᵢⱼ via embedding & reranking
                score (S_{E↔c}, S_{c↔d}, S_{E↔d}) with LLM judge → sᵢⱼ = (S_{E↔c} + S_{c↔d} + S_{E↔d})/3
                Gᵢⱼ = rᵢⱼ + 2·sᵢⱼ    // fixed-effect semantic reward
            end
            compute group-normalized advantages:
                Aᵢⱼ = (Gᵢⱼ − meanₖ Gᵢₖ) / stdₖ Gᵢₖ
        end
        take PPO-style gradient step on θ using {Aᵢⱼ, log πθ(yᵢⱼ|xᵢ)}
    end
Key points:
- The additive term applies uniformly, irrespective of return volatility or sign.
- Group normalization, applied when computing advantages, mitigates non-stationary trends in market returns and keeps updates stable.
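The reward and advantage steps of this workflow can be sketched for a single context as follows. This is an illustration under stated assumptions, not the Trade-R1 implementation: the function names are hypothetical, the population standard deviation is used, and a small `eps` is added to the denominator to guard against groups whose rewards are all equal.

```python
from statistics import fmean, pstdev


def fsr_rewards(returns, scores, weight=2.0):
    """Combine per-rollout market returns and semantic scores into rewards."""
    if len(returns) != len(scores):
        raise ValueError("returns and scores must have the same length")
    return [r + weight * s for r, s in zip(returns, scores)]


def group_normalized_advantages(rewards, eps=1e-8):
    """Centre and scale the rewards of one rollout group.

    Assumption: population standard deviation plus `eps`; the source does
    not say which estimator or stabilizer is used.
    """
    mean = fmean(rewards)
    std = pstdev(rewards)
    return [(g - mean) / (std + eps) for g in rewards]


# Four hypothetical rollouts for one context.
rewards = fsr_rewards([0.04, -0.02, 0.01, -0.03], [0.9, 0.3, 0.8, 0.5])
advantages = group_normalized_advantages(rewards)
```

Because the normalization is applied to the combined reward, a rollout's advantage reflects how both its return and its semantic score compare with the rest of its group.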
5. Empirical Performance and Illustrative Case Study
FSR's efficacy is demonstrated in the China A-Share market (July–October 2025):
- Market-Only Policy: Cumulative return 37.62%; Sharpe ratio 3.028; Semantic similarity $0.437$; Hallucination rate $0.225$.
- FSR-Augmented Policy: Cumulative return 39.38%; Sharpe ratio 3.065; Mean triangle score (semantic similarity) $0.956$; Hallucination rate $0.004$.
The net asset value (NAV) curve under FSR remains consistently above that of the market-only policy, with markedly higher reasoning fidelity and far fewer hallucinations (unsupported reasoning chains). This suggests FSR is highly effective for in-distribution semantic grounding.
6. Benefits, Limitations, and Cross-Market Observations
Benefits:
- Strong, stable in-distribution returns, paired with extremely high semantic alignment (semantic similarity of $0.956$ on the China A-Share test set).
- The constant semantic term prevents the model from disregarding reasoning integrity in periods of low or negative returns.
Limitations:
- The additive semantic reward can be under-weighted when the market return $r$ dominates the total reward: models may exploit this by lowering the semantic score $s$ to reduce cumulative penalties, impairing out-of-distribution robustness.
- Reduced cross-market generalization: In US market tests, FSR yields lower returns than both Market-Only and Dynamic-effect Semantic Reward (DSR), coupled with only moderate gains in reasoning consistency.
- FSR does not suppress excess variance from noisy positive returns, in contrast to DSR's adaptive regularization.
Empirical Comparison Table
| Setting | Strategy | Cumulative Return (%) | Sharpe Ratio | Semantic Similarity | Hallucination Rate |
|---|---|---|---|---|---|
| China A-Share (test) | Market-Only | 37.62 | 3.028 | 0.437 | 0.225 |
| China A-Share (test) | FSR | 39.38 | 3.065 | 0.956 | 0.0039 |
| US Market (OOD) | Market-Only | 12.63 | 1.712 | 0.659 | 0.141 |
| US Market (OOD) | FSR | 11.40 | 1.473 | 0.758 | 0.092 |
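To make the in-distribution versus out-of-distribution contrast explicit, the table's figures can be compared directly. The short Python sketch below does only that arithmetic. The dictionary layout and function name are illustrative, and the numbers are copied from the table.

```python
# (cumulative return %, Sharpe ratio, semantic similarity, hallucination rate)
RESULTS = {
    "China A-Share (test)": {
        "Market-Only": (37.62, 3.028, 0.437, 0.225),
        "FSR": (39.38, 3.065, 0.956, 0.0039),
    },
    "US Market (OOD)": {
        "Market-Only": (12.63, 1.712, 0.659, 0.141),
        "FSR": (11.40, 1.473, 0.758, 0.092),
    },
}


def compare(setting):
    """Differences between FSR and the market-only baseline in one setting."""
    base_ret, _, base_sim, base_hall = RESULTS[setting]["Market-Only"]
    fsr_ret, _, fsr_sim, fsr_hall = RESULTS[setting]["FSR"]
    return {
        "return_delta_pp": fsr_ret - base_ret,            # percentage points
        "similarity_delta": fsr_sim - base_sim,
        "hallucination_reduction": 1.0 - fsr_hall / base_hall,  # relative
    }


china = compare("China A-Share (test)")
us = compare("US Market (OOD)")
```

On these figures, FSR cuts the hallucination rate by roughly 98% in-distribution while adding about 1.76 percentage points of return. On the US market the reduction is roughly 35% and the return is about 1.23 percentage points lower.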
A plausible implication is that FSR is best suited to environments where high reasoning fidelity is prioritized and the reward distribution is not adversarial; its lack of adaptivity may constrain performance in more volatile or distribution-shifting markets.
7. Significance and Theoretical Implications
Fixed-effect Semantic Reward establishes a straightforward mechanism for enforcing reasoning integrity in RL agents operating under noisy reward regimes. By building process-level verification into the reward, it supports semantic alignment at each decision step, independent of fluctuations in the economic reward. Its major trade-off is simplicity versus adaptivity: FSR delivers high in-distribution fidelity but falls short as environments change or stochasticity increases. Dynamic-effect alternatives, such as DSR, may generalize more robustly and regularize variance, though at the cost of greater operational complexity.
FSR marks an important step in integrating retrieval-based reasoning verification with RL reward design, especially for high-stakes, stochastic applications such as financial asset selection (Sun et al., 7 Jan 2026).