Dynamic-effect Semantic Reward (DSR)
- The paper introduces DSR, a mechanism that dynamically modulates rewards by coupling market returns with semantic alignment to suppress reward hacking.
- It employs a triangular consistency metric and evidence retrieval to compute a semantic score that adaptively scales positive rewards and deepens penalties for negative outcomes.
- Empirical results show DSR improves reasoning fidelity and cross-market generalization, offering significant variance reduction over traditional reward schemes.
Dynamic-effect Semantic Reward (DSR) is a reward integration mechanism designed for reinforcement learning (RL) in stochastic environments, specifically targeting the suppression of reward hacking and the amplification of genuine reasoning-driven behaviors in LLMs. As a core component of the Trade-R1 framework, DSR facilitates robust process-level reasoning verification by coupling the magnitude of inherently noisy market returns with a learned semantic alignment score, thereby gating policy optimization on both financial outcome and reasoning fidelity (Sun et al., 7 Jan 2026).
1. Formal Definition
The Dynamic-effect Semantic Reward (DSR), denoted $R_{\mathrm{DSR}}(r, s)$, jointly incorporates the observed market reward $r$ (e.g., 10-day excess return) and a semantic similarity score $s \in [0, 1]$. The score is computed via a triangular consistency metric assessing the pairwise alignment between retrieved evidence, reasoning chains, and decisions. DSR is defined by a piecewise function:

$$
R_{\mathrm{DSR}}(r, s) =
\begin{cases}
(0.5 + s)\, r, & r > 0 \\
(2 - s)\, r, & r \le 0
\end{cases}
$$

This structure enables multiplicative modulation of the reward: positive returns are amplified in proportion to reasoning consistency (from $0.5\times$ at $s = 0$ up to $1.5\times$ at $s = 1$), while negative returns incur greater penalties when semantic alignment is low (up to $2\times$ at $s = 0$).
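A minimal Python sketch of this piecewise rule, using the coefficients implied by the scaling properties cited in §§4–5; the function name and the folding of $r = 0$ into the penalty branch are illustrative choices, not the paper’s reference implementation.

```python
def dsr_reward(r: float, s: float) -> float:
    """Dynamic-effect Semantic Reward (sketch).

    r: observed market return (e.g., 10-day excess return), possibly noisy
    s: semantic similarity score in [0, 1] from triangular consistency scoring
    """
    if r > 0:
        # Gains scale from 0.5x (s = 0, spurious profit) to 1.5x (s = 1).
        return (0.5 + s) * r
    # Losses scale from 1x (s = 1) to 2x (s = 0, hallucinated loss).
    return (2.0 - s) * r
```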
2. Derivation and Integration Within Trade-R1
DSR is embedded in the Trade-R1 RL pipeline, which employs Group Relative Policy Optimization (GRPO) to train a policy $\pi_\theta$ on context–return pairs. The implementation sequence is as follows:
- Evidence Retrieval: For each rollout (reasoning chain $c$ and decision $a$), a concise evidence set $E$ is extracted using string matching and semantic-embedding ranking (§3.5.1).
- Triangular Consistency Scoring: Three LLM-Judge scores are computed:
  - $s_{E \to c}$: factuality of $c$ given $E$
  - $s_{c \to a}$: logical deduction of $a$ from $c$
  - $s_{E \to a}$: consistency of $a$ with $E$

  The semantic similarity score is then $s = \tfrac{1}{3}\left(s_{E \to c} + s_{c \to a} + s_{E \to a}\right)$.
- Reward Fusion: DSR is directly applied: $R = R_{\mathrm{DSR}}(r, s)$ (§3.3.2).
- Advantage Estimation: Rollouts are grouped, yielding normalized advantages $\hat{A}_i = (R_i - \mu_{\mathrm{group}}) / \sigma_{\mathrm{group}}$.
- Policy Update: The GRPO surrogate objective is maximized with these normalized advantages.
DSR’s gating of policy gradients by both return magnitude and reasoning consistency distinguishes it from conventional RL reward structures in stochastic domains.
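A sketch of the retrieval–scoring–fusion sequence, assuming the semantic score is the mean of the three pairwise judge scores; `judge_score` and its criterion labels are hypothetical stand-ins for the paper’s LLM-Judge interface.

```python
from statistics import mean

def triangular_consistency(evidence: str, reasoning: str, decision: str,
                           judge_score) -> float:
    """Mean of three pairwise LLM-Judge scores in [0, 1] (sketch).

    judge_score(criterion, claim, basis) -> float is a hypothetical
    stand-in for the paper's LLM-Judge calls.
    """
    s_ec = judge_score("factuality", reasoning, evidence)   # c grounded in E
    s_ca = judge_score("deduction", decision, reasoning)    # a follows from c
    s_ea = judge_score("consistency", decision, evidence)   # a consistent with E
    return mean([s_ec, s_ca, s_ea])

# Reward fusion (§3.3.2): couple the market return with the semantic score.
# R = dsr_reward(r, triangular_consistency(E, c, a, judge_score))
```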
3. Training Pipeline in Stochastic Environments
The integration of DSR within RL for stochastic environments involves the following algorithmic steps per training iteration:
- Batch Sampling: Sample a batch of contexts $\{x_b\}_{b=1}^{B}$.
- Rollout Generation: For each $x_b$, generate $G$ rollouts $\{(c_i, a_i)\}_{i=1}^{G}$.
- Reward Computation:
  - Execute action $a_i$ to obtain the market return $r_i$.
  - Compute $s_i$ via evidence retrieval and triangular scoring.
  - Apply $R_i = R_{\mathrm{DSR}}(r_i, s_i)$.
- Group Normalization: Calculate advantages $\hat{A}_i$ across each group.
- Policy Optimization: Update $\theta$ by maximizing the aggregated PPO objective.
This pipeline ensures each policy update is modulated by both market outcomes and process-level semantic verification.
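The loop below sketches one such iteration, reusing the `dsr_reward` and `triangular_consistency` sketches above; `policy.generate`, `execute`, `retrieve_evidence`, and `grpo_update` are hypothetical interfaces, not the paper’s code.

```python
import numpy as np

def training_iteration(policy, contexts, judge_score, execute,
                       retrieve_evidence, grpo_update,
                       group_size=8, eps=1e-8):
    """One DSR-modulated GRPO iteration (sketch; callables are hypothetical)."""
    for x in contexts:                                   # batch sampling
        rollouts = [policy.generate(x) for _ in range(group_size)]
        rewards = []
        for c, a in rollouts:                            # reward computation
            r = execute(a)                               # noisy market return
            s = triangular_consistency(retrieve_evidence(x, c), c, a,
                                       judge_score)
            rewards.append(dsr_reward(r, s))
        rewards = np.asarray(rewards)
        # Group normalization: A_i = (R_i - mean) / (std + eps) within the group.
        advantages = (rewards - rewards.mean()) / (rewards.std() + eps)
        grpo_update(policy, x, rollouts, advantages)     # PPO-style clipped step
```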
4. Coupling Mechanism and Asymmetry
DSR distinguishes itself from the Fixed-effect Semantic Reward (FSR) through its dynamic, asymmetric coupling of reward magnitude and semantic alignment. While FSR simply adds a constant alignment bonus, $R_{\mathrm{FSR}} = r + \alpha s$ (with $\alpha$ fixed, e.g., $\alpha = 2$), DSR adaptively gates returns as follows:
- For $r > 0$: amplification by a factor $(0.5 + s)$, rewarding high-fidelity reasoning (i.e., up to $1.5\times$ at $s = 1$).
- For $r \le 0$: penalty scaling by a factor $(2 - s)$, imposing stricter penalties on low-alignment failures ($s \to 0$).
This asymmetric design prevents "penalty evasion," a phenomenon observed in symmetric schemes where noisy or hallucinated loss samples escape sufficient penalization.
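A quick check with the `dsr_reward` sketch above makes the contrast concrete on a hallucinated loss sample (numbers illustrative): under naïve multiplication $R = s \cdot r$, low alignment shrinks the penalty, whereas DSR amplifies it.

```python
r, s = -1.0, 0.1           # losing trade with near-zero semantic alignment
print(r * s)               # naive multiplication: -0.1 (penalty nearly evaded)
print(dsr_reward(r, s))    # DSR: (2 - 0.1) * -1.0 = -1.9 (penalty amplified)
```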
5. Theoretical Guarantees
The DSR mechanism possesses several theoretical advantages (§3.4):
- Variance Reduction for Spurious Profits: When $r > 0$ and the reasoning signal is negligible ($s \approx 0$), DSR applies a coefficient of $0.5$, reducing variance in the policy gradient by 75% compared to market-reward-only RL (see the derivation following this list).
- Signal Amplification for Valid Reasoning: For genuine, high-consistency reasoning ($r > 0$, $s \approx 1$), the $1.5\times$ scaling boosts signal-to-noise ratio above market-only baselines.
- Asymmetric Penalty Regularization: DSR enforces up to a $2\times$ penalty for negative, hallucinated outcomes with low $s$, preventing the accumulation of ungrounded losses.
These properties collectively constrain overfitting to stochastic reward noise and promote grounded, evidence-consistent reasoning.
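The 75% figure follows from the quadratic scaling of variance under a constant coefficient, assuming the policy-gradient contribution is linear in the reward:

$$
s \approx 0 \;\Rightarrow\; R_{\mathrm{DSR}} = (0.5 + s)\, r \approx 0.5\, r, \qquad \operatorname{Var}[0.5\, r] = 0.25\, \operatorname{Var}[r],
$$

i.e., a 75% reduction relative to using the raw market return alone.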
6. Comparative Analysis with FSR
The following table summarizes key contrasts between DSR and FSR in reward integration:
| Criterion | DSR | FSR |
|---|---|---|
| Functional Form | $(0.5 + s)\, r$ for $r > 0$; $(2 - s)\, r$ for $r \le 0$ | $r + \alpha s$ ($\alpha$ fixed) |
| Return Modulation | Adaptive, multiplicative, asymmetric | Constant additive |
| Penalty Structure | Amplifies penalties for low $s$ (double at $s = 0$) | Uniform bonus, no extra penalty |
| Suppression of Noisy Rewards | Yes (low $s$ yields strong down-weighting) | No |
| Amplification of High-Alignment Gains | Up to $1.5\times$ | Fixed $+\alpha s$ |
DSR’s adaptive scaling suppresses low-trust noise-driven profits more effectively, while FSR’s additive structure risks alignment-floor exploitation during market drawdowns.
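A worked example of the drawdown failure mode, using the `dsr_reward` sketch above and the cited $\alpha = 2$ for FSR (numbers illustrative):

```python
r, s, alpha = -1.0, 0.9, 2.0
print(r + alpha * s)       # FSR: +0.8 (a losing trade nets a positive reward)
print(dsr_reward(r, s))    # DSR: (2 - 0.9) * -1.0 = -1.1 (loss stays penalized)
```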
7. Empirical Performance and Ablation Findings
Empirical evaluation on the A-Share Market (July–Oct 2025) and the out-of-distribution US Market demonstrates DSR’s efficacy:
| Method | Cumulative Return | Sharpe | Similarity | Hallucination Rate |
|---|---|---|---|---|
| Market-Only | 37.62% | 3.028 | 0.437 | 22.5% |
| FSR | 39.38% | 3.065 | 0.956 | 0.39% |
| DSR | 37.76% | 3.036 | 0.974 | 0.12% |
In OOD US Market tests:
- Market-Only: 12.63% return, Sharpe 1.712, similarity 0.659
- FSR: 11.40% return, Sharpe 1.473, similarity 0.758
- DSR: 15.34% return, Sharpe 1.951, similarity 0.777
DSR produces the highest reasoning fidelity (similarity 0.974) and a minimal hallucination rate (0.12%), while nearly matching FSR’s in-distribution returns and exceeding them out-of-distribution. Notably, DSR generalizes best out-of-distribution, outperforming both Market-Only and FSR on both financial and reasoning quality metrics.
Ablation studies confirm the necessity of DSR’s asymmetric formulation. Naïve multiplication ($R = s \cdot r$) results in penalty evasion (similarity 0.578 on loss samples), whereas DSR maintains similarity 0.966. The effectiveness of the full two-stage process (retrieval-augmented evidence plus triangular scoring) is substantiated by sharp drops in cumulative return and similarity when it is replaced with a single-context judge.
8. Significance and Implications
Dynamic-effect Semantic Reward provides an empirically and theoretically grounded solution to the challenge of reward hacking in RL applications involving verifiable but noisy returns, such as financial decision-making. By coupling stochastic outcome magnitudes with process-level semantic verification via adaptive asymmetric modulation, DSR enforces grounded reasoning, suppresses spurious rewards, and supports robust cross-market generalization (Sun et al., 7 Jan 2026). A plausible implication is that similar asymmetric, semantically gated reward formulations may be applicable in other high-noise RL domains where reward hacking is prevalent.