
Dynamic-effect Semantic Reward (DSR)

Updated 8 January 2026
  • The paper introduces DSR, a mechanism that dynamically modulates rewards by coupling market returns with semantic alignment to suppress reward hacking.
  • It employs a triangular consistency metric and evidence retrieval to compute a semantic score that adaptively scales positive rewards and deepens penalties for negative outcomes.
  • Empirical results show DSR improves reasoning fidelity and cross-market generalization, offering significant variance reduction over traditional reward schemes.

Dynamic-effect Semantic Reward (DSR) is a reward integration mechanism designed for reinforcement learning (RL) in stochastic environments, specifically targeting the suppression of reward hacking and the amplification of genuine reasoning-driven behaviors in LLMs. As a core component of the Trade-R1 framework, DSR facilitates robust process-level reasoning verification by coupling the magnitude of inherently noisy market returns with a learned semantic alignment score, thereby gating policy optimization by both financial outcome and reasoning fidelity (Sun et al., 7 Jan 2026).

1. Formal Definition

The Dynamic-effect Semantic Reward (DSR), denoted $G(r, s)$, jointly incorporates the observed market reward $r \in \mathbb{R}$ (e.g., the 10-day excess return) and a semantic similarity score $s \in [0, 1]$. The score $s$ is computed via a triangular consistency metric assessing the pairwise alignment between retrieved evidence, reasoning chains, and decisions. DSR is defined by a piecewise function:

$$G(r, s) = \begin{cases} r \times (0.5 + s), & r > 0 \\ r \times (2 - s), & r \le 0 \end{cases}$$

This structure enables multiplicative modulation of the reward: positive returns are amplified in proportion to reasoning consistency, while negative returns incur greater penalties when semantic alignment is low.
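The piecewise rule maps directly to a few lines of code. The following is a minimal sketch in Python; the function name and signature are illustrative, not taken from the paper.

```python
def dsr(r: float, s: float) -> float:
    """Dynamic-effect Semantic Reward G(r, s).

    r: observed market reward, e.g., a 10-day excess return.
    s: semantic similarity score in [0, 1] from triangular consistency scoring.
    """
    if r > 0:
        return r * (0.5 + s)   # profits amplified in proportion to reasoning consistency
    return r * (2.0 - s)       # losses penalized more heavily when alignment is low
```

For example, `dsr(0.02, 0.9)` returns 0.028 (amplified), while `dsr(-0.02, 0.1)` returns -0.038 (deepened penalty).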

2. Derivation and Integration Within Trade-R1

DSR is embedded in the Trade-R1 RL pipeline, which employs Group Relative Policy Optimization (GRPO) to train a policy $\pi_\theta$ on context–return pairs. The implementation sequence is as follows:

  • Evidence Retrieval: For each rollout $y = (c, d)$ (reasoning chain $c$ and decision $d$), a concise evidence set $E$ is extracted using string matching and semantic-embedding ranking (§3.5.1).
  • Triangular Consistency Scoring: Three LLM-Judge scores are computed:

    1. $S_{E,c}$: factuality of $c$ given $E$
    2. $S_{c,d}$: logical deduction of $d$ from $c$
    3. $S_{E,d}$: consistency of $d$ with $E$

The semantic similarity score is then $s = \frac{S_{E,c} + S_{c,d} + S_{E,d}}{3}$.

  • Reward Fusion: DSR is directly applied: $G_i = G(r_i, s_i)$ (§3.3.2).

  • Advantage Estimation: Rollouts are grouped, yielding $A_i = \frac{G_i - \operatorname{mean}_j(G_j)}{\operatorname{std}_j(G_j)}$.

  • Policy Update: The GRPO surrogate objective is maximized with these normalized advantages.

DSR’s gating of policy gradients by both return magnitude and reasoning consistency distinguishes it from conventional RL reward structures in stochastic domains.
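The fusion and normalization steps admit a compact sketch, reusing the `dsr` function above. The helper names `semantic_score` and `group_advantages` are hypothetical and not the paper's API.

```python
import statistics

def semantic_score(s_ec: float, s_cd: float, s_ed: float) -> float:
    # Triangular consistency: mean of the three LLM-Judge scores
    # (evidence->chain factuality, chain->decision deduction, evidence->decision consistency).
    return (s_ec + s_cd + s_ed) / 3.0

def group_advantages(market_rewards, judge_triples, eps=1e-8):
    # Fuse each rollout's market return with its semantic score via DSR,
    # then normalize within the rollout group: A_i = (G_i - mean_j G_j) / std_j G_j.
    fused = [dsr(r, semantic_score(*t)) for r, t in zip(market_rewards, judge_triples)]
    mu, sd = statistics.mean(fused), statistics.pstdev(fused)
    return [(g - mu) / (sd + eps) for g in fused]
```

For instance, `group_advantages([0.03, -0.01, 0.00], [(0.9, 0.8, 0.85), (0.2, 0.3, 0.25), (0.5, 0.5, 0.5)])` yields one normalized advantage per rollout in the group.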

3. Training Pipeline in Stochastic Environments

The integration of DSR within RL for stochastic environments involves the following algorithmic steps per training iteration:

  1. Batch Sampling: Sample $N$ contexts $\{x_n\}$.

  2. Rollout Generation: For each $x_n$, generate $G$ rollouts $y_{n,1\ldots G} \sim \pi_{\theta_\text{old}}$.

  3. Reward Computation:

    • Execute action $d$ to obtain $r_{n,g}$.
    • Compute $s_{n,g}$ via evidence retrieval and triangular scoring.
    • Apply $G_{n,g} = G(r_{n,g}, s_{n,g})$.
  4. Group Normalization: Calculate advantages $A_{n,1\ldots G}$ across each group.
  5. Policy Optimization: Update $\theta$ by maximizing the aggregated PPO objective.

This pipeline ensures each policy update is modulated by both market outcomes and process-level semantic verification.
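The per-iteration loop can be sketched as follows, building on the helpers defined earlier. All callable arguments (`sample_contexts`, `rollout_old_policy`, `execute`, `retrieve_evidence`, `judge`, `grpo_update`) are hypothetical stand-ins for the corresponding Trade-R1 components, not the authors' actual interfaces.

```python
def train_iteration(sample_contexts, rollout_old_policy, execute, retrieve_evidence,
                    judge, grpo_update, n_contexts=8, n_rollouts=4):
    """One DSR-gated training iteration following steps 1-5 above."""
    for x in sample_contexts(n_contexts):                            # 1. batch sampling
        rollouts, market_rewards, judge_triples = [], [], []
        for _ in range(n_rollouts):                                  # 2. rollouts from the old policy
            chain, decision = rollout_old_policy(x)
            rollouts.append((chain, decision))
            market_rewards.append(execute(decision))                 # 3a. realized return r_{n,g}
            evidence = retrieve_evidence(x, chain)
            judge_triples.append(judge(evidence, chain, decision))   # 3b. (S_Ec, S_cd, S_Ed)
        advantages = group_advantages(market_rewards, judge_triples) # DSR fusion + 4. normalization
        grpo_update(x, rollouts, advantages)                         # 5. maximize the surrogate objective
```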

4. Coupling Mechanism and Asymmetry

DSR distinguishes itself from the Fixed-effect Semantic Reward (FSR) through its dynamic, asymmetric coupling of reward magnitude and semantic alignment. While FSR simply adds a constant alignment bonus, $G_\text{FSR}(r, s) = r + \alpha s$ (with $\alpha$ fixed, e.g., $\alpha = 2$), DSR adaptively gates returns as follows:

  • For $r > 0$: amplification by a factor $(0.5 + s) \in [0.5, 1.5]$, rewarding high-fidelity reasoning (i.e., $s \to 1$).
  • For $r \le 0$: penalty scaling by $(2 - s) \in [1, 2]$, imposing stricter penalties on low-alignment failures ($s \to 0$).

This asymmetric design prevents "penalty evasion," a phenomenon observed in symmetric schemes where noisy or hallucinated loss samples escape sufficient penalization.
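A small numeric illustration of penalty evasion, using the `dsr` sketch from Section 1 and example values chosen for exposition (not figures from the paper):

```python
# A losing trade (r = -0.02) backed by poorly grounded reasoning (s = 0.1).
r, s = -0.02, 0.1
naive = r * s          # -0.002: symmetric multiplication nearly erases the loss,
                       # so the ungrounded rollout escapes penalization
penalized = dsr(r, s)  # -0.038: DSR instead scales the loss by (2 - s) = 1.9
```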

5. Theoretical Guarantees

The DSR mechanism possesses several theoretical advantages (§3.4):

  • Variance Reduction for Spurious Profits: When $r = r^* + \epsilon$ with $\epsilon \sim \mathcal{N}(0, \sigma_\text{noise}^2)$ and negligible reasoning signal ($r^* \approx 0$, $s \to 0$), DSR applies a coefficient of approximately $0.5$, reducing the variance of the policy-gradient signal by 75% compared to market-reward-only RL.
  • Signal Amplification for Valid Reasoning: For genuine, high-consistency reasoning ($s \to 1$, $r^* > 0$), the $1.5\times$ scaling boosts the signal-to-noise ratio above market-only baselines.
  • Asymmetric Penalty Regularization: DSR enforces up to a $2\times$ penalty for negative, hallucinated outcomes with low $s$, discouraging the accumulation of ungrounded losses.

These properties collectively constrain overfitting to stochastic reward noise and promote grounded, evidence-consistent reasoning.
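The 75% figure in the first bullet follows from the scaling property of variance. Under the stated assumption ($r^* \approx 0$, $s \to 0$), a spuriously positive return $r = \epsilon > 0$ enters the reward with coefficient $0.5 + s \approx 0.5$, so

$$\operatorname{Var}(0.5\,\epsilon) = 0.25\,\operatorname{Var}(\epsilon) = 0.25\,\sigma_\text{noise}^2,$$

i.e., a 75% reduction relative to the variance $\sigma_\text{noise}^2$ of the raw market reward.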

6. Comparative Analysis with FSR

The following table summarizes key contrasts between DSR and FSR in reward integration:

| Criterion | DSR | FSR |
|---|---|---|
| Functional Form | $r \times (0.5 + s)$ for $r > 0$; $r \times (2 - s)$ for $r \le 0$ | $r + 2s$ |
| Return Modulation | Adaptive, multiplicative, asymmetric | Constant additive |
| Penalty Structure | Amplifies penalties for low $s$ (up to $2\times$ for $r \le 0$, $s \to 0$) | Uniform bonus, no extra penalty |
| Suppression of Noisy Rewards | Yes (low $s$ yields strong down-weighting) | No |
| Amplification of High-Alignment Gains | Up to $1.5\times$ | Fixed $+2s$ |

DSR’s adaptive scaling suppresses low-trust noise-driven profits more effectively, while FSR’s additive structure risks alignment-floor exploitation during market drawdowns.
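As an illustrative contrast (example values chosen for exposition, not taken from the paper), consider a 1% gain under weak versus strong reasoning alignment, again using the `dsr` sketch from Section 1:

```python
r = 0.01
for s in (0.2, 0.95):
    print(f"s={s}: DSR={dsr(r, s):+.4f}, FSR={r + 2 * s:+.4f}")
# s=0.2:  DSR=+0.0070, FSR=+0.4100  -- DSR down-weights the weakly grounded profit
# s=0.95: DSR=+0.0145, FSR=+1.9100  -- FSR's additive +2s bonus dwarfs the return signal either way
```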

7. Empirical Performance and Ablation Findings

Empirical evaluation on the A-Share Market (July–Oct 2025) and the out-of-distribution US Market demonstrates DSR’s efficacy:

| Method | Cumulative Return | Sharpe | Similarity | Hallucination Rate |
|---|---|---|---|---|
| Market-Only | 37.62% | 3.028 | 0.437 | 22.5% |
| FSR | 39.38% | 3.065 | 0.956 | 0.39% |
| DSR | 37.76% | 3.036 | 0.974 | 0.12% |

In OOD US Market tests:

  • Market-Only: 12.63% return, Sharpe 1.712, similarity 0.659
  • FSR: 11.40% return, Sharpe 1.473, similarity 0.758
  • DSR: 15.34% return, Sharpe 1.951, similarity 0.777

DSR produces the highest reasoning fidelity (similarity $\approx 0.974$) and the lowest hallucination rate ($\approx 0.12\%$), while nearly matching the in-distribution returns of FSR. Notably, DSR generalizes best out-of-distribution, outperforming both Market-Only and FSR on both financial and reasoning-quality metrics.

Ablation studies confirm the necessity of DSR’s asymmetric formulation. Naïve multiplication ($G(r, s) = r \cdot s$) results in penalty evasion (similarity 0.578 on loss samples), whereas DSR maintains similarity $\approx 0.966$. The effectiveness of the full two-stage process (retrieval-augmented evidence plus triangular scoring) is substantiated by sharp drops in cumulative return and similarity when it is replaced with a single-context judge.

8. Significance and Implications

Dynamic-effect Semantic Reward provides an empirically and theoretically grounded solution to the challenge of reward hacking in RL applications involving verifiable but noisy returns, such as financial decision-making. By coupling stochastic outcome magnitudes with process-level semantic verification via adaptive asymmetric modulation, DSR enforces grounded reasoning, suppresses spurious rewards, and supports robust cross-market generalization (Sun et al., 7 Jan 2026). A plausible implication is that similar asymmetric, semantically gated reward formulations may be applicable in other high-noise RL domains where reward hacking is prevalent.
