
Dynamic-effect Semantic Reward (DSR)

Updated 8 January 2026
  • The paper introduces DSR, a mechanism that dynamically modulates rewards by coupling market returns with semantic alignment to suppress reward hacking.
  • It employs a triangular consistency metric and evidence retrieval to compute a semantic score that adaptively scales positive rewards and deepens penalties for negative outcomes.
  • Empirical results show DSR improves reasoning fidelity and cross-market generalization, offering significant variance reduction over traditional reward schemes.

Dynamic-effect Semantic Reward (DSR) is a reward integration mechanism designed for reinforcement learning (RL) in stochastic environments, specifically targeting the suppression of reward hacking and the amplification of genuine reasoning-driven behaviors in LLMs. As a core component of the Trade-R1 framework, DSR facilitates robust process-level reasoning verification by coupling the magnitude of inherently noisy market returns with a learned semantic alignment score, thereby gating policy optimization by both financial outcome and reasoning fidelity (Sun et al., 7 Jan 2026).

1. Formal Definition

The Dynamic-effect Semantic Reward (DSR), denoted $G(r, s)$, jointly incorporates the observed market reward $r \in \mathbb{R}$ (e.g., the 10-day excess return) and a semantic similarity score $s \in [0, 1]$. The score $s$ is computed via a triangular consistency metric assessing the pairwise alignment between retrieved evidence, reasoning chains, and decisions. DSR is defined by a piecewise function:

$$G(r, s) = \begin{cases} r \times (0.5 + s), & r > 0 \\ r \times (2 - s), & r \le 0 \end{cases}$$

This structure enables multiplicative modulation of the reward: positive returns are amplified in proportion to reasoning consistency, while negative returns incur greater penalties when semantic alignment is low.
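The piecewise rule maps directly to a few lines of code. The following is a minimal sketch in Python; the function name and signature are illustrative, not taken from the paper.

```python
def dsr(r: float, s: float) -> float:
    """Dynamic-effect Semantic Reward G(r, s).

    r: observed market reward, e.g., a 10-day excess return.
    s: semantic similarity score in [0, 1] from triangular consistency scoring.
    """
    if r > 0:
        return r * (0.5 + s)   # profits amplified in proportion to reasoning consistency
    return r * (2.0 - s)       # losses penalized more heavily when alignment is low
```

For example, `dsr(0.02, 0.9)` returns 0.028 (amplified), while `dsr(-0.02, 0.1)` returns -0.038 (deepened penalty).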

2. Derivation and Integration Within Trade-R1

DSR is embedded in the Trade-R1 RL pipeline, which employs Group Relative Policy Optimization (GRPO) to train a policy $\pi_\theta$ on context–return pairs. The implementation sequence is as follows:

  • Evidence Retrieval: For each rollout $y = (c, d)$ (reasoning chain $c$ and decision $d$), a concise evidence set $E$ is extracted using string matching and semantic-embedding ranking (§3.5.1).
  • Triangular Consistency Scoring: Three LLM-Judge scores are computed:

    1. $S_{E,c}$: factuality of $c$ given $E$
    2. $S_{c,d}$: logical deduction of $d$ from $c$
    3. $S_{E,d}$: consistency of $d$ with $E$

The semantic similarity score is then $s = \frac{S_{E,c} + S_{c,d} + S_{E,d}}{3}$.

  • Reward Fusion: DSR is directly applied: $G_i = G(r_i, s_i)$ (§3.3.2).

  • Advantage Estimation: Rollouts are grouped, yielding $A_i = \frac{G_i - \operatorname{mean}_j(G_j)}{\operatorname{std}_j(G_j)}$.

  • Policy Update: The GRPO surrogate objective is maximized with these normalized advantages.

DSR’s gating of policy gradients by both return magnitude and reasoning consistency distinguishes it from conventional RL reward structures in stochastic domains.
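The fusion and normalization steps admit a compact sketch, reusing the `dsr` function above. The helper names `semantic_score` and `group_advantages` are hypothetical and not the paper's API.

```python
import statistics

def semantic_score(s_ec: float, s_cd: float, s_ed: float) -> float:
    # Triangular consistency: mean of the three LLM-Judge scores
    # (evidence->chain factuality, chain->decision deduction, evidence->decision consistency).
    return (s_ec + s_cd + s_ed) / 3.0

def group_advantages(market_rewards, judge_triples, eps=1e-8):
    # Fuse each rollout's market return with its semantic score via DSR,
    # then normalize within the rollout group: A_i = (G_i - mean_j G_j) / std_j G_j.
    fused = [dsr(r, semantic_score(*t)) for r, t in zip(market_rewards, judge_triples)]
    mu, sd = statistics.mean(fused), statistics.pstdev(fused)
    return [(g - mu) / (sd + eps) for g in fused]
```

For instance, `group_advantages([0.03, -0.01, 0.00], [(0.9, 0.8, 0.85), (0.2, 0.3, 0.25), (0.5, 0.5, 0.5)])` yields one normalized advantage per rollout in the group.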

3. Training Pipeline in Stochastic Environments

The integration of DSR within RL for stochastic environments involves the following algorithmic steps per training iteration:

  1. Batch Sampling: Sample $N$ contexts $\{x_n\}$.

  2. Rollout Generation: For each $x_n$, generate $G$ rollouts $y_{n,1\ldots G} \sim \pi_{\theta_\text{old}}$.

  3. Reward Computation:

    • Execute action $d$ to obtain $r_{n,g}$.
    • Compute $s_{n,g}$ via evidence retrieval and triangular scoring.
    • Apply $G_{n,g} = G(r_{n,g}, s_{n,g})$.
  4. Group Normalization: Calculate advantages $A_{n,1\ldots G}$ across each group.
  5. Policy Optimization: Update $\theta$ by maximizing the aggregated PPO objective.

This pipeline ensures each policy update is modulated by both market outcomes and process-level semantic verification.
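The per-iteration loop can be sketched as follows, building on the helpers defined earlier. All callable arguments (`sample_contexts`, `rollout_old_policy`, `execute`, `retrieve_evidence`, `judge`, `grpo_update`) are hypothetical stand-ins for the corresponding Trade-R1 components, not the authors' actual interfaces.

```python
def train_iteration(sample_contexts, rollout_old_policy, execute, retrieve_evidence,
                    judge, grpo_update, n_contexts=8, n_rollouts=4):
    """One DSR-gated training iteration following steps 1-5 above."""
    for x in sample_contexts(n_contexts):                            # 1. batch sampling
        rollouts, market_rewards, judge_triples = [], [], []
        for _ in range(n_rollouts):                                  # 2. rollouts from the old policy
            chain, decision = rollout_old_policy(x)
            rollouts.append((chain, decision))
            market_rewards.append(execute(decision))                 # 3a. realized return r_{n,g}
            evidence = retrieve_evidence(x, chain)
            judge_triples.append(judge(evidence, chain, decision))   # 3b. (S_Ec, S_cd, S_Ed)
        advantages = group_advantages(market_rewards, judge_triples) # DSR fusion + 4. normalization
        grpo_update(x, rollouts, advantages)                         # 5. maximize the surrogate objective
```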

4. Coupling Mechanism and Asymmetry

DSR distinguishes itself from the Fixed-effect Semantic Reward (FSR) through its dynamic, asymmetric coupling of reward magnitude and semantic alignment. While FSR simply adds a constant alignment bonus, $G_\text{FSR}(r, s) = r + \alpha s$ (with $\alpha$ fixed, e.g., $\alpha = 2$), DSR adaptively gates returns as follows:

  • For $r > 0$: amplification by a factor $(0.5 + s) \in [0.5, 1.5]$, rewarding high-fidelity reasoning (i.e., $s \to 1$).
  • For $r \le 0$: penalty scaling by $(2 - s) \in [1, 2]$, imposing stricter penalties on low-alignment failures ($s \to 0$).

This asymmetric design prevents "penalty evasion," a phenomenon observed in symmetric schemes where noisy or hallucinated loss samples escape sufficient penalization.
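A small numeric illustration of penalty evasion, using the `dsr` sketch from Section 1 and example values chosen for exposition (not figures from the paper):

```python
# A losing trade (r = -0.02) backed by poorly grounded reasoning (s = 0.1).
r, s = -0.02, 0.1
naive = r * s          # -0.002: symmetric multiplication nearly erases the loss,
                       # so the ungrounded rollout escapes penalization
penalized = dsr(r, s)  # -0.038: DSR instead scales the loss by (2 - s) = 1.9
```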

5. Theoretical Guarantees

The DSR mechanism possesses several theoretical advantages (§3.4):

  • Variance Reduction for Spurious Profits: When $r = r^* + \epsilon$ with $\epsilon \sim \mathcal{N}(0, \sigma_\text{noise}^2)$ and negligible reasoning signal ($r^* \approx 0$, $s \to 0$), DSR applies a coefficient of approximately $0.5$, reducing the variance of the policy-gradient signal by 75% compared to market-reward-only RL.
  • Signal Amplification for Valid Reasoning: For genuine, high-consistency reasoning ($s \to 1$, $r^* > 0$), the $1.5\times$ scaling boosts the signal-to-noise ratio above market-only baselines.
  • Asymmetric Penalty Regularization: DSR enforces up to a $2\times$ penalty for negative, hallucinated outcomes with low $s$, discouraging the accumulation of ungrounded losses.

These properties collectively constrain overfitting to stochastic reward noise and promote grounded, evidence-consistent reasoning.
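The 75% figure in the first bullet follows from the scaling property of variance. Under the stated assumption ($r^* \approx 0$, $s \to 0$), a spuriously positive return $r = \epsilon > 0$ enters the reward with coefficient $0.5 + s \approx 0.5$, so

$$\operatorname{Var}(0.5\,\epsilon) = 0.25\,\operatorname{Var}(\epsilon) = 0.25\,\sigma_\text{noise}^2,$$

i.e., a 75% reduction relative to the variance $\sigma_\text{noise}^2$ of the raw market reward.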

6. Comparative Analysis with FSR

The following table summarizes key contrasts between DSR and FSR in reward integration:

| Criterion | DSR | FSR |
|---|---|---|
| Functional Form | $r \times (0.5 + s)$ for $r > 0$; $r \times (2 - s)$ for $r \le 0$ | $r + 2s$ |
| Return Modulation | Adaptive, multiplicative, asymmetric | Constant additive |
| Penalty Structure | Amplifies penalties for low $s$ (up to $2\times$ for $r \le 0$, $s \to 0$) | Uniform bonus, no extra penalty |
| Suppression of Noisy Rewards | Yes (low $s$ yields strong down-weighting) | No |
| Amplification of High-Alignment Gains | Up to $1.5\times$ | Fixed $+2s$ |

DSR’s adaptive scaling suppresses low-trust noise-driven profits more effectively, while FSR’s additive structure risks alignment-floor exploitation during market drawdowns.
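As an illustrative contrast (example values chosen for exposition, not taken from the paper), consider a 1% gain under weak versus strong reasoning alignment, again using the `dsr` sketch from Section 1:

```python
r = 0.01
for s in (0.2, 0.95):
    print(f"s={s}: DSR={dsr(r, s):+.4f}, FSR={r + 2 * s:+.4f}")
# s=0.2:  DSR=+0.0070, FSR=+0.4100  -- DSR down-weights the weakly grounded profit
# s=0.95: DSR=+0.0145, FSR=+1.9100  -- FSR's additive +2s bonus dwarfs the return signal either way
```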

7. Empirical Performance and Ablation Findings

Empirical evaluation on the A-Share Market (July–Oct 2025) and the out-of-distribution US Market demonstrates DSR’s efficacy:

| Method | Cumulative Return | Sharpe | Similarity | Hallucination Rate |
|---|---|---|---|---|
| Market-Only | 37.62% | 3.028 | 0.437 | 22.5% |
| FSR | 39.38% | 3.065 | 0.956 | 0.39% |
| DSR | 37.76% | 3.036 | 0.974 | 0.12% |

In OOD US Market tests:

  • Market-Only: 12.63% return, Sharpe 1.712, similarity 0.659
  • FSR: 11.40% return, Sharpe 1.473, similarity 0.758
  • DSR: 15.34% return, Sharpe 1.951, similarity 0.777

DSR produces the highest reasoning fidelity (similarity $\approx 0.974$) and the lowest hallucination rate ($\approx 0.12\%$), while nearly matching the in-distribution returns of FSR. Notably, DSR generalizes best out-of-distribution, outperforming both Market-Only and FSR on both financial and reasoning-quality metrics.

Ablation studies confirm the necessity of DSR’s asymmetric formulation. Naïve multiplication ($G(r, s) = r \cdot s$) results in penalty evasion (similarity 0.578 on loss samples), whereas DSR maintains similarity $\approx 0.966$. The effectiveness of the full two-stage process (retrieval-augmented evidence plus triangular scoring) is substantiated by sharp drops in cumulative return and similarity when it is replaced with a single-context judge.

8. Significance and Implications

Dynamic-effect Semantic Reward provides an empirically and theoretically grounded solution to the challenge of reward hacking in RL applications involving verifiable but noisy returns, such as financial decision-making. By coupling stochastic outcome magnitudes with process-level semantic verification via adaptive asymmetric modulation, DSR enforces grounded reasoning, suppresses spurious rewards, and supports robust cross-market generalization (Sun et al., 7 Jan 2026). A plausible implication is that similar asymmetric, semantically gated reward formulations may be applicable in other high-noise RL domains where reward hacking is prevalent.
