Frequency-Aware Reward Functions
- Frequency-aware reward functions are reinforcement learning mechanisms that adjust rewards based on state-action visit frequencies, offering improved robustness in uncertain environments.
- They employ methods such as non-rectangular uncertainty sets and spectral reward decomposition to stabilize training and mitigate issues from progressive reward scales.
- Empirical studies demonstrate that these approaches boost worst-case performance and maintain consistency across benchmarks like Atari games and continuous control tasks.
A frequency-aware reward function refers to a class of reinforcement learning (RL) reward mechanisms and objectives that explicitly regularize or decompose reward contributions according to statistics—typically, frequencies—over state-action visits or the distributional structure of rewards. This design addresses problems related to robustness under adversarial or uncertain reward specification, as well as training instability when the magnitude of rewards varies widely. Two distinct, rigorously defined frequency-aware reward paradigms are represented by (1) frequency-regularized robust Markov decision processes (MDPs) with non-rectangular reward uncertainty sets (Gadot et al., 2023), and (2) spectral reward decomposition in value-based RL to handle progressive rewards (Dann et al., 2021).
1. Motivation and Problem Settings
Frequency-aware reward approaches emerge primarily from two settings:
- Robust MDPs with reward uncertainty: Standard robust MDPs assume worst-case rewards varying independently across states (s-rectangular uncertainty), which induces algorithms that are conservative due to the artificial decomposition of uncertainty. Replacing this with a coupled, non-rectangular uncertainty set (defined as an $L_p$-ball of radius $\alpha$ centered at a nominal reward function) permits joint variation across states, coupling the adversary's choice globally (Gadot et al., 2023).
- Progressive reward environments: Many RL benchmarks exhibit rewards whose magnitude grows over time (progressivity), such that late, high-magnitude rewards overwhelm learning, causing catastrophic forgetting of the policies needed for low-reward or earlier regions (Dann et al., 2021).
Both motivate frequency-aware mechanisms: in the first, occupancy-based regularization tightens policy sensitivity to frequented pathways under non-rectangular uncertainty, while in the second, spectral reward decomposition ensures the training loss remains balanced across the temporal and magnitude “frequency” structure of rewards.
2. Non-Rectangular Reward Robustness and Frequency Regularization
In robust MDPs with a fixed transition kernel $P$ and reward uncertainty set $\mathcal{R}_\alpha = \{\, r : \|r - r_0\|_p \le \alpha \,\}$, the worst-case robust return for a stationary policy $\pi$ is
$$ J_{\mathcal{R}_\alpha}(\pi) \;=\; \min_{r \in \mathcal{R}_\alpha} \; \langle d^{\pi}, r \rangle, $$
where $d^{\pi}$ denotes the discounted occupancy measure under $\pi$. By writing $r = r_0 + (r - r_0)$ with $\|r - r_0\|_p \le \alpha$, this becomes
$$ J_{\mathcal{R}_\alpha}(\pi) \;=\; \langle d^{\pi}, r_0 \rangle \;-\; \alpha\, \|d^{\pi}\|_q, $$
with $q$ the Hölder dual of $p$ (i.e., $1/p + 1/q = 1$). Thus, the robust optimization is exactly equivalent to maximizing the nominal return subject to a penalty on the $L_q$-norm of the occupancy measure:
$$ \max_{\pi}\; \langle d^{\pi}, r_0 \rangle \;-\; \alpha\, \|d^{\pi}\|_q. $$
This frequency-regularizer constrains not individual state rewards but the overall frequency with which state-action pairs are visited, providing a global coupled regularization that directly reflects the adversarial structure of the non-rectangular uncertainty set (Gadot et al., 2023). This “frequency-aware” penalty is absent from models employing s-rectangular uncertainty, in which regularization is localized and leads to overly conservative policies.
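A minimal NumPy check of this equivalence (illustrative values, not from the cited paper): for a fixed occupancy vector `d`, the worst case of $\langle d, r \rangle$ over the $L_p$-ball around a nominal reward `r0` equals $\langle d, r_0 \rangle - \alpha\|d\|_q$, and the minimizing perturbation is the Hölder-equality direction.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 12                      # number of state-action pairs (illustrative)
d = rng.random(n)
d /= d.sum()                # occupancy measure (nonnegative)
r0 = rng.normal(size=n)     # nominal reward vector
alpha, p = 0.3, 2.0         # uncertainty radius and L_p-ball exponent
q = p / (p - 1.0)           # Hölder dual exponent, 1/p + 1/q = 1

# Closed form: min_{||r - r0||_p <= alpha} <d, r> = <d, r0> - alpha * ||d||_q
closed_form = d @ r0 - alpha * np.linalg.norm(d, ord=q)

# The minimizing (worst-case) reward perturbation attains Hölder equality.
delta = -alpha * d ** (q - 1) / np.linalg.norm(d, ord=q) ** (q - 1)
assert np.isclose(np.linalg.norm(delta, ord=p), alpha)   # lies on the ball's boundary
assert np.isclose(d @ (r0 + delta), closed_form)         # attains the worst case

# Sanity check against random feasible rewards: none should do worse.
for _ in range(1000):
    u = rng.normal(size=n)
    u *= alpha * rng.random() / np.linalg.norm(u, ord=p)  # random point inside the ball
    assert d @ (r0 + u) >= closed_form - 1e-9
print(closed_form)
```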
3. Policy Gradient and Implementation under Frequency-Aware Regularization
Defining the regularized objective as
$$ J_\alpha(\pi_\theta) \;=\; \langle d^{\pi_\theta}, r_0 \rangle \;-\; \alpha\, \|d^{\pi_\theta}\|_q, $$
the policy gradient is given by
$$ \nabla_\theta J_\alpha(\pi_\theta) \;=\; \sum_{s,a} d^{\pi_\theta}(s,a)\, \nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi_\theta}_{\bar r}(s,a), $$
where $Q^{\pi_\theta}_{\bar r}$ is the standard state-action value function under the most adverse (worst-case) reward function $\bar r$, which is explicitly a function of the occupancy measure:
$$ \bar r(s,a) \;=\; r_0(s,a) \;-\; \alpha \left( \frac{d^{\pi_\theta}(s,a)}{\|d^{\pi_\theta}\|_q} \right)^{q-1}. $$
Algorithmic implementation involves:
- Estimating the occupancy measure $d^{\pi_\theta}$ via the fixed-point iteration $d(s,a) \leftarrow \pi_\theta(a \mid s)\big[\mu_0(s) + \gamma \sum_{s',a'} P(s \mid s',a')\, d(s',a')\big]$.
- Evaluating robust Q-values $Q^{\pi_\theta}_{\bar r}$ using TD or multi-step methods under the worst-case reward $\bar r$.
- Updating policy parameters $\theta$ via the likelihood-ratio gradient using $Q^{\pi_\theta}_{\bar r}$.
Under smoothness assumptions on the policy parameterization $\pi_\theta$, standard projected gradient ascent converges to $\epsilon$-stationary points in $\mathcal{O}(\epsilon^{-2})$ steps. The occupancy-norm regularizer maintains the smoothness of $J_\alpha$ because $\|d^{\pi_\theta}\|_q$ is itself a smooth function of $\theta$ (Gadot et al., 2023).
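The following NumPy sketch puts these steps together on a small random tabular MDP with a softmax policy and $p = 2$; the MDP, step size, and iteration counts are illustrative assumptions, not the cited paper's implementation.

```python
import numpy as np

def occupancy(P, policy, mu0, gamma, iters=400):
    """Fixed-point iteration for the discounted occupancy d(s,a) = sum_t gamma^t Pr(s_t=s, a_t=a)."""
    d = np.zeros_like(policy)
    for _ in range(iters):
        inflow = mu0 + gamma * np.einsum("sa,sat->t", d, P)   # discounted mass arriving at each state
        d = policy * inflow[:, None]
    return d

def q_values(P, policy, r, gamma, iters=400):
    """Tabular policy evaluation: Q(s,a) = r(s,a) + gamma * E[V(s')]."""
    Q = np.zeros_like(r)
    for _ in range(iters):
        V = (policy * Q).sum(axis=1)
        Q = r + gamma * np.einsum("sat,t->sa", P, V)
    return Q

def worst_case_reward(r0, d, alpha, q):
    """Adversarial reward on the L_p ball, an explicit function of the occupancy measure."""
    d_norm_q = (d ** q).sum() ** (1.0 / q)
    return r0 - alpha * (d / d_norm_q) ** (q - 1)

# --- illustrative random MDP with a tabular softmax policy ---
rng = np.random.default_rng(1)
S, A, gamma, alpha, p = 4, 2, 0.9, 0.2, 2.0
q = p / (p - 1.0)                                   # Hölder dual of p
P = rng.random((S, A, S)); P /= P.sum(axis=2, keepdims=True)
r0 = rng.normal(size=(S, A))
mu0 = np.full(S, 1.0 / S)
theta = np.zeros((S, A))                            # softmax logits

for step in range(300):
    policy = np.exp(theta - theta.max(axis=1, keepdims=True))
    policy /= policy.sum(axis=1, keepdims=True)
    d = occupancy(P, policy, mu0, gamma)
    r_bar = worst_case_reward(r0, d, alpha, q)      # adversary induced by the current occupancy
    Q = q_values(P, policy, r_bar, gamma)           # robust Q-values under r_bar
    adv = Q - (policy * Q).sum(axis=1, keepdims=True)
    theta += 0.5 * d * adv                          # likelihood-ratio (policy gradient) step

robust_return = (d * r0).sum() - alpha * (d ** q).sum() ** (1.0 / q)
print("frequency-regularized robust return:", robust_return)
```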
4. Spectral Reward Decomposition for Progressive Reward Tasks
A second paradigm for frequency-awareness is spectral reward decomposition as employed by Spectral DQN (Dann et al., 2021). Here, each scalar reward $r$ is expressed as a sum over “frequency” components:
$$ r \;=\; \sum_{i=0}^{N-1} b^{\,i}\, r_i, $$
with $b > 1$ a fixed base, and each component computed so that low frequencies saturate before higher ones activate, e.g., for $r \ge 0$,
$$ r_i \;=\; \min\!\Big(1,\; \max\!\Big(0,\; \frac{r - \sum_{j<i} b^{\,j}}{b^{\,i}}\Big)\Big), $$
where large-magnitude rewards activate higher frequency components. The DQN architecture is augmented with $N$ output heads $Q_0, \dots, Q_{N-1}$, each predicting the value for the $i$-th frequency band.
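As an illustration of this kind of saturating base-$b$ decomposition (a sketch consistent with the description above; the exact component formula used by Spectral DQN may differ in details such as sign handling), the Python functions below split a nonnegative reward into frequency components and recombine them:

```python
import numpy as np

def decompose(r, b=2.0, n_freqs=8):
    """Split a nonnegative reward into saturating 'frequency' components r_i in [0, 1]
    such that r = sum_i b**i * r_i; small rewards occupy only the low frequencies."""
    comps = np.zeros(n_freqs)
    remaining = float(r)
    for i in range(n_freqs):
        comps[i] = np.clip(remaining / b ** i, 0.0, 1.0)
        remaining -= b ** i * comps[i]
    return comps

def reconstruct(comps, b=2.0):
    """Inverse of decompose: collapse the frequency components back to a scalar reward."""
    return sum(b ** i * c for i, c in enumerate(comps))

for r in [0.25, 1.0, 3.0, 50.0]:
    comps = decompose(r)
    print(r, comps, reconstruct(comps))
```

Because each component lies in $[0, 1]$, the per-head targets remain bounded even when raw rewards grow by orders of magnitude, which is the property the multi-head value network relies on.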
For each frequency, per-head TD targets and losses $\mathcal{L}_i$ are computed, and the final loss is a variance-weighted sum:
$$ \mathcal{L} \;=\; \sum_{i=0}^{N-1} \frac{\mathcal{L}_i}{\sigma_i^{2}}, $$
where $\sigma_i$ tracks the running standard deviation of the $i$-th frequency's TD target. Losses are normalized to balance optimization pressure across frequencies, ensuring that high-magnitude, sparse events do not dominate learning. This method was shown to outperform reward clipping and alternative normalizations in domains exhibiting progressive rewards (Dann et al., 2021).
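A minimal sketch of such loss balancing, assuming per-head squared TD errors and a running second-moment estimate of each head's targets (the class name, momentum constant, and floor `eps` are illustrative, not from the paper):

```python
import numpy as np

class FrequencyLossBalancer:
    """Tracks running first/second moments of each frequency head's TD target and
    normalizes the per-head losses so that no single reward scale dominates."""

    def __init__(self, n_freqs, momentum=0.99, eps=1e-4):
        self.mean = np.zeros(n_freqs)
        self.sq_mean = np.ones(n_freqs)
        self.momentum, self.eps = momentum, eps

    def update(self, td_targets):
        """td_targets: array of shape (batch, n_freqs)."""
        m = self.momentum
        self.mean = m * self.mean + (1 - m) * td_targets.mean(axis=0)
        self.sq_mean = m * self.sq_mean + (1 - m) * (td_targets ** 2).mean(axis=0)

    def balanced_loss(self, predictions, td_targets):
        """Variance-weighted sum of per-frequency squared TD errors."""
        self.update(td_targets)
        var = np.maximum(self.sq_mean - self.mean ** 2, self.eps)   # running variance per head
        per_head = ((predictions - td_targets) ** 2).mean(axis=0)   # per-frequency MSE
        return (per_head / var).sum()
```

In a full agent, `td_targets` and `predictions` would be the per-frequency bootstrapped targets and head outputs for a minibatch.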
5. Empirical Results and Practical Significance
Frequency-aware reward functions, both as robustness regularizers and as spectral decompositions, have demonstrated empirical benefits:
- In robust MDPs with Gaussian reward noise (tabular settings), the frequency-regularized (non-rectangular) approach attains substantially higher worst-case CVaR of the return across reward draws, and is less sensitive to increasing the uncertainty radius $\alpha$ than conservative s-rectangular baselines (Gadot et al., 2023).
- In continuous control (e.g., MountainCarContinuous-v0, Ant-v3), frequency-regularized policies retain higher mean returns under previously unseen, adversarial reward perturbations, outperforming PPO and domain randomization (Gadot et al., 2023).
- In RL tasks with progressive rewards (e.g., Ponglantis, Exponential-Pong), spectral DQN transitions to high-reward phases and exceeds human-level performance, while ablations of frequency-based loss weighting collapse performance (Dann et al., 2021).
- On standard Atari benchmarks, spectral DQN matches or outperforms established normalization methods in $5/6$ games. Removing the variance-based balancing over frequencies degrades performance in the majority of cases (Dann et al., 2021).
6. Connections to Broader Methodologies
Frequency-aware reward function design directly relates to regularization under global model uncertainty and addresses issues missed by naïve, per-state robustness or scalar reward normalization. In robust MDPs, frequency regularization arises as the dual of an adversarial, non-rectangular reward perturbation and is distinct from standard rectangular (per-state) uncertainty, connecting to the literature on coupled uncertainty modeling and smooth occupancy-measure optimization.
In value-based RL, frequency-aware decompositions relate to multi-scale representation learning: spectral DQN’s per-frequency losses share structural similarities with Pop-Art normalization but go further by partitioning both magnitude and temporal structure of reward contributions. The approach preserves the ordering and representation of all reward scales, offering a principled alternative to reward clipping and target-compression methods.
7. Implementation and Theoretical Guarantees
Implementing frequency-aware regularization requires joint estimation of occupancy and Q-values, with policy iteration based on worst-case adversarial rewards computed as explicit functions of the estimated occupancy distribution (Gadot et al., 2023). For spectral DQN, implementation requires multi-head network outputs, spectral decomposition of rewards, variance tracking per head, and gradient normalization to stabilize learning (Dann et al., 2021).
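A hedged PyTorch-style sketch of the multi-head architecture just described (layer sizes, names, and the recombination helper are illustrative assumptions):

```python
import torch
import torch.nn as nn

class SpectralQNetwork(nn.Module):
    """Q-network with one output head per reward frequency band; the overall
    Q-value is recovered by recombining the heads with powers of the base b."""

    def __init__(self, obs_dim, n_actions, n_freqs=8, base=2.0, hidden=256):
        super().__init__()
        self.base, self.n_freqs = base, n_freqs
        self.trunk = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # One linear head per frequency band, each predicting Q_i(s, a) for all actions.
        self.heads = nn.ModuleList([nn.Linear(hidden, n_actions) for _ in range(n_freqs)])

    def forward(self, obs):
        z = self.trunk(obs)
        # Shape (batch, n_freqs, n_actions): per-frequency value estimates.
        return torch.stack([head(z) for head in self.heads], dim=1)

    def combined_q(self, obs):
        q_per_freq = self.forward(obs)
        weights = self.base ** torch.arange(self.n_freqs, dtype=obs.dtype, device=obs.device)
        return (weights.view(1, -1, 1) * q_per_freq).sum(dim=1)   # (batch, n_actions)
```

Action selection can be driven by `combined_q`, while the per-frequency outputs supply the per-head TD targets consumed by the variance-balanced loss sketched earlier.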
Theoretical guarantees include global convergence of projected gradient ascent to $\epsilon$-stationary points for frequency-regularized robust MDPs, attributed to the smoothness of the regularizer and the occupancy mapping (Gadot et al., 2023). In practice, the use of frequency-aware losses or regularizers has been shown to improve robustness, stability, and adaptability in both adversarially perturbed and numerically challenging RL environments.