DRPO: Domain-aware Relative Policy Optimization

Updated 9 December 2025

DRPO is a reinforcement learning objective that extends critic-free policy optimization with adaptive, hierarchical reward scaling and explicit domain awareness.
It clusters rewards by domain and task difficulty, applying temperature scaling to upweight rare and complex samples for improved stability.
Empirical results show DRPO achieves significant performance gains in clinical multimodal and policy transfer tasks while maintaining computational efficiency.

Domain-aware Relative Policy Optimization (DRPO) is a reinforcement learning (RL) objective that extends critic-free policy optimization via adaptive, hierarchical reward scaling with explicit domain-awareness. DRPO is designed to address training instability and performance skew that arise in settings with data heterogeneity across domains and task difficulty, such as clinical multimodal instruction-following and cross-environment policy transfer. DRPO appears in distinct but convergent instantiations: as a temperature-scaled, reward-grouped advantage reweighting for critic-free RLHF in multimodal LLMs (Dai et al., 31 May 2025), and as a theoretical and algorithmic framework for domain-aware policy transfer in RL, forming a core part of the Relative Policy-Transition Optimization (RPTO) algorithm (Xu et al., 2022).

1. Theoretical Foundation and Objective

DRPO in the RL literature builds on the notion of value relativity between policies and domains (MDPs). For two environments—source $\mathcal{E}$ with kernel $P$ and target $\mathcal{E}'$ with $P'$ —the relativity gap is formalized by

$J(P',\pi) - J(P,\pi')$

where $J(P,\pi)$ is the expected discounted return under policy $\pi$ in environment $P$ .

Domain-aware RPO (Relative Policy Optimization) decomposes this as a “dynamics gap” and a “policy gap”:

$J(P',\pi)-J(P,\pi^*) \approx \mathbb{E}_{s,a,s’ \sim P’,\pi} [r(s,a,s') + \gamma V^{P,\pi^*}(s')] - C(\delta_1,\delta_2)$

where $C(\delta_1,\delta_2)$ quantifies the impact of domain divergence ( $\delta_1$ ) and policy divergence ( $\delta_2$ ) via explicit total-variation bounds.

The DRPO surrogate loss thus takes the form

$L_{\text{RPO}}(\theta) = \mathbb{E}_{s \sim d^{P',\pi_{\text{old}}},\, a \sim \pi_\theta,\, s' \sim P'} \left[ \frac{\pi_\theta(a|s)}{\pi_{\text{old}}(a|s)} \left( r(s,a,s') + \gamma V^{P,\pi_{\text{old}}}(s') \right) \right]$

with a PPO-style KL regularization and (optionally) entropy bonus for numerical and monotonicity guarantees (Xu et al., 2022).

2. Hierarchical Reward Scaling and Normalization

In large-scale, instruction-tuning settings, DRPO introduces a two-stage hierarchical temperature scaling based on (i) domain rarity and (ii) modality or task difficulty (Dai et al., 31 May 2025). For each training iteration:

Prompts are grouped by domain $g$ (e.g., "Chest X-ray", "ECG").
For each domain, per-prompt reward vectors are clustered via K-means (elbow method), yielding intra-domain clusters reflecting difficulty strata.
Temperature factors are computed:

$T_{(g,t)} = \max\left( \sqrt{N_{(g,t)} \cdot \mu_{(g,t)}},\,\varepsilon \right)$

$T_{(c,g,t)} = \max\left( \sqrt{N_{(c,g,t)} \cdot \mu_{(c,g,t)}},\,\varepsilon \right)$

where $N$ denotes prompt count (rarity metric) and $\mu$ average reward (difficulty). High $N$ (common) and high $\mu$ (easy) samples get larger $T$ and thus smaller weight, while rare or difficult samples are upweighted.

Final advantage scaling per sample is:

$s^{\text{scaled}}_{(q,i,t)} = m_{(q,i,t)} \frac{s_{(q,i,t)}}{T_{(g(q),t)} T_{(c(q),g(q),t)}}$

with $m_{(q,i,t)}$ a KL-aware damping factor, and $s_{(q,i,t)}$ is GRPO-normalized reward.

Renormalization ensures unit variance across the minibatch, yielding the DRPO advantage $\hat A^{\mathrm{DRPO}}$ for the policy gradient surrogate.

3. Algorithmic Workflow

DRPO is instantiated as follows in group-based RLHF:

Initialize policy $\pi_\theta$ (pretrained) and reference policy $\pi_{\mathrm{ref}}$ .
For each batch and domain:
- Sample $K$ responses per prompt; compute rewards.
- Normalize rewards per prompt (GRPO).
- Cluster within domains by reward vector.
- For each domain and cluster, compute temperature factors.
- For each sample, assign scaled advantage with KL-damping.
- Renormalize advantages across batch.
- Update policy by computing clipped surrogate objective with KL regularization.
Update $\theta$ ; repeat for $T$ iterations.

This preserves the sample/adaptivity benefits of critic-free RL while introducing minimal $O(K)$ clustering overhead.

In RL transfer, DRPO interleaves experience collection in both source and target environments:

Policy is updated using batch-targeted surrogates which combine value and reward estimation in both domains.
When integrated with Relative Transition Optimization (RTO), model transition dynamics are updated to match the new environment, yielding RPTO (Xu et al., 2022).

4. Key Hyperparameters and Implementation Details

Critical hyperparameters from clinical foundation model application (Dai et al., 31 May 2025) include:

Clusters: elbow selection, tolerance $\tau=0.10$ , $k_\text{max} = \min(10, N_\text{unique})$
KL-aware damping percentile $p=0.90$
Normalization/temperature floor $\varepsilon \sim 10^{-6}$
GRPO clip $\epsilon=0.2$ , KL weight $\beta = 1\times 10^{-4}$
Reward composition: F1 (0.6), IoU (0.2), format/coherence (0.2)
Optimization: AdamW, learning rate $10^{-6}$ , batch size per device 4, rollout batch 512, context 8192, mixed precision
Hardware: 8 $\times$ A100 for 7B model, 4 $\times$ H200 for 32B
Distributed via FSDP + vLLM infrastructure

For policy transfer, DRPO alternates policy updates from source/target domain with a fixed ratio, with model-based RL components (ensemble NNs, horizon 1) for RTO (Xu et al., 2022).

5. Empirical Results and Ablations

Applications demonstrate DRPO’s efficacy:

In clinical multimodal models (QoQ-Med), DRPO yields a 43% average macro-F1 boost across 8 vision domains over critic-free GRPO, with largest relative gains in rare/complex modalities (ultrasound, mammography). On segmentation, DRPO-trained models achieve IoU 10 $\times$ higher than open baselines, matching OpenAI o4-mini (Dai et al., 31 May 2025).
Ablations show: domain-only scaling gives +29.5% F1 over baseline; clustering adds +10.4%; KL damping +1.6%.
On synthetic balanced data, DRPO maintains advantages over static resampling, up-/down-weighting, or focal loss, confirming the benefits of hierarchical, reward-driven adaptivity.
In continuous control policy transfer, DRPO alone outperforms SAC-warm and TRPO-warm for mild domain shifts; full RPTO (DRPO+RTO) enables robust transfer and rapid target domain learning under large dynamics gaps, outperforming MBPO, SLBO, PDML, and PTP (Xu et al., 2022).

Reward computation and overhead remain negligible (under 2% per optimization step).

6. Applicability, Insights, and Limitations

DRPO is particularly suited for regimes with highly imbalanced data distributions or task difficulty:

Prevents overfitting to abundant/easy domains by amplifying gradient signal from rare/critical or harder tasks.
Maintains robustness even when data are artificially balanced, due to its dynamic clustering and scaling adapting to residual difficulty and task drift.
Avoids variance and computational cost of value networks (critic-free), while reintroducing per-sample adaptivity in gradient weighting.

In domain transfer, DRPO’s coupling to both policy and dynamics modeling ensures monotonic improvement and explicit domain-preservation properties, with theoretical error bounds as a function of domain and policy divergence (Xu et al., 2022).

A plausible implication is that DRPO generalizes beyond clinical AI and control scenarios to any large-scale RL scenario with non-uniform data, domain-skew, or evolving reward landscapes. No evidence is provided regarding performance in small-data or extremely high-noise regimes.

DRPO is built atop Group Relative Policy Optimization (GRPO) but replaces the static per-prompt normalization with adaptive, data-driven scaling.
Compared to standard DPO, DRPO introduces hierarchical temperature scaling without recourse to a value network, distinct from critic-driven advantage estimation.
In control, DRPO is tightly integrated with Relative Transition Optimization (RTO), yielding the RPTO algorithm for simultaneous policy and model transfer (Xu et al., 2022).
Empirical evaluations situate DRPO as superior in data efficiency, asymptotic performance, and rare-domain emphasis relative to both classical and contemporary baselines (e.g., PPO-style, SAC, DPO/GRPO, focal loss, static resampling, MBPO, SLBO, PDML, PTP).
Temperature-scaling and hierarchical clustering distinguish DRPO’s reward normalization from earlier RLHF and policy transfer strategies, providing both theoretical and practical guarantees of domain-aware adaptivity.

DRPO represents an evolution in reinforcement learning methodology, generalizing critic-free optimization to heterogeneity-aware, reward-scaled policy improvement suitable for both multimodal foundation models and cross-domain policy transfer under domain shift (Dai et al., 31 May 2025, Xu et al., 2022).