Entropy-Regularized DROPO (E-DROPO)

Updated 15 October 2025
  • Entropy-Regularized DROPO (E-DROPO) is a framework that integrates entropy regularization with offline domain randomization to achieve robust sim-to-real transfer.
  • It augments likelihood-based parameter estimation with a Shannon entropy bonus, preventing covariance collapse and promoting efficient exploration of high-dimensional dynamics.
  • E-DROPO offers provable consistency and tighter sim-to-real error bounds compared to traditional uniform domain randomization methods in various control tasks.

Entropy-Regularized DROPO (E-DROPO) refers to a class of methods that integrate entropy-based regularization into DROPO-style offline domain randomization, spanning both theoretical foundations and practical algorithmic schemes in reinforcement learning, optimal transport, robust control, and sim-to-real transfer. Entropy regularization plays a dual role: it induces stochasticity that promotes exploration and robustness, and it improves algorithmic tractability and sample efficiency, especially in high-dimensional regimes. The recent literature formalizes E-DROPO frameworks both in control and learning settings (such as entropy-regularized MDPs and RL) and in actuarial or transport settings (such as entropy-regularized Wasserstein distances), with applications to domain randomization, robust planning, simulation-to-reality transfer, and statistical estimation.

1. Formalization and Objectives

Offline Domain Randomization (ODR) is formalized as a maximum-likelihood estimation (MLE) problem over a parametric simulator family. The simulator parameters ξ are distributed according to a parametric form p_φ(ξ), typically a Gaussian 𝒩(μ, Σ). Given an offline dataset 𝒟={ (s,a,s′) } collected from the real system (whose ground-truth dynamics correspond to some unknown parameter ξ*), the goal is to choose parameters φ = (μ, Σ) maximizing the log-likelihood of observed transitions:

\phi^* = \underset{\phi}{\arg\max} \sum_{(s,a,s')\in\mathcal{D}} \log \left( \mathbb{E}_{\xi\sim p_\phi}\left[ P_\xi(s' \mid s, a) \right] \right)

Entropy-regularized DROPO ("E-DROPO" [Editor's term]) augments this MLE objective with a Shannon entropy bonus on the parameter distribution:

\text{Objective} = \mathcal{L}(\phi) + \beta\, H(p_\phi), \qquad H(p_\phi) = -\int p_\phi(\xi) \log p_\phi(\xi)\, d\xi

where β > 0 is the regularization strength. For Gaussian p_φ, the entropy can be computed explicitly: H(\mathcal{N}(\mu, \Sigma)) = \tfrac{1}{2} \log\left((2\pi e)^d \det \Sigma\right).

The entropy term discourages variance collapse (Σ → 0) before the mean μ aligns with the true parameter ξ*, leading to broader, more representative randomization during policy training. This broadened exploration is crucial for zero-shot sim-to-real transfer performance.
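
To make the role of the bonus concrete, the following minimal sketch (assuming NumPy; the example covariances are illustrative, not from the source) evaluates the closed-form Gaussian entropy and shows how it penalizes covariance collapse:

import numpy as np

def gaussian_entropy(Sigma):
    # H(N(mu, Sigma)) = 0.5 * log((2*pi*e)^d * det(Sigma)); independent of mu
    d = Sigma.shape[0]
    return 0.5 * np.log((2 * np.pi * np.e) ** d * np.linalg.det(Sigma))

# The entropy drops toward -inf as the covariance shrinks, so the bonus beta * H(p_phi)
# penalizes premature variance collapse (Sigma -> 0).
print(gaussian_entropy(np.eye(2)))          # ~  2.84
print(gaussian_entropy(0.01 * np.eye(2)))   # ~ -1.77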

2. Theoretical Guarantees: Consistency and Sim-to-Real Gap

The consistency of the ODR (and thus E-DROPO) estimator holds under mild assumptions:

  • Regularity: The simulator mapping (\xi, s, a) \mapsto P_\xi(s' \mid s, a) is continuous/regular.
  • Compactness: The set of allowable simulator parameters is compact.
  • Mixture Positivity: The mixture of transition probabilities induced by any parameter distribution is strictly positive over the observed transitions.
  • Identifiability: Only the degenerate distribution at ξ* exactly recovers the true environment's transitions.

Under these conditions, the MLE estimate \hat{\phi}_n (from n samples) converges in probability to φ* = (ξ*, 0):

\hat{\phi}_n \xrightarrow{p} \phi^* \qquad \text{as}\ n \to \infty

If the transition log-likelihood is Lipschitz-continuous in φ (uniformly), the convergence holds almost surely.

Gap bounds are derived by introducing the notion of "α-informativeness": the fitted parameter distribution eventually assigns at least α probability to any ε-ball around the true parameter. For finite-simulator settings with M candidate models, the sim-to-real gap for uniform DR scales as O(M^3 \log(MH)) (H: planning horizon), while ODR enjoys bounds up to an O(M) factor tighter (e.g., O(M^2 \log(MH)) or O(\sqrt{MH \log(MH)}) in closely related settings). This suggests that fitting a data-informed parameter distribution via ODR (and by extension, E-DROPO) sharply reduces sim-to-real error as compared to hand-tuned, data-agnostic randomization.
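
As a rough numerical illustration of these rates (constants and problem-specific factors omitted; the horizon and model counts below are illustrative, not from the source), the quoted bounds differ by a factor of M:

import math

H = 100                                   # planning horizon (illustrative value)
for M in (10, 100, 1000):                 # number of candidate simulator models
    uniform_dr = M**3 * math.log(M * H)   # uniform DR bound, O(M^3 log(MH))
    odr = M**2 * math.log(M * H)          # ODR / E-DROPO bound, O(M^2 log(MH))
    print(f"M={M:4d}  uniform DR ~ {uniform_dr:.1e}  ODR ~ {odr:.1e}  ratio = {uniform_dr / odr:.0f}")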

3. Comparison with Standard Domain Randomization

Traditional domain randomization (DR) samples simulator parameters uniformly over a preset range, often ignoring real-world data. This leads to overly conservative, wide-ranging randomization; as the possible simulator set size M increases, this approach suffers from rapidly worsening sim-to-real bounds. The key contrasts are:

| Aspect | Uniform DR | ODR / E-DROPO |
|---|---|---|
| Parameter fitting | Uniform, fixed | MLE fit from real data |
| Variance behavior | Manually tuned | Data-driven; entropy regularized (broad by default) |
| Sim-to-real gap (finite M) | O(M^3 \log(MH)) | Up to O(M^2 \log(MH)) or O(\sqrt{MH \log(MH)}) |
| Adaptivity | None | Data-adaptive, robust |

Empirically and theoretically, entropy-regularized ODR (E-DROPO) ensures that randomization remains broad enough to avoid collapse while still being tightly focused on realistic (data-supported) parameters.

4. Algorithmic Implementation and Role of Entropy

The practical E-DROPO update augments the parameter log-likelihood with an entropy bonus:

import numpy as np

def edropo_objective(D, P_xi, beta, n_samples=100):
    # D: dataset of real transitions (s, a, s'); beta: entropy weight
    # P_xi(xi, s, a, s_prime): simulator likelihood of s' given (s, a) under parameters xi
    def objective(mu, Sigma):
        d = len(mu)
        xis = np.random.multivariate_normal(mu, Sigma, size=n_samples)  # xi ~ p_phi = N(mu, Sigma)
        loglik = sum(np.log(np.mean([P_xi(xi, s, a, s_prime) for xi in xis]))
                     for (s, a, s_prime) in D)                          # Monte Carlo E_xi[P_xi(s' | s, a)]
        entropy = 0.5 * np.log((2 * np.pi * np.e) ** d * np.linalg.det(Sigma))  # closed-form Gaussian entropy
        return loglik + beta * entropy
    # phi = (mu, Sigma) is then updated by (stochastic) gradient ascent on this objective
    return objective
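
As a hypothetical usage sketch (the toy simulator P_xi, dataset D, and parameter values below are illustrative and not from the source), the objective can be evaluated for a one-dimensional simulator parameter:

# Toy 1-D system: s' ~ N(s + xi * a, 0.1), where xi is the unknown dynamics parameter.
def P_xi(xi, s, a, s_prime):
    return np.exp(-0.5 * ((s_prime - (s + xi[0] * a)) / 0.1) ** 2) / (0.1 * np.sqrt(2 * np.pi))

D = [(0.0, 1.0, 0.52), (1.0, -1.0, 0.47)]                # toy "real" transitions (s, a, s')
objective = edropo_objective(D, P_xi, beta=0.1)
print(objective(np.array([0.5]), np.array([[0.05]])))    # objective value at mu = 0.5, Sigma = 0.05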

The entropy regularizer keeps the learned parameter covariance Σ from collapsing to zero, enhancing exploration during policy training across the plausible dynamics range. Once trained, the policy generalizes more robustly under sim-to-real transfer.

Experimental results on both tabular control problems and continuous control tasks (e.g., Robosuite Lift) showed that E-DROPO achieves parameter means closer to ξ* (measured in L²), lower mean squared error, and improved zero-shot transfer compared to standard DROPO without entropy (Fickinger et al., 11 Jun 2025).

5. Practical Implications and Applications

E-DROPO is particularly effective for sim-to-real transfer in reinforcement learning for robotics and autonomous systems, where the agent's behavior must be robust to mismatches between training-time simulation and deployment-time real-world dynamics. Unlike hand-tuned uniform randomization, E-DROPO leverages real-world data to fit both the location and spread of simulator parameter distributions, providing:

  • Data-driven calibration: Simulator parameters are optimized to match empirically observed dynamics, rather than hand-set.
  • Entropy-mediated exploration: The entropy bonus prevents overfitting to finite data, maintaining robustness to model misspecification.
  • Provable transfer guarantees: E-DROPO enjoys gap bounds that are up to an O(M) factor tighter than uniform DR, making it more suitable for large simulator families and continuous parameter spaces.

Applications extend to dexterous manipulation, legged locomotion, and safety-critical systems where simulator fidelity and generalization are paramount.

6. Limitations and Considerations

The theoretical bounds and consistency guarantees of E-DROPO require basic regularity (continuity, identifiability) of the simulator family. In high-dimensional or strongly multimodal parameter spaces, the practical convergence rate of the estimator and the adequacy of the Gaussian approximation may become limiting factors. Additionally, maintaining substantial entropy (a broad covariance) can slow fine-tuning of the fitted distribution toward ξ*, potentially requiring careful adjustment of β as data volume increases.

A plausible implication is that adaptive schemes for tuning β, or non-Gaussian parameter distributions, may be needed in complex real-world environments.
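
One plausible instantiation of such an adaptive scheme (a sketch under assumptions; the schedule below is hypothetical and not prescribed by the source) is to decay β with the number of observed real transitions so that the fitted distribution can eventually concentrate on ξ*:

def beta_schedule(n_transitions, beta0=1.0):
    # Hypothetical schedule: shrink the entropy weight as the real dataset grows,
    # letting the fitted parameter distribution concentrate on xi*.
    return beta0 / (1.0 + n_transitions) ** 0.5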

7. Summary

Entropy-Regularized DROPO (E-DROPO), as formalized via entropy-augmented maximum-likelihood fitting of simulator parameter distributions, provides a theoretically grounded and efficient approach to domain randomization for sim-to-real reinforcement learning. By explicitly maximizing entropy alongside likelihood, E-DROPO yields data-driven, robust parameter distributions that ensure broad coverage of plausible dynamics while converging to the true system as more data is acquired. Rigorous consistency results and improved sim-to-real bounds highlight its advantages over traditional, uniform randomization, especially in high-dimensional or safety-critical applications (Fickinger et al., 11 Jun 2025).
