
Neural Fidelity Calibration (NFC)

Updated 14 December 2025
  • Neural Fidelity Calibration (NFC) is a framework for adaptive sim-to-real transfer in deep reinforcement learning, integrating online simulator calibration with residual uncertainty handling.
  • It decomposes the sim-to-real gap into calibrated simulator dynamics and residual fidelity, enabling targeted policy fine-tuning and robust anomaly detection in complex robotic environments.
  • Its use of conditional score-based diffusion models and sequential calibration strategies has been validated through high-dimensional experiments, demonstrating significant improvements in real-world robotic performance.

Neural Fidelity Calibration (NFC) is a framework for adaptive and informative sim-to-real transfer in deep reinforcement learning for robotic locomotion and navigation. NFC integrates online calibration of simulator physical coefficients and residual fidelity domains via conditional score-based diffusion models. The key innovation is joint inference of physical parameter correction and residual uncertainty attributable to perception limitations, thereby enabling realistic environment sampling and targeted policy adaptation. NFC is designed to perform online calibration during execution, fine-tune policies under detected anomalous scenarios, and manage transition uncertainty through optimistic exploration. Its efficacy has been validated across high-dimensional robotic platforms and challenging real-world environments, notably improving robustness in sim-to-real adaptation (Yu et al., 11 Apr 2025).

1. Formalism of Residual Fidelity

NFC explicitly decomposes the sim-to-real gap into calibrated simulator dynamics and a residual fidelity term that captures both model misspecification and perception uncertainty. The real-world state transition is modeled as

$$s_{t+1}^{\rm real} = f_{\psi}(s_t, a_t \mid e_t) + g_{\phi}(s_t, a_t \mid e_t) + \epsilon_t$$

where:

  • $f_{\psi}$: black-box simulator dynamics with calibration parameters $\psi$.
  • $g_{\phi}$: residual fidelity shift, parameterized by $\phi$, representing unmodeled dynamics and environmental uncertainty.
  • $\epsilon_t$: zero-mean process noise.
  • $e_t$: onboard perceived environment state.

Residual fidelity parameters $\phi = \{\Delta s, \Delta e\}$ are defined as:

  • Residual dynamics: $\Delta s_t = s_{t+1}^{\rm real} - f_{\psi}(s_t, a_t)$
  • Residual environment: $\Delta e_t = \hat e_t - e_t$, with $\hat e_t$ reconstructed via diffusion.

Thus, the residual fidelity at time $t$ is

$$r_t = (\Delta s_t, \Delta e_t) = \left(s_{t+1}^{\rm real} - f_{\psi}(s_t, a_t),\; \hat e_t - e_t\right)$$

This joint representation allows the diffusion model to sample plausible physical and perceptual scenarios for subsequent calibration and policy improvement.
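As a concrete illustration of this decomposition, the Python sketch below computes $r_t = (\Delta s_t, \Delta e_t)$ for a single real-world transition. The callables `sim_step` and `reconstruct_env` are hypothetical stand-ins for the calibrated simulator $f_{\psi}$ and the diffusion-based environment reconstruction; nothing here is the paper's API.

```python
import numpy as np

def residual_fidelity(s_t, a_t, s_next_real, e_t, sim_step, reconstruct_env):
    """Compute r_t = (Δs_t, Δe_t) for one observed transition.

    sim_step(s, a)      -> next state predicted by calibrated dynamics f_ψ.
    reconstruct_env(e)  -> diffusion-reconstructed environment ê_t.
    Both callables are placeholders for NFC's learned components.
    """
    delta_s = s_next_real - sim_step(s_t, a_t)   # residual dynamics Δs_t
    delta_e = reconstruct_env(e_t) - e_t         # residual environment Δe_t
    return delta_s, delta_e

# Toy usage with a linear "simulator" and a near-identity reconstruction.
r_t = residual_fidelity(
    s_t=np.zeros(3),
    a_t=np.ones(3),
    s_next_real=np.array([0.10, -0.20, 0.05]),
    e_t=np.zeros(4),
    sim_step=lambda s, a: s + 0.1 * a,
    reconstruct_env=lambda e: e + 0.01,
)
```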

2. Conditional Score-Based Diffusion Modeling

NFC uses a conditional score-based diffusion model for inference over the posterior distribution of calibration ($\psi$) and residual ($\phi$) parameters given observed trajectories $\tau$, written $q(\psi, \phi \mid \tau)$. The diffusion mechanism involves:

  • Forward SDE:

$$\mathrm{d}x = f(x,t)\,\mathrm{d}t + g(t)\,\mathrm{d}w_t$$

In discrete form, $q(x_t \mid x_0) = \mathcal{N}(x_t;\, \alpha_t x_0,\, \sigma_t^2 I)$, with context $c$ set by the trajectory and perception.

  • Reverse SDE: The backward-time evolution is governed by

$$\mathrm{d}x = \left[f(x,t) - g(t)^2\, \nabla_{x_t} \log p_t(x_t \mid c)\right]\mathrm{d}t + g(t)\,\mathrm{d}\bar w_t$$

The score function $\nabla_{x_t} \log p_t(x_t \mid c)$ is parameterized by a neural network $s_{\theta}$.

  • Training Objective:

$$\mathcal{L}(\theta) = \mathbb{E}_{t, x_0, \epsilon} \left\| s_{\theta}(x_t, t \mid c) - \nabla_{x_t} \log q(x_t \mid x_0) \right\|^2$$

which is solved via score-matching on synthetic samples.

This design enables NFC to stochastically sample physical and residual domains, adaptively responding to observed sim-to-real discrepancy.
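To make the training objective concrete, here is a minimal denoising score matching sketch in PyTorch. The two-layer MLP, the cosine noise schedule, and conditioning by simple concatenation are illustrative assumptions, not the paper's architecture (which conditions on trajectory and perception context):

```python
import math
import torch
import torch.nn as nn

class ScoreNet(nn.Module):
    """Toy conditional score network s_θ(x_t, t | c) over x = (ψ, φ)."""
    def __init__(self, x_dim, c_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(x_dim + c_dim + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, x_dim),
        )

    def forward(self, x, t, c):
        return self.net(torch.cat([x, t, c], dim=-1))

def dsm_loss(model, x0, c, t_min=1e-3):
    """Denoising score matching: with x_t = α_t x0 + σ_t ε, the target
    score is ∇_{x_t} log q(x_t | x0) = -ε / σ_t."""
    t = torch.rand(x0.shape[0], 1) * (1 - t_min) + t_min
    alpha = torch.cos(0.5 * math.pi * t)   # assumed VP-style schedule
    sigma = torch.sin(0.5 * math.pi * t)
    eps = torch.randn_like(x0)
    xt = alpha * x0 + sigma * eps
    return ((model(xt, t, c) + eps / sigma) ** 2).mean()

# One gradient step on synthetic (x0, c) pairs from simulator rollouts.
model = ScoreNet(x_dim=8, c_dim=16)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x0, c = torch.randn(32, 8), torch.randn(32, 16)
opt.zero_grad()
dsm_loss(model, x0, c).backward()
opt.step()
```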

3. Simulator Coefficient Calibration and Policy Adaptation

For each real trajectory $\tau$, simulator coefficients and residual domains are inferred by sampling

$$(\psi, \phi) \sim q(\psi, \phi \mid \tau)$$

The calibration workflow involves:

  • Sampling: Generating $N$ samples $(\psi_i, \phi_i)$ to produce corresponding environments with updated physical and residual parameters.
  • Loss Function: Discrepancy minimization

$$\mathcal{L}_{\rm fidelity} = \mathbb{E}_i \left\| \tau^{\rm sim}_i - \tau^{\rm real} \right\|^2$$

with Bayesian posterior maximization

$$\psi^*, \phi^* = \arg\max_{\psi, \phi} \frac{1}{N} \sum_{i=1}^N \log q(\psi_i, \phi_i \mid \tau_i^{\rm real})$$

  • Gradient Update: Parameters are refined via score-matching gradients.

This workflow bridges the sim-to-real gap by inferring optimal physics and environmental corrections for policy fine-tuning in the dynamically reconstructed domain.
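A minimal sketch of this calibration loop, under the assumption that `posterior_sample` draws from the learned diffusion posterior and `simulate` rolls out the recalibrated simulator (both hypothetical names):

```python
import numpy as np

def calibrate(posterior_sample, simulate, tau_real, n_samples=64):
    """Draw N candidate (ψ, φ) pairs, roll each out in simulation, and
    score by trajectory discrepancy; returns a MAP-style best candidate
    and the mean fidelity loss over the batch."""
    candidates, losses = [], []
    for _ in range(n_samples):
        psi, phi = posterior_sample(tau_real)        # (ψ_i, φ_i) ~ q(·|τ)
        tau_sim = simulate(psi, phi)                 # rollout in updated sim
        losses.append(np.sum((tau_sim - tau_real) ** 2))
        candidates.append((psi, phi))
    best = int(np.argmin(losses))                    # lowest-discrepancy pair
    return candidates[best], float(np.mean(losses))  # (ψ*, φ*), L_fidelity
```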

4. Sequential Calibration and Online NFC Refinement

NFC employs sequential posterior construction to efficiently utilize accumulated calibration knowledge:

  • Prior proposal: $\tilde p(\psi, \phi) = q_{\rm old}(\psi, \phi \mid \tau)$
  • Posterior refinement:

$$q_{\rm new}(\psi, \phi \mid \tau) \propto q_{\rm true}(\psi, \phi \mid \tau) \cdot \frac{\tilde p(\psi, \phi)}{p(\psi, \phi)}$$

This process requires retraining only on a compact, online dataset, improving computational efficiency and calibration accuracy. The sequential NFC update algorithm is:

| Step | Operation |
|---|---|
| 1. Sampling | $(\psi_i, \phi_i) \sim q_{\rm old}(\psi, \phi \mid \tau_j)$ |
| 2. Simulation | Collect $\tau^{\rm sim}_{ij}$; compute $\ell_{ij} = \lVert \tau^{\rm sim}_{ij} - \tau_j \rVert^2$ |
| 3. Update | Minimize $\mathcal{L}_{\rm fidelity} = \frac{1}{MN} \sum_{j,i} \ell_{ij}$; set $q_{\rm new}$ |

This sequential framework maintains NFC adaptivity throughout deployment.
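The table's three steps map onto a short refinement loop; in this sketch `q_old.sample` and `q_old.refit` are hypothetical interfaces standing in for the score-based posterior and its score-matching update:

```python
def sequential_nfc_update(q_old, simulate, real_trajs, n_samples=32):
    """One round of sequential NFC refinement over M online trajectories."""
    dataset = []
    for tau_j in real_trajs:                     # M real trajectories
        for _ in range(n_samples):               # N posterior samples each
            psi, phi = q_old.sample(tau_j)       # step 1: (ψ_i, φ_i) ~ q_old(·|τ_j)
            tau_sim = simulate(psi, phi)         # step 2: collect τ_ij^sim
            loss_ij = float(((tau_sim - tau_j) ** 2).sum())
            dataset.append((psi, phi, tau_j, loss_ij))
    return q_old.refit(dataset)                  # step 3: minimize L_fidelity -> q_new
```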

5. Anomalous Scenario Detection and Selective Policy Optimization

NFC includes an anomaly-driven fine-tuning protocol:

  • Detection: A causal-TCN encoder ${\cal F}_\theta$ encodes control ($\tau^{\mathfrak c}$) and target ($\tau^{\mathfrak t}$) trajectories. An anomaly is flagged when

$$\cos\left({\cal F}_\theta(\tau^{\mathfrak c}),\, {\cal F}_\theta(\tau^{\mathfrak t})\right) < \delta_{\rm anom}$$

with threshold determined via ROC analysis.

  • Selective Fine-Tuning: Policy adaptation via

$$\pi^* = \arg\max_{\pi}\; \mathbb{E}_{(\psi, \phi) \sim q}\; \mathbb{E}_{s_{t+1} \sim p_{\psi, \phi}} \left[ \sum_{t=0}^T \gamma^t r(s_t, a_t) \right]$$

is triggered only under detected anomalous conditions.

This targeted adaptation avoids unnecessary policy updates, focusing computational resources on informative scenarios.
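A sketch of the detection test, assuming `encoder` is the trained causal-TCN ${\cal F}_\theta$ and using an illustrative threshold (the paper selects $\delta_{\rm anom}$ via ROC analysis):

```python
import torch
import torch.nn.functional as F

def is_anomalous(encoder, tau_control, tau_target, delta_anom=0.9):
    """Flag an anomaly when cos(F_θ(τ^c), F_θ(τ^t)) < δ_anom."""
    with torch.no_grad():
        z_c = encoder(tau_control)   # embedding of control trajectory τ^c
        z_t = encoder(tau_target)    # embedding of target trajectory τ^t
    sim = F.cosine_similarity(z_c, z_t, dim=-1)
    return bool((sim < delta_anom).any())
```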

6. Exploration Under Uncertainty via Hallucination

When the NFC posterior uncertainty ($\lVert \Sigma \rVert$) is large, the framework initiates hallucinated optimistic exploration:

$$(\psi, \phi) \sim \mu(\psi, \phi) + \Sigma(\psi, \phi) \cdot \pi_{\mathfrak h}(s_t)$$

where $\mu$ and $\Sigma$ are the posterior mean and covariance, and $\pi_{\mathfrak h}$ guides optimistic parameter sampling. The joint policy optimization is

$$\pi^*, \pi_{\mathfrak h}^* = \arg\max_{\pi, \pi_{\mathfrak h}} \mathbb{E}_{(\psi, \phi) \sim \rm hallucinated} \left[ \sum_{t=0}^T \gamma^t r(s_t, a_t) \right]$$

This procedure confers greater resilience to epistemic uncertainty in sim-to-real adaptation.
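The optimistic draw can be sketched as below, assuming a diagonal posterior covariance and a hallucination policy whose output is bounded in $[-1, 1]^d$; the names are illustrative, not the paper's:

```python
import numpy as np

def hallucinated_sample(mu, cov_diag, pi_h, s_t):
    """Optimistic parameter draw (ψ, φ) = μ + Σ · π_h(s_t)."""
    eta = np.clip(pi_h(s_t), -1.0, 1.0)   # bounded hallucination "action"
    return mu + cov_diag * eta            # shift within one posterior std

# Toy usage: two calibration parameters, a constant optimistic policy.
theta = hallucinated_sample(
    mu=np.array([1.0, 0.5]),
    cov_diag=np.array([0.2, 0.1]),
    pi_h=lambda s: np.array([0.8, -0.3]),
    s_t=np.zeros(3),
)
```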

7. Evaluation and Empirical Findings

Calibration Accuracy

NFC demonstrates superior calibration precision for sim-to-real transfer: its score-based diffusion formulation (SDE+TCN) attains higher average log-posterior scores than an MDN baseline on Ant, Quadruped, Humanoid, Quadrotor, and Jackal, across both flat and rough terrains.

Anomaly Detection Results

Detection performance achieves TPR = 1.00 and FPR = 0.00 across five robotic systems, indicating high reliability for triggering targeted policy optimization.

Sim-Sim Policy Performance

Table A shows PPO+NFC outperforming PPO w/o ΔFidelity, PPO w/o NFC, and TD-MPC2 in normalized return for all benchmark robots (F = flat terrain, R = rough terrain):

| Robot | PPO w/ NFC | w/o ΔFidelity | w/o NFC | TD-MPC2 |
|---|---|---|---|---|
| Ant (F) | 0.98 | 0.88 | 0.64 | 0.87 |
| Ant (R) | 0.95 | 0.81 | 0.57 | 0.75 |
| Quadruped (F) | 0.67 | 0.33 | 0.17 | 0.42 |
| Quadruped (R) | 0.55 | 0.27 | 0.18 | 0.18 |
| Humanoid (F) | 0.96 | 0.89 | 0.55 | 0.88 |
| Humanoid (R) | 0.93 | 0.90 | 0.58 | 0.88 |
| Quadcopter (F) | 0.82 | 0.75 | 0.67 | 0.72 |
| Quadcopter (R) | 0.81 | 0.69 | 0.33 | 0.67 |
| Jackal (F) | 0.87 | 0.67 | 0.23 | 0.41 |
| Jackal (R) | 0.82 | 0.69 | 0.20 | 0.41 |

Real-World Navigation: ClearPath Jackal

Empirical results for the ClearPath Jackal under flat and rough conditions (broken axle, snow, rocks), summarized in Table B, show that NFC-driven PPO achieves the highest success rates and the lowest orientation and position jerk under challenging conditions:

| Condition | Method | Success (%) | Traj. ratio | Ori. jerk | Pos. jerk |
|---|---|---|---|---|---|
| Flat | PPO w/ NFC | 100 | 1.8 | 2.98 | 4.36 |
| Flat | PPO w/o ΔF | 100 | 1.8 | 3.50 | 9.32 |
| Flat | PPO w/o NFC | 100 | 1.9 | 14.18 | 8.41 |
| Flat | TD-MPC2 | 100 | 1.5 | 6.95 | 5.27 |
| Flat | Falco | 100 | 1.6 | 41.9 | 6.95 |
| Rough | PPO w/ NFC | 72 | 2.1 | 3.19 | 6.26 |
| Rough | PPO w/o ΔF | 53 | 3.2 | 8.49 | 12.46 |
| Rough | PPO w/o NFC | 25 | 3.7 | 26.99 | 12.42 |
| Rough | TD-MPC2 | 42 | 2.7 | 8.98 | 18.46 |
| Rough | Falco | 28 | 3.6 | 41.93 | 8.87 |

Taken together, these findings substantiate NFC’s benefits in sim-to-real policy transfer, adaptive calibration, targeted fine-tuning, and robust operation under uncertainty (Yu et al., 11 Apr 2025).

References

  • Yu et al., Neural Fidelity Calibration, 11 Apr 2025.
