Sim-vs-Real Correlation Coefficient (SRCC)
- SRCC is a metric that quantifies how well simulation rankings predict real-world performance using Pearson or Spearman correlations, evaluating metrics such as success rate and SPL.
- It is employed in embodied AI and robotics to assess sim-to-real transfer by comparing performance across varied models, environments, and noise configurations.
- High SRCC values validate simulation fidelity and inform simulator tuning, while low SRCC values highlight potential exploits or discrepancies in modeling.
The Sim-vs-Real Correlation Coefficient (SRCC) is a quantitative metric for assessing the extent to which comparative performance of models or policies in simulation is predictive of their comparative performance after transfer to a real-world robotic platform. SRCC is central to frameworks for evaluating and tuning the sim-to-real (sim2real) gap in embodied AI, robotics, and navigation. It operationalizes the question: if one method is superior to another in simulation, how reliably does that superiority persist in real-world deployment?
1. Mathematical Definition
SRCC is most commonly instantiated as the empirical Pearson or Spearman correlation between performance metrics measured in simulation and the same metrics measured under corresponding real-world conditions. For $N$ navigation models (or evaluation conditions), let $s_i$ denote the performance of model $i$ in simulation and $r_i$ its counterpart in reality. Define vectors

$$\mathbf{s} = (s_1, \ldots, s_N), \qquad \mathbf{r} = (r_1, \ldots, r_N).$$

Then the SRCC is computed as the sample (Pearson) correlation:

$$\mathrm{SRCC} = \frac{\sum_{i=1}^{N} (s_i - \bar{s})(r_i - \bar{r})}{\sqrt{\sum_{i=1}^{N} (s_i - \bar{s})^2}\,\sqrt{\sum_{i=1}^{N} (r_i - \bar{r})^2}},$$

where $\bar{s}$ and $\bar{r}$ are the means of the respective vectors (Kadian et al., 2019).
Alternatively, some studies employ the Spearman rank-correlation version:

$$\mathrm{SRCC} = 1 - \frac{6 \sum_{i=1}^{N} d_i^2}{N(N^2 - 1)},$$

where $d_i$ is the difference in ranks between $s_i$ and $r_i$ (Chhablani et al., 22 Sep 2025). By construction, SRCC lies in $[-1, 1]$: an SRCC near $+1$ implies that the simulation ranking is predictive of the real-world ranking; an SRCC near $0$ implies no relationship; an SRCC near $-1$ would mean that simulation rankings are inverted in reality.
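As a minimal sketch, both variants can be computed directly from the paired performance vectors with SciPy; the values below are illustrative placeholders, not results from the cited studies.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Paired performance vectors: entry i is the same model (or evaluation condition)
# scored in simulation (s) and on the real robot (r). Placeholder values only.
s = np.array([0.81, 0.74, 0.62, 0.55, 0.43])  # e.g., SPL measured in simulation
r = np.array([0.70, 0.69, 0.48, 0.52, 0.31])  # e.g., SPL measured on the robot

pearson_srcc, _ = pearsonr(s, r)    # linear agreement of raw scores
spearman_srcc, _ = spearmanr(s, r)  # agreement of rank orderings

print(f"Pearson SRCC:  {pearson_srcc:.3f}")
print(f"Spearman SRCC: {spearman_srcc:.3f}")
```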
2. Construction of Performance Vectors
The selection of $s_i$ and $r_i$ is governed by the task and evaluation protocol:
- Performance Metric: In navigation, standard metrics include Success Rate (fraction of episodes in which the agent stops within a preset tolerance of the goal) and SPL (Success weighted by Path Length).
- Aggregation Over Episodes: For each model, performance is averaged over a grid of test episodes, typically covering:
  - Multiple environment layouts (e.g., obstacle configurations of varying difficulty)
  - Multiple start-goal pairs per layout
  - Several stochastic seeds or repeats per pair
- Simulation-Real Pairing: Each $s_i$ arises from simulation trials under a given model, and $r_i$ from its real-world counterpart under matched evaluation conditions.
For example, in the PointGoal navigation case (Kadian et al., 2019), each model is evaluated over 45 simulation episodes and 45 real robot episodes across three rooms, five start-goal pairs, and three repeats.
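A hedged sketch of this aggregation, assuming a hypothetical `evaluate_episode(model, episode, backend)` callable that runs one matched episode in either backend and returns the chosen metric (the function and argument names are illustrative, not an actual HaPy API):

```python
from itertools import product
from statistics import mean

def performance_vectors(models, rooms, start_goal_pairs, repeats, evaluate_episode):
    """Aggregate per-episode metrics into paired sim/real performance vectors.

    `evaluate_episode(model, episode, backend)` is a hypothetical callable that
    runs one episode under the given backend ("sim" or "real") and returns a
    scalar metric such as SPL or success.
    """
    # Episode grid: rooms x start-goal pairs x repeats (e.g., 3 * 5 * 3 = 45).
    episodes = list(product(rooms, start_goal_pairs, range(repeats)))
    s, r = [], []
    for model in models:
        s.append(mean(evaluate_episode(model, ep, backend="sim") for ep in episodes))
        r.append(mean(evaluate_episode(model, ep, backend="real") for ep in episodes))
    return s, r  # one (s_i, r_i) pair per model, ready for SRCC computation
```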
3. Experimental Methodologies
The design of experiments underlying SRCC computation aims to ensure strict alignment between simulation and reality:
- Unified Code Execution: Platforms such as the Habitat–PyRobot Bridge (“HaPy”) enable identical agent code, sensors, action-space APIs, and configurations to be executed in both simulated and physical environments by toggling a backend parameter, ensuring parity in evaluation (see the sketch after this list).
- Environment Replication: Real-world environments are densely 3D scanned (e.g., via Matterport Pro2 or consumer mobile devices for Gaussian Splatting or Polycam) to produce simulation meshes that closely match the physics and semantics of deployment spaces (Kadian et al., 2019, Chhablani et al., 22 Sep 2025).
- Variation Across Models: Diverse agent architectures, sensor modalities, and training-time settings (e.g., actuation noise, collision handling) are tested to populate the evaluation set.
- Data Points: In some studies, data points correspond to different navigation models (Kadian et al., 2019); in others, to policy/mesh type combinations (Chhablani et al., 22 Sep 2025).
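To make the unified-execution idea referenced above concrete, the following self-contained toy sketch runs the same policy and evaluation loop against two interchangeable backend objects; `make_backend`, the toy dynamics, and the success radius are illustrative stand-ins, not the actual Habitat–PyRobot bridge.

```python
import random

def make_backend(kind: str):
    """Hypothetical factory: "sim" would wrap Habitat-Sim, "real" would wrap PyRobot.
    Here both return the same toy stand-in so the sketch runs end to end."""
    class ToyBackend:
        def reset(self):
            self.dist = random.uniform(1.0, 5.0)  # distance to goal in metres
            return {"pointgoal_dist": self.dist}

        def step(self, action):
            if action == "forward":
                self.dist = max(0.0, self.dist - 0.25)
            done = (action == "stop")
            success = float(done and self.dist < 0.36)  # illustrative success radius
            return {"pointgoal_dist": self.dist}, done, {"success": success}
    return ToyBackend()

def evaluate(policy, backend, max_steps=100):
    """Identical evaluation loop for both deployments; only the backend differs."""
    obs, info = backend.reset(), {}
    for _ in range(max_steps):
        obs, done, info = backend.step(policy(obs))
        if done:
            break
    return info

def policy(obs):
    return "stop" if obs["pointgoal_dist"] < 0.3 else "forward"

for kind in ("sim", "real"):
    print(kind, evaluate(policy, make_backend(kind)))
```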
4. Empirical SRCC Results and Interpretation
SRCC behaves as a direct indicator of sim2real predictivity. Empirical values vary with environment fidelity, agent modeling, and simulator configuration.
| Scenario | SRCC Value | Notes |
|---|---|---|
| Habitat-Sim default (SLIDING ON, noise 0.0), SPL | 0.603 | Many rank-reversals, moderate predictivity |
| Habitat-Sim default, Success | 0.18 | Low predictivity, poor sim-real transfer |
| Simulator tuned (SLIDING OFF, noise 0.0), SPL | 0.875 | High predictivity after exploit removal |
| Simulator tuned, Success | 0.844 | Strong sim-real agreement post-tuning |
| EmbodiedSplat - DN-Splatter mesh | 0.87 | Spearman correlation across six policies |
| EmbodiedSplat - Polycam mesh | 0.97 | Spearman correlation; nearly perfect rank preservation |
| EmbodiedSplat - Combined | ~0.93 | Pooling both mesh types and policies |
Low SRCC reveals failure modes in simulation—such as agents learning to exploit simulator-specific artifacts (collision “sliding,” unrealistic friction, or unphysical path accounting)—that do not carry over to real-world hardware. After simulator tuning (e.g., disabling sliding, matching actuation noise), SRCC values increase dramatically, indicating restoration of sim2real predictivity (Kadian et al., 2019). Mobile-device mesh pipelines (e.g., Polycam, DN-Splatter) yield SRCC in the range 0.87–0.97, substantiating that such lightweight reconstruction techniques can support reliable sim2real policy fine-tuning (Chhablani et al., 22 Sep 2025).
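The effect of such exploits on SRCC can be illustrated with synthetic data: in the sketch below, the toy models that perform worst on the real robot gain the most from a sliding-style exploit in simulation, which visibly degrades the correlation relative to a faithful simulator. The numbers are synthetic and do not reproduce the published results.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
real = rng.uniform(0.3, 0.9, size=8)                                 # real-robot SPL, 8 toy models
sim_faithful = np.clip(real + rng.normal(0.0, 0.03, size=8), 0, 1)   # tuned sim: small noise only

# Simulated exploit: the weakest real-world performers "slide" along obstacles
# in simulation, inflating their simulated SPL with no real-world benefit.
sim_exploited = sim_faithful.copy()
sim_exploited[np.argsort(real)[:4]] += 0.25
sim_exploited = np.clip(sim_exploited, 0, 1)

print("SRCC (tuned simulator):       %.3f" % pearsonr(sim_faithful, real)[0])
print("SRCC (exploitable simulator): %.3f" % pearsonr(sim_exploited, real)[0])
```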
5. Methodological Differentiation and Generalization
The methodology for evaluating SRCC varies in correlation metric (Pearson versus Spearman), scope of models or policy variants compared, and the granularity of real-sim matches:
- Pearson SRCC is sensitive to linear correspondence of scores; Spearman SRCC focuses on rank-order agreement.
- Data points may represent (a) separate models trained using different sensor suites and noise levels (Kadian et al., 2019), or (b) combinations of pretraining source, fine-tuning environment, and mesh type (Chhablani et al., 22 Sep 2025), depending on the experimental question.
- Episode Matching: For robust estimation, both simulation and real-world tests are conducted over prescribed episode sets, with performance aggregated across stochastic runs.
No formal statistical significance analysis (p-values or confidence intervals) accompanies SRCC reporting in these studies; the reported values are empirical estimates based on the available number of models or conditions.
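When an uncertainty estimate is nonetheless desired, one option, not part of the cited protocols, is a simple percentile bootstrap over the paired sim/real points, sketched below under that assumption.

```python
import numpy as np
from scipy.stats import spearmanr

def bootstrap_srcc_ci(sim, real, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for Spearman SRCC.

    Resamples the paired (sim, real) points with replacement. Not part of the
    cited evaluation protocols; shown only as one way to quantify uncertainty
    when the number of models or conditions is small.
    """
    sim, real = np.asarray(sim, dtype=float), np.asarray(real, dtype=float)
    rng = np.random.default_rng(seed)
    n = len(sim)
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample model/condition indices
        stats.append(spearmanr(sim[idx], real[idx])[0])
    return tuple(np.nanpercentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)]))
```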
6. Limitations, Common Pitfalls, and Best Practices
Pitfalls impacting SRCC:
- Simulator-specific exploits (collision “sliding,” unmodeled contact physics) that models can use to artificially boost simulated scores, depressing SRCC and making simulation rankings misleading.
- Lack of realistic noise modeling (actuation, sensor) or unmodeled real-world variability, leading to rank inversions or decoupled performance.
- Insufficiently accurate environment reconstructions, introducing semantic or geometric discrepancies between sim and real testbeds.
Best practices include:
- Running models through a parallel sim/real testbed and computing SRCC before relying on simulation-only leaderboards.
- Disabling all simulation features unavailable on the real system.
- Injecting realistic noise (actuation/sensor), and reconstructing deployment environments via modern 3D scanning.
- Treating SRCC as an objective and tuning simulator parameters to maximize it before inferring generalization from sim results (Kadian et al., 2019).
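A schematic of the last practice, assuming a hypothetical `evaluate_srcc(config)` function that re-runs the paired sim/real testbed under a candidate simulator configuration and returns the resulting SRCC; the parameter names mirror the sliding and actuation-noise settings discussed above but are otherwise illustrative.

```python
from itertools import product

def tune_simulator(evaluate_srcc,
                   sliding_options=(False, True),
                   actuation_noise_options=(0.0, 0.05, 0.1)):
    """Grid-search simulator settings to maximize SRCC on a held-out sim/real testbed.

    `evaluate_srcc(config)` is a hypothetical callable: it re-evaluates all models
    in simulation under `config`, pairs the scores with the fixed real-robot
    results, and returns the correlation.
    """
    best_config, best_srcc = None, float("-inf")
    for sliding, noise in product(sliding_options, actuation_noise_options):
        config = {"sliding": sliding, "actuation_noise": noise}
        srcc = evaluate_srcc(config)
        if srcc > best_srcc:
            best_config, best_srcc = config, srcc
    return best_config, best_srcc
```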
7. Significance and Implications for Embodied AI
A high SRCC validates the use of simulation as a predictive tool for model selection and development: improvements in simulation indicate real-world performance gains, reducing experimental burden. In the context of lightweight scene acquisition (e.g., via consumer mobile devices and Gaussian Splatting), a strong SRCC supports rapid personalization and deployment of navigation policies with minimal real-world iteration (Chhablani et al., 22 Sep 2025). Conversely, a low or unstable SRCC signals a need to re-examine simulator fidelity, modeling assumptions, or transfer protocol before drawing conclusions from simulation-based experiments.
SRCC thus serves as an essential metric for calibration, benchmarking, and confidence estimation in sim2real research pipelines, guiding both the design of simulation environments and evaluation strategies.