Sim-to-Real Visual Policies

Updated 5 June 2026

Recent work demonstrates that ensemble GAN-based image translation can bridge the sim-to-real gap, with hallucination artifacts reported as low as about 6.5% for the best-performing translation model.
The methodology integrates pixel-level domain adaptation, deep RL pipelines (e.g., PPO in a POMDP setup), and modular perception-control architectures to overcome visual and dynamics mismatches.
Reported outcomes include real-world task success rates of 50% in tissue retraction and 80% in grasping, achieved while reducing reliance on costly real-world data collection.

Sim-to-real transfer with visual policies refers to the process of learning visuomotor policies in simulation and successfully deploying them on real-world robotic systems, overcoming the substantial domain gap between synthetic and real visual observations. This paradigm is critical for robotics domains, such as manipulation and surgery, where real-world data collection is costly or impractical and simulation offers unlimited interaction at low risk. Bridging the “sim-to-real gap” requires techniques that address visual discrepancies, dynamics mismatches, and robustness to domain and distributional shifts.

1. Pixel-Level Visual Domain Adaptation

Pixel-level domain adaptation targets the core obstacle in vision-based sim-to-real: the severe distribution mismatch between simulated and real images. Among leading methodologies, unpaired image-to-image translation using generative adversarial networks (GANs)—notably CycleGAN, CUT, DCL, and specialized models like StyleID-CycleGAN (SICGAN)—has become foundational.

A representative approach employs a generator $G : O_s \rightarrow O_r$ to map simulated observations to realistic imagery. Architectures typically involve encoder-decoder backbones with skip connections and residual blocks. Patch-level discriminators (PatchGAN) operate on local image patches, improving high-frequency realism. Loss functions are hybrid, combining adversarial losses, patchwise contrastive loss ( $L_{\text{NCE}}$ or PatchNCE), and identity losses to stabilize the mapping and avoid degenerate solutions.
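The patchwise contrastive term can be illustrated with a minimal sketch. It assumes one query feature taken from a patch of the translated image, one positive (the feature at the same location in the input image), and a set of negatives (features from other locations). The function names, the temperature of 0.07, and the loss weights are illustrative assumptions and are not taken from the cited works.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def l2_normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def patch_nce_loss(query, positive, negatives, temperature=0.07):
    """InfoNCE loss for one query patch: cross-entropy with the positive as the target."""
    q = l2_normalize(query)
    logits = [dot(q, l2_normalize(positive)) / temperature]
    logits += [dot(q, l2_normalize(n)) / temperature for n in negatives]
    # Log-sum-exp with max subtraction for numerical stability.
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[0]

def hybrid_generator_loss(adversarial, nce, identity, w_nce=1.0, w_idt=1.0):
    """Weighted sum of the three loss terms; the weights are hypothetical."""
    return adversarial + w_nce * nce + w_idt * identity
```

The loss is small when the query is closest to its positive and grows when a negative is a better match, which is what pushes the generator to preserve local content while changing appearance.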

For robust deployment, ensemble techniques aggregate multiple GANs, introducing stochasticity akin to visual domain randomization. Scheikl et al. (2024) found that an ensemble of seven DCL generators yields real-like frames with few hallucination artifacts (~6.5% for DCL; higher for CUT and vanilla CycleGAN), substantially reducing the artifacts that induce policy failures. The ensemble is frozen for downstream policy training.
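One way to picture the ensemble is as a wrapper that holds several frozen translators and applies one chosen at random. The sketch below is an assumption-laden illustration: the class name, the per-episode resampling schedule, and the toy gain/bias "translators" are hypothetical stand-ins for trained generators, and the source does not specify how members are selected.

```python
import random
from typing import Callable, List, Sequence

Image = List[List[float]]  # toy grayscale image; real frames would be RGB arrays
Translator = Callable[[Image], Image]

class TranslatorEnsemble:
    """Holds frozen sim-to-real translators and applies one chosen at random."""

    def __init__(self, translators: Sequence[Translator], seed: int = 0):
        if not translators:
            raise ValueError("ensemble needs at least one translator")
        self._translators = list(translators)
        self._rng = random.Random(seed)
        self._active = self._translators[0]

    def __len__(self) -> int:
        return len(self._translators)

    def resample(self) -> None:
        """Pick a new active translator, e.g. at the start of each episode."""
        self._active = self._rng.choice(self._translators)

    def __call__(self, sim_image: Image) -> Image:
        return self._active(sim_image)

def make_toy_translator(gain: float, bias: float) -> Translator:
    # Stand-in for a trained generator: a simple per-pixel affine change.
    def translate(img: Image) -> Image:
        return [[gain * p + bias for p in row] for row in img]
    return translate

# Seven members, mirroring the ensemble size mentioned above.
ensemble = TranslatorEnsemble(
    [make_toy_translator(1.0 + 0.05 * k, 0.01 * k) for k in range(7)]
)
```

Because the policy sees a slightly different rendering each time the active member changes, it is discouraged from latching onto the quirks of any single generator.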

RetinaGAN extends basic image translation by enforcing object-detection consistency using a frozen detector. The resulting perception consistency loss, which applies a focal loss to bounding-box regression and class scores, ensures that visual translation does not destroy object-level semantics, a crucial property for manipulation and pose-driven policies (Ho et al., 2020).

SICGAN further refines CycleGAN by incorporating demodulated convolutions (drawing from StyleGAN2), reducing batch norm–induced artifacts, and utilizing identity losses for color and semantics preservation, yielding robust zero-shot transfer on robotic manipulators (Güitta-López et al., 23 Jan 2026).

2. Reinforcement Learning Pipelines for Visual Policies

Deep RL algorithms, primarily Proximal Policy Optimization (PPO), underlie the training of visual policies. The entire data pipeline capitalizes on pixel-level or low-level observations generated by the translation networks.

The RL component is typically structured as a Partially Observable Markov Decision Process (POMDP), with agents receiving image stacks and producing velocity or position deltas as continuous actions. Network architectures are split into a perception module (a convolutional front-end) followed by policy and value heads (multi-layer perceptrons, or LSTMs for temporal tasks). Sim-to-real policies may see only the last few frames, downsampled to reduce overfitting and computational cost; for example, four stacked $256 \times 256$ RGB images are each reduced to $84 \times 84$ and concatenated along the channel axis into an $84 \times 84 \times 12$ tensor (Scheikl et al., 2024).
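The frame-stacking step can be sketched with the standard library alone. The example assumes nearest-neighbour downsampling and channel-wise concatenation with the oldest frame first; the actual resizing filter and ordering used in the cited work are not stated here, and the class and function names are hypothetical.

```python
from collections import deque

def downsample_nearest(frame, out_h, out_w):
    """Nearest-neighbour resize of an H x W x C nested-list image."""
    in_h, in_w = len(frame), len(frame[0])
    return [
        [frame[(i * in_h) // out_h][(j * in_w) // out_w] for j in range(out_w)]
        for i in range(out_h)
    ]

class FrameStack:
    """Keeps the last `num_frames` downsampled frames and concatenates their channels."""

    def __init__(self, num_frames=4, out_h=84, out_w=84):
        self.num_frames, self.out_h, self.out_w = num_frames, out_h, out_w
        self.frames = deque(maxlen=num_frames)

    def push(self, frame):
        small = downsample_nearest(frame, self.out_h, self.out_w)
        if not self.frames:
            # On the first frame, fill the buffer so the output shape is fixed from step 0.
            self.frames.extend([small] * self.num_frames)
        else:
            self.frames.append(small)
        return self.observation()

    def observation(self):
        # Concatenate along the channel axis, oldest frame first.
        return [
            [sum((f[i][j] for f in self.frames), []) for j in range(self.out_w)]
            for i in range(self.out_h)
        ]

def shape(obs):
    return (len(obs), len(obs[0]), len(obs[0][0]))
```

With four RGB frames the channel count is 4 × 3 = 12, which is where the third dimension of the observation tensor comes from.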

Reward functions are task-specific but penalize unsafe actions (e.g., excessive tissue stress in surgical manipulation (Scheikl et al., 2024)), workspace violations, and collisions, often following a shaped curriculum to teach safe exploration early on.
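A shaped reward of this kind might look like the sketch below. Every field name, unit, threshold, and weight is a hypothetical placeholder chosen for illustration; the cited works define their own task-specific terms.

```python
from dataclasses import dataclass

@dataclass
class StepInfo:
    progress: float        # task progress made this step (hypothetical scale)
    tissue_stress: float   # simulated stress estimate (hypothetical units)
    in_workspace: bool     # whether the end effector stayed inside the workspace
    collided: bool         # whether a collision occurred this step

def shaped_reward(info, stress_limit=1.0, w_stress=0.5,
                  workspace_penalty=1.0, collision_penalty=2.0):
    """Progress reward minus penalties for unsafe behaviour."""
    reward = info.progress
    # Penalize only the stress in excess of the allowed limit.
    reward -= w_stress * max(0.0, info.tissue_stress - stress_limit)
    if not info.in_workspace:
        reward -= workspace_penalty
    if info.collided:
        reward -= collision_penalty
    return reward
```

A curriculum could then be expressed by scheduling the penalty weights over training, although the schedule itself is not specified in the text above.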

Value and policy losses are computed at each RL update, and entropy bonuses are standard for exploration.
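For reference, the standard PPO objective combines a clipped policy surrogate, a value regression term, and an entropy bonus. The sketch below operates on plain Python lists for a single minibatch; the coefficient values are common defaults used here as assumptions, not settings reported by the cited papers.

```python
import math

def ppo_loss(new_logp, old_logp, advantages, values, returns, entropies,
             clip_eps=0.2, value_coef=0.5, entropy_coef=0.01):
    """Scalar PPO loss to minimize: clipped policy loss + value loss - entropy bonus."""
    n = len(advantages)
    policy_term = 0.0
    for lp_new, lp_old, adv in zip(new_logp, old_logp, advantages):
        ratio = math.exp(lp_new - lp_old)  # probability ratio pi_new / pi_old
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        # Pessimistic (minimum) of the unclipped and clipped surrogates.
        policy_term += min(ratio * adv, clipped * adv)
    policy_loss = -policy_term / n
    value_loss = sum((v - r) ** 2 for v, r in zip(values, returns)) / n
    entropy = sum(entropies) / n
    return policy_loss + value_coef * value_loss - entropy_coef * entropy
```

The clipping removes the incentive to move the policy far from the one that collected the data, while the entropy term rewards keeping the action distribution broad.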

3. Training, Domain Randomization, and Transfer Protocols

Effective sim-to-real training pipelines combine image translation, RL, and domain randomization. Randomizing simulator parameters—textures, lighting, object color, camera intrinsics/extrinsics—extends the distribution of simulated observations to cover real-world conditions, shown to be critical for robust sim-to-real transfer in visually diverse manipulation tasks (Garcia et al., 2023). Optimal domain randomization parameters may be discovered using offline proxy tasks, such as cube localization, which correlate closely with downstream policy success.
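Visual domain randomization amounts to drawing a fresh set of rendering parameters for each episode. The following sketch shows one possible sampler; the parameter names and every numeric range are hypothetical and would need to be tuned, for example with a proxy task as described above.

```python
import random
from dataclasses import dataclass

@dataclass(frozen=True)
class VisualRandomization:
    texture_id: int
    light_intensity: float
    object_rgb: tuple
    focal_length_px: float    # camera intrinsics perturbation
    camera_offset_m: tuple    # camera extrinsics perturbation (x, y, z)

def sample_randomization(rng, num_textures=50):
    """Draw one set of visual parameters; all ranges are illustrative assumptions."""
    return VisualRandomization(
        texture_id=rng.randrange(num_textures),
        light_intensity=rng.uniform(0.5, 1.5),
        object_rgb=tuple(rng.random() for _ in range(3)),
        focal_length_px=rng.uniform(580.0, 620.0),
        camera_offset_m=tuple(rng.uniform(-0.02, 0.02) for _ in range(3)),
    )

# One parameter set per episode, reproducible through the seeded generator.
rng = random.Random(0)
episode_params = [sample_randomization(rng) for _ in range(5)]
```

Seeding the generator keeps experiments reproducible, which matters when comparing randomization ranges against each other.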

Policy training requires large-scale rollouts (e.g., 10 million steps for tissue retraction (Scheikl et al., 2024), or 1 million on-policy episodes for grasping (Ho et al., 2020)), usually with multiple parallel environment instances to maximize sample throughput.

During deployment, policies trained on domain-translated or randomized simulation frames operate directly on real sensor streams with no further fine-tuning. Performance metrics include strict task success (full completion), number of collisions, path length, timeouts, and partial successes (e.g., able to grasp but not complete full retraction).
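These metrics can be aggregated from per-episode logs. The sketch below assumes a simple record format; the field names and the example episodes are hypothetical and do not reproduce data from any cited experiment.

```python
from dataclasses import dataclass

@dataclass
class Episode:
    completed: bool        # strict success: the full task was finished
    grasped: bool          # partial progress: the object was grasped
    collisions: int
    path_length_m: float
    timed_out: bool

def summarize(episodes):
    """Aggregate real-world evaluation episodes into summary metrics."""
    n = len(episodes)
    if n == 0:
        raise ValueError("no episodes to summarize")
    return {
        "success_rate": sum(e.completed for e in episodes) / n,
        # Partial success: grasped the object but did not complete the task.
        "partial_rate": sum(e.grasped and not e.completed for e in episodes) / n,
        "timeout_rate": sum(e.timed_out for e in episodes) / n,
        "mean_collisions": sum(e.collisions for e in episodes) / n,
        "mean_path_length_m": sum(e.path_length_m for e in episodes) / n,
    }

# Made-up episodes for illustration only.
episodes = [
    Episode(True, True, 0, 0.30, False),
    Episode(True, True, 1, 0.40, False),
    Episode(False, True, 0, 0.50, False),
    Episode(False, False, 3, 0.80, True),
]
stats = summarize(episodes)
```

Reporting partial successes separately from strict successes makes it easier to see whether failures come from perception, grasping, or the later stages of the task.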

Sample empirical results:

On robotic tissue retraction, a visual PPO policy, trained entirely in sim with GAN-translated images and no further real-world retraining, achieved a 50% real-world success rate (Scheikl et al., 2024).
RetinaGAN-augmented RL and IL pipelines provided 80% grasp success (versus 19% for vanilla RL) and achieved over 90% success in object pushing and ensemble transfer settings (Ho et al., 2020).

4. Modular and Decoupled Architectures

A key structural advance is decoupling perception and control: the control policy is optimized with RL on privileged state in simulation, and a “visual bridge” is then trained in the real world from a small number of demonstrations (10–20 real demonstrations suffice to outperform fully end-to-end approaches) (Huang et al., 30 Sep 2025). The perception module often leverages powerful vision transformer backbones pretrained on large-scale data (e.g., DINOv2) with a minimal parameter count, mapping real observations into the predefined control state space. This modular design brings two advantages:

Reduces the data cost for adapting to new domains: supervised regression of the perception bridge is far cheaper than end-to-end RL on real robots.
Enables natural generalization to spatially out-of-distribution scenarios, as the control policy is trained under broad physical and geometric randomization.
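The decoupled design can be summarized as a composition of two independently trained functions. In the sketch below, `toy_perception` and `toy_controller` are hypothetical stand-ins: the first mimics a learned regressor from images to state, the second a policy trained on privileged state in simulation.

```python
from typing import Callable, List

Image = List[List[float]]
State = List[float]
Action = List[float]

class ModularVisuomotorPolicy:
    """Composes a perception bridge (image -> state) with a state-based controller."""

    def __init__(self, perception: Callable[[Image], State],
                 controller: Callable[[State], Action]):
        self.perception = perception
        self.controller = controller

    def act(self, observation: Image) -> Action:
        state = self.perception(observation)
        return self.controller(state)

def toy_perception(image: Image) -> State:
    """Estimate the (x, y) position of the brightest pixel; stands in for a learned regressor."""
    _, x, y = max((v, x, y) for y, row in enumerate(image) for x, v in enumerate(row))
    return [float(x), float(y)]

def toy_controller(state: State, target=(0.0, 0.0), gain=0.5) -> Action:
    """Proportional step toward a target; stands in for a policy trained in simulation."""
    return [gain * (t - s) for s, t in zip(state, target)]

policy = ModularVisuomotorPolicy(toy_perception, toy_controller)
action = policy.act([[0, 0, 0], [0, 0, 1], [0, 0, 0]])
```

Adapting to a new visual domain then means replacing or retraining only the perception function, while the controller is reused unchanged.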

Similar modularization appears in earlier modular deep Q-network (MDQN) pipelines (Zhang et al., 2016) and in adversarial discriminative sim-to-real transfer, where perception and control modules are separately aligned and then fine-tuned, yielding accuracy on real hardware close to that of state-based control (Zhang et al., 2017).

5. Extension to Other Modalities and Real-to-Sim-to-Real Pipelines

Pixel-level visual transfer frameworks generalize to modalities beyond RGB, including dense depth (e.g., drone navigation using a Variational Autoencoder with adversarial domain alignment to map real stereo depth into simulated latent space (Yu et al., 18 May 2025)) and tactile images (real-to-sim GANs for optical tactile arrays, enabling direct policy transfer for edge following, surface tracking, and manipulation (Church et al., 2021)).

Increasingly, real-to-sim-to-real pipelines use photogrammetric scene reconstruction (3D Gaussian Splatting, NeRFs) to generate digital twins, enabling photorealistic simulation with egocentric observations for navigation, legged locomotion, and complex manipulation (Zhu et al., 3 Feb 2025, Dan et al., 11 May 2025). RL policies trained in these environments transfer using the same protocols, provided that the cameras are strictly calibrated and the reconstructed scene is carefully aligned with the real one.

Diffusion-based data generation, such as RoboTransfer, augments visual domains to include novel object-background combinations, closing the domain gap in out-of-distribution settings and substantially improving policy generalization (e.g., 251% gain on fully out-of-domain scenarios relative to training on real data only (Liu et al., 29 May 2025)).
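As a worked reading of that figure, assume the gain is a relative improvement in success rate, and write $s_{\text{real}}$ and $s_{\text{aug}}$ (symbols introduced here for illustration) for the success rates obtained with real data only and with the generated data added:

$$
\text{gain} = \frac{s_{\text{aug}} - s_{\text{real}}}{s_{\text{real}}} \times 100\%,
\qquad
\text{gain} = 251\% \;\Longrightarrow\; s_{\text{aug}} = 3.51\, s_{\text{real}}.
$$

Under this assumption, a 251% gain corresponds to roughly three and a half times the baseline success rate, not an absolute increase of 251 percentage points.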

6. Limitations, Failure Modes, and Future Directions

Visual sim-to-real transfer pipelines universally face several limitations:

Heavy reliance on fixed or calibrated camera viewpoints, as image translation networks are often not robust to unseen camera poses (Scheikl et al., 2024, Güitta-López et al., 23 Jan 2026).
Complex dynamics, contact, and sensor noise often cannot be simulated realistically, and sim-to-real policies can be brittle to such unmodeled discrepancies (Scheikl et al., 2024).
GAN-based and diffusion models may hallucinate artifacts, especially outside the training domain; ensemble methods and additional object-/segmentation-aware losses partially alleviate this.

Assumptions often include the availability of large unpaired datasets for image translation, privileged state information for RL training in simulation, or high-fidelity object detectors for perception consistency. For deformable object manipulation tasks, accurate FEM-based simulation modeling is essential but rarely complete.

Robustness to physical and visual distributional shifts in the real world remains a guiding principle. Viewpoint-invariant translation, semi-supervised training, multimodal perception, and self-supervised sim camera learning are key research directions for improving policy robustness and autonomy.

References:

Scheikl et al., 2024. "Sim-To-Real Transfer for Visual Reinforcement Learning of Deformable Object Manipulation for Robot-Assisted Surgery".
Ho et al., 2020. "RetinaGAN: An Object-aware Approach to Sim-to-Real Transfer".
Huang et al., 30 Sep 2025. "Best of Sim and Real: Decoupled Visuomotor Manipulation via Learning Control in Simulation and Perception in Real".
Güitta-López et al., 23 Jan 2026. "Sim-to-Real Transfer via a Style-Identified Cycle Consistent Generative Adversarial Network".
Garcia et al., 2023. "Robust Visual Sim-to-Real Transfer for Robotic Manipulation".
Yu et al., 18 May 2025. "Depth Transfer: Learning to See Like a Simulator for Real-World Drone Navigation".
Church et al., 2021. "Tactile Sim-to-Real Policy Transfer via Real-to-Sim Image Translation".
Coholich et al., 14 Jan 2026. "Sim2real Image Translation Enables Viewpoint-Robust Policies from Fixed-Camera Datasets".
Liu et al., 29 May 2025. "RoboTransfer: Geometry-Consistent Video Diffusion for Robotic Visual Policy Transfer".
Bharadhwaj et al., 2018. "A Data-Efficient Framework for Training and Sim-to-Real Transfer of Navigation Policies".
Zhang et al., 2016. "Modular Deep Q Networks for Sim-to-real Transfer of Visuo-motor Policies".
Zhang et al., 2017. "Adversarial Discriminative Sim-to-real Transfer of Visuo-motor Policies".