- The paper introduces SoMA, a real-to-sim neural simulator that integrates robot action conditioning with a hierarchical Gaussian splat representation.
- It employs force-driven dynamics and multi-resolution supervision to enhance simulation stability and generalization in soft-body manipulation tasks.
- Empirical results show at least a 20% improvement in RGB and depth accuracy over baselines, supporting the framework’s advantage in complex scenarios.
SoMA: A Real-to-Sim Neural Simulator for Robotic Soft-body Manipulation
Overview
The SoMA framework introduces a neural simulation paradigm for real-to-sim (R2S) soft-body robotic manipulation, directly modeling manipulation dynamics through an action-conditioned, Gaussian Splat (GS)-based representation. SoMA targets three limitations of existing simulation approaches: reliance on simplified or predefined physical models, lack of capacity for kinematically consistent interaction, and weak generalization to unseen manipulations. Through an explicit, hierarchical, force-driven design, SoMA establishes a unified simulation solution for complex deformable object manipulation, integrating real-world robot actions and environment forces end-to-end in a learned latent space.
Key Technical Contributions
Unified Action-conditioned Neural Simulation
SoMA operates by mapping multi-view RGB observations and synchronized robot joint signals from the real world into a unified simulation space. The object under manipulation is reconstructed using advanced 3D Gaussian Splatting methods, resulting in a hierarchical graph of splats encoding spatial, material, and action-conditioned information. This representation enables both visually accurate rendering and physically meaningful dynamics propagation, accommodating interaction-driven deformations across various soft-body object categories.
The core innovation is conditioning the simulator directly on robot actions (joint-space commands), which keeps the robot and the deformable object kinematically and causally consistent. This contrasts with prior neural approaches, which regress motion solely from past states and therefore generalize poorly, and with physics-based systems, which impose fixed parameterizations and lack data-driven adaptability.
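To make the idea of action conditioning concrete, the following toy sketch drives a set of 2D points from a joint-space trajectory instead of extrapolating from past point states. It is not SoMA's learned dynamics: the planar arm, the rigid "grasped points follow the gripper" rule, and the names `planar_fk`, `rollout`, and `grasp_radius` are all assumptions made for illustration.

```python
import math

def planar_fk(joint_angles, link_lengths):
    """End-effector (x, y) of a planar serial arm, joint angles in radians."""
    x = y = theta = 0.0
    for q, length in zip(joint_angles, link_lengths):
        theta += q  # each joint rotates relative to the previous link
        x += length * math.cos(theta)
        y += length * math.sin(theta)
    return x, y

def rollout(points, joint_trajectory, link_lengths, grasp_radius=0.1):
    """Toy action-conditioned rollout (hypothetical stand-in for a learned model).

    Points within grasp_radius of the initial gripper position are treated as
    grasped and translated with the gripper at every step; all others stay put.
    """
    prev = planar_fk(joint_trajectory[0], link_lengths)
    grasped = {i for i, p in enumerate(points) if math.dist(p, prev) <= grasp_radius}
    frames = [list(points)]
    for q in joint_trajectory[1:]:
        cur = planar_fk(q, link_lengths)
        dx, dy = cur[0] - prev[0], cur[1] - prev[1]
        points = [
            (px + dx, py + dy) if i in grasped else (px, py)
            for i, (px, py) in enumerate(points)
        ]
        frames.append(points)
        prev = cur
    return frames

# Two-link arm with unit links; the gripper starts at (2, 0) and swings to about (0, 2).
frames = rollout(
    points=[(2.0, 0.0), (1.0, 0.0)],
    joint_trajectory=[(0.0, 0.0), (math.pi / 2, 0.0)],
    link_lengths=[1.0, 1.0],
)
```

Because every step is a function of the commanded joints, changing the trajectory changes the object motion in a kinematically consistent way, which is the property a state-only predictor cannot guarantee.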
Hierarchical Force-driven Dynamics
SoMA models environmental and robot-induced forces directly at the splat level and propagates these through a multi-level hierarchical graph using graph neural networks. The environmental forces (e.g., gravity, contact with surfaces) and robot interaction forces (computed from joint trajectories and gripper state) are hierarchically aggregated and then distributed through the graph, enabling local deformation while preserving global structure. This hierarchical force aggregation offers significant improvements in simulation stability, especially over long-horizon interactions typical of tasks like cloth folding.
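The aggregate-then-distribute pattern described above can be sketched without any neural network. The example below uses a two-level hierarchy, a plain mean for the bottom-up pass, and a fixed blend weight for the top-down pass; these choices, along with the names `cluster_means`, `distribute`, and `alpha`, are illustrative assumptions, whereas SoMA learns the propagation with graph neural networks.

```python
from collections import defaultdict

Vec3 = tuple[float, float, float]

def cluster_means(forces: list[Vec3], parent: list[int]) -> dict[int, Vec3]:
    """Bottom-up pass: average per-splat forces within each parent cluster."""
    sums = defaultdict(lambda: [0.0, 0.0, 0.0])
    counts = defaultdict(int)
    for force, cluster in zip(forces, parent):
        counts[cluster] += 1
        for axis in range(3):
            sums[cluster][axis] += force[axis]
    return {c: tuple(s / counts[c] for s in sums[c]) for c in sums}

def distribute(forces: list[Vec3], parent: list[int], alpha: float = 0.5) -> list[Vec3]:
    """Top-down pass: blend each splat's own force with its cluster mean.

    alpha = 1 keeps purely local forces; alpha = 0 makes each cluster move rigidly.
    The blend preserves the total force acting on every cluster.
    """
    means = cluster_means(forces, parent)
    return [
        tuple(alpha * f + (1.0 - alpha) * m for f, m in zip(force, means[cluster]))
        for force, cluster in zip(forces, parent)
    ]

gravity = (0.0, 0.0, -1.0)
forces = [(2.0, 0.0, -1.0), gravity, gravity, gravity]  # splat 0 also feels a robot push along x
parent = [0, 0, 1, 1]                                   # splats 0-1 in cluster 0, 2-3 in cluster 1
smoothed = distribute(forces, parent)
```

In this toy setup the robot push on splat 0 is partly shared with its neighbor in the same cluster, while the other cluster only feels gravity: local deformation is allowed, but splats in a cluster do not drift apart arbitrarily.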
Multi-resolution Supervision and Occlusion Handling
A notable design feature is the two-stage multi-resolution training regime for temporal and spatial dynamics, which mitigates error accumulation and drift. During training, the simulator first learns global motion patterns at coarse temporal resolution and then fine-tunes on detailed, high-frequency dynamics. Additionally, SoMA employs blended supervision: rendered images are supervised using occlusion-aware masks to focus the loss on visible regions, while physics-inspired consistency constraints (momentum conservation) regularize unobserved or occluded states. This dual-mode loss formulation significantly improves generalization to novel, occluded, or complex contact situations.
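A minimal sketch of such a blended loss follows, assuming a masked L1 image term and a squared momentum-balance residual as the physics term. The exact loss terms and weights used by SoMA are not given here, so `masked_l1`, `momentum_penalty`, `blended_loss`, and `weight` are hypothetical.

```python
def masked_l1(pred, target, visible):
    """Mean absolute error over the pixels flagged as visible."""
    total, count = 0.0, 0
    for p, t, v in zip(pred, target, visible):
        if v:
            total += abs(p - t)
            count += 1
    return total / count if count else 0.0

def momentum_penalty(masses, vel_before, vel_after, impulse):
    """Squared mismatch between the change in total momentum and the applied impulse."""
    residual_sq = 0.0
    for axis in range(3):
        before = sum(m * v[axis] for m, v in zip(masses, vel_before))
        after = sum(m * v[axis] for m, v in zip(masses, vel_after))
        residual_sq += (after - before - impulse[axis]) ** 2
    return residual_sq

def blended_loss(pred, target, visible, masses, vel_before, vel_after, impulse, weight=0.1):
    """Image term on visible pixels plus a weighted physics term on all particles."""
    image_term = masked_l1(pred, target, visible)
    physics_term = momentum_penalty(masses, vel_before, vel_after, impulse)
    return image_term + weight * physics_term
```

The image term ignores occluded pixels entirely, while the physics term applies to every particle regardless of visibility, so hidden regions still receive a training signal.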
Empirical Evaluation
SoMA is benchmarked on comprehensive, real-world datasets involving manipulation of ropes, cloths, dolls, and T-shirts, using synchronized RGB-D imaging and robot kinematics. Tasks range from simple resimulation to challenging generalization scenarios (unseen manipulation trajectories and action types).
Strong Numerical Results
- On both resimulation and generalization tasks, SoMA demonstrates at least 20% improvement in both RGB and geometric (depth) accuracy metrics over baselines, including PhysTwin (Jiang et al., 2025) and GausSim (Shao et al., 2024).
- For complex tasks like T-shirt folding, SoMA maintains long-horizon stability and high rendering fidelity (PSNR up to 27.57 dB, SSIM 0.896), while baseline simulators exhibit collapse or structural errors.
- Ablation studies affirm the necessity of multi-resolution training and blended (image + physics) supervision: removing either causes substantial accuracy drops, particularly in generalization to unseen manipulations.
- SoMA displays interaction-consistent and controllable simulation capabilities, remaining robust to action deviations and occlusion-heavy scenarios.
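For readers unfamiliar with the PSNR figure quoted above, the standard definition is PSNR = 10 · log10(MAX² / MSE). The sketch below computes it over flat lists of pixel intensities; the function name `psnr` and the default `max_value=1.0` (images scaled to [0, 1]) are assumptions, not details taken from the paper's evaluation code.

```python
import math

def psnr(pred, target, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences."""
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    if mse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)

# A uniform error of 0.1 on a [0, 1] scale gives an MSE of 0.01.
score = psnr([0.5, 0.5], [0.4, 0.6])
```

Under the same [0, 1] scaling assumption, a PSNR of 27.57 dB corresponds to a root-mean-square pixel error of about 0.042.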
Implications and Future Directions
SoMA represents a significant advancement in real-to-sim neural simulation for soft-body manipulation. Its ability to learn interaction-aware, force-driven dynamics directly from high-dimensional visual and kinematic data reduces the simulation-reality gap, enabling practical application for embodied policy development and data augmentation in robotics. Unlike conventional world-modeling architectures, SoMA achieves explicit physical consistency and controllability, both critical for deployment in manipulation policy learning, planning, and real-world transfer.
However, SoMA's reliance on high-quality visual reconstructions and the bounded diversity of training data highlight ongoing challenges. Its generalization may degrade under extreme occlusion or novel contact topologies not seen during training. Scaling the model and training protocols to broader object categories, richer contact types, and more diverse manipulation settings constitutes a direct path for future research. Integration with differentiable planning, multi-agent coordination, or reinforcement policy optimization frameworks is also facilitated by its end-to-end differentiable, action-conditioned structure.
Conclusion
SoMA constitutes a robust, interaction-consistent neural simulation framework for deformable object manipulation under real-world robot actions, operating directly on hierarchical Gaussian splat representations. It achieves strong improvements in long-horizon stability, generalization to unseen manipulations, and physical plausibility, outperforming prior neural and physics-based simulation methods across varied benchmarks. The practical and theoretical advances in action-conditioned, force-driven neural dynamics modeling pave the way for improved embodied learning, simulation-driven robotic policy design, and large-scale virtual dataset generation in complex manipulation scenarios (2602.02402).