Soft-Body Manipulation Benchmark

Updated 21 November 2025

Soft-body manipulation benchmarks are standardized task suites that rigorously assess robotics systems interacting with deformable materials like sand, cloth, and plasticine.
They employ diverse simulation methods—such as MPM, FEM, and PBD—to capture non-linear, high-dimensional dynamics and complex contact interactions.
These benchmarks enable direct comparisons of reinforcement learning, imitation learning, and optimization strategies while addressing sim-to-real transfer challenges.

A soft-body manipulation benchmark is a standardized simulation or real-world task suite designed to rigorously assess the capabilities of algorithms, robots, and systems in interacting with highly deformable objects—including granular matter, elastoplastic solids, cloth, and articulated deformables—under well-defined initial conditions, constraints, and evaluation metrics. These benchmarks formalize manipulation as control or planning problems where a robot (virtual or physical) must shape, move, or interact with soft materials to achieve precise target outcomes, with a strong emphasis on modeling the non-linear, high-dimensional dynamics and contact interactions particular to the soft-body domain.

1. Benchmark Task Suites and Canonical Problems

Soft-body manipulation benchmarks span a diversity of materials and challenges to stress-test the full range of skills necessary for deployment in real-world environments. Canonical benchmarks comprise:

Granular media and unbounded soft environments: UBSoft defines four large-scale, high-fidelity manipulation tasks using dry sand—Sand Painting, Scoop Out, Dig A Hole, and Smooth Surface—implemented as finite-horizon Markov decision processes with spatially adaptive MLS-MPM physics (Lin et al., 2024).
Elasto-plastic solids and multi-materials: ManiSkill2’s six tasks (Fill, Hang, Excavate, Pour, Pinch, Write) employ plasticine, water, and flexible noodles, using GPU-accelerated MPM solvers (Gu et al., 2023, Li et al., 2024).
Cloth and garment manipulation tasks: RGBench (Real Garment Benchmark) and GarmentLab target 3D cloth models, folding, flinging, dressing avatars, and interacting with fluids/rigids/human forms using high-resolution FEM or PBD physics (Hu et al., 9 Nov 2025, Lu et al., 2024).
Multi-stage, language/vision-specified goals: SoftVL100 aggregates 100 human-inspired soft-body tasks with multi-stage keyframes, covering deformation, folding, winding, and cutting, each annotated by vision-language inputs and optimized via differentiable physics (Huang et al., 2023).
Mobile and continuum-robot manipulation: MoDeSuite assesses mobile base and arm coordination in tasks such as dragging, placing, and lifting elastic and plastic objects (rods, cloth) (Zhang et al., 29 Jul 2025).
Physical human–robot collaboration: Benchmarks with continuum soft manipulators quantify co-manipulation performance, comparing human–robot teams to human–human dyads in large object transport and positioning (Cordon et al., 11 Apr 2025).

2. Physics Engines and Simulation Methodologies

Benchmarks rely on specialized simulation backends built to handle the particular challenges of soft-body dynamics:

Material Point Method (MPM) and MLS-MPM: Provide grid–particle hybrid simulation of elastoplastic, granular, and fluid-like behavior; they are efficient for differentiable trajectory optimization and support dynamic contact. UBSoft utilizes a hierarchical, octree-based adaptive grid for scalable simulation of unbounded soft environments, where fine grid cells cluster near the manipulator (Lin et al., 2024). ManiSkill2 and SoftVL100 also deploy MLS-MPM (Gu et al., 2023, Huang et al., 2023).
Finite Element Method (FEM): Used for high-fidelity, anisotropic simulation of garments and thin shells (RGBench, GarmentLab), often with GPU acceleration and inexact Newton/PCG solvers for implicit time stepping. Quadratic bending terms are critical for capturing wrinkle dynamics in cloth (Hu et al., 9 Nov 2025, Lu et al., 2024).
Position-Based Dynamics (PBD): Powers real-time cloth/wire/rope/fluids simulation (GarmentLab, SoftGym) using constraint projection and collision handling, trading some realism for throughput and robustness (Lu et al., 2024, Lin et al., 2020).
Differentiable Physics Engines: Key for trajectory optimization, enabling gradient-based planners and hybrid policy learning. PlasticineLab and CPDeform leverage differentiable MPM (DiffTaichi), supporting both elastic and plastic flows with analytic gradients through contact (Huang et al., 2021, Li et al., 2022).
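
To make the constraint-projection idea behind PBD concrete, the following is a minimal sketch of one simulation step for a 2D particle chain (a toy rope). It is not taken from GarmentLab, SoftGym, or any other benchmark named above; the function name `pbd_rope_step`, its parameters, and the default values are all assumptions chosen for illustration. The step predicts positions under gravity, repeatedly projects neighbouring particles back to their rest distance, and then derives velocities from the corrected positions.

```python
import math


def pbd_rope_step(positions, velocities, inv_mass, rest_length,
                  dt=0.01, gravity=-9.81, iterations=20):
    """Advance a 2D particle chain by one position-based dynamics step.

    positions, velocities: lists of [x, y]; inv_mass: list of inverse
    masses, where 0.0 marks a pinned particle. All names are illustrative.
    """
    # 1. Predict positions by integrating gravity (pinned particles stay put).
    predicted = []
    for p, v, w in zip(positions, velocities, inv_mass):
        if w == 0.0:
            predicted.append([p[0], p[1]])
        else:
            vy = v[1] + gravity * dt
            predicted.append([p[0] + v[0] * dt, p[1] + vy * dt])

    # 2. Gauss-Seidel projection of distance constraints between neighbours.
    for _ in range(iterations):
        for i in range(len(predicted) - 1):
            a, b = predicted[i], predicted[i + 1]
            dx, dy = b[0] - a[0], b[1] - a[1]
            dist = math.hypot(dx, dy)
            w_sum = inv_mass[i] + inv_mass[i + 1]
            if dist == 0.0 or w_sum == 0.0:
                continue
            # Constraint C = dist - rest_length; the correction is split
            # between the two particles in proportion to inverse mass.
            scale = (dist - rest_length) / (dist * w_sum)
            a[0] += inv_mass[i] * scale * dx
            a[1] += inv_mass[i] * scale * dy
            b[0] -= inv_mass[i + 1] * scale * dx
            b[1] -= inv_mass[i + 1] * scale * dy

    # 3. Recover velocities from the net position change over the step.
    new_velocities = [[(q[0] - p[0]) / dt, (q[1] - p[1]) / dt]
                      for p, q in zip(positions, predicted)]
    return predicted, new_velocities
```

Because the constraints are enforced directly on positions, the step stays stable even with a fixed iteration budget, which is one reason PBD is described above as trading some realism for throughput and robustness. A production cloth solver would add bending constraints, collision handling, and damping, none of which are shown here.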

3. Task Formulations, Metrics, and Evaluation Protocols

Soft-body benchmarks specify manipulation as episodic control/optimization problems with precise evaluation criteria:

Reward/Objective Functions:
- Shape Matching: Chamfer distance, Intersection-over-Union (IoU), or Wasserstein-1 distance between current and goal object configurations (e.g., volumetric match for platform-carved sand or molded plasticine) (Huang et al., 2021, Lin et al., 2024, Li et al., 2022).
- Composite Multi-stage Constraints: Success defined over sequences of keyframes and subgoals (SoftVL100, PlasticineLab-M), requiring planning over contact switch events (Huang et al., 2023, Li et al., 2022).
- Task-specific Sparse and Dense Rewards: E.g., particle count in target area, poured liquid level within ±4 mm of the target, surface area metrics for cloth flatness, or energy penalties for deformation (Gu et al., 2023, Hu et al., 9 Nov 2025, Lin et al., 2020).
Particle, Mesh, and Point Cloud Observations:
- Partial and full-state observation formats are provided, with fused point clouds (e.g., ManiSkill2's ≥2000 points from multi-camera depth) and/or full mesh access for state-based methods.
Success Criteria:
- Binary task completion (e.g., ≥90% clay in beaker, draw IoU > 0.8), cumulative reward over horizon, deformation error, sim-to-real matching error metrics (bidirectional Chamfer/Hausdorff distance on cloth) (Hu et al., 9 Nov 2025).
Evaluation Splits & Protocols:
- Extensive randomization of initial states, held-out test splits (e.g., 100 evaluation episodes/task in ManiSkill2), benchmarking of RL, IL, and trajectory optimization baselines. Simulation-to-robot transfer is directly assessed for some platforms (Lin et al., 2024, Zhang et al., 29 Jul 2025, Hu et al., 9 Nov 2025).
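
The shape-matching metrics and thresholded success criteria listed above can be sketched in a few lines. This is a brute-force illustration using only the Python standard library, not the evaluation code of any benchmark cited here. The function names, the voxel size, and the default threshold are assumptions (the 0.8 default mirrors the IoU example quoted above). Note also that Chamfer distance has several conventions in the literature (squared versus unsquared distances, sums versus means); this sketch uses the mean of unsquared nearest-neighbour distances in both directions.

```python
import math


def chamfer_distance(points_a, points_b):
    """Symmetric Chamfer distance between two non-empty point sets.

    Brute force, O(len(a) * len(b)); fine for small illustrative inputs.
    """
    def one_way(src, dst):
        # Mean distance from each source point to its nearest target point.
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)

    return one_way(points_a, points_b) + one_way(points_b, points_a)


def voxelize(points, voxel_size):
    """Map points to the set of integer voxel indices they occupy."""
    return {tuple(math.floor(c / voxel_size) for c in p) for p in points}


def voxel_iou(occupied_a, occupied_b):
    """Intersection-over-Union between two sets of occupied voxel indices."""
    union = occupied_a | occupied_b
    if not union:
        return 1.0  # convention assumed here: two empty shapes match
    return len(occupied_a & occupied_b) / len(union)


def is_success(points, goal_points, voxel_size=0.05, iou_threshold=0.8):
    """Binary success: voxel IoU with the goal shape exceeds a threshold."""
    iou = voxel_iou(voxelize(points, voxel_size),
                    voxelize(goal_points, voxel_size))
    return iou > iou_threshold
```

A dense reward would typically use the negative Chamfer distance (or the IoU itself) at every step, while the binary `is_success` check would be applied once at the end of an episode.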

4. Representative Algorithms and Baseline Performance

Soft-body benchmarks serve as testbeds for a spectrum of algorithms:

Reinforcement Learning: On-policy (PPO) and off-policy (SAC) methods are standard, with reduced-state or point cloud visual inputs. Performance is high on low-DoF or reward-shaped domains, but RL agents typically underperform in high-dimensional, partially observed, or multi-stage deformable tasks, where median RL success is often below 50%, especially with image-based observations (Lin et al., 2020, Li et al., 2024).
Behavior Cloning and Imitation Learning: Demonstration-based policies are widely benchmarked (e.g., GP2E behavior cloning policy, ManiSkill2), with guided self-attention and fine-tuning strategies delivering significant improvements (~43% average success rate in GP2E versus baseline 25%) on generalized soft-body manipulation (Li et al., 2024).
Gradient-Based Trajectory Optimization: Planners built on differentiable simulators (e.g., DiffTaichi-based engines, DiffVL) exploit analytic gradients from the physics engine to rapidly converge to solutions on single-stage, smooth-contact problems (e.g., PlasticineLab’s “Move,” “RollingPin”). However, performance declines with additional non-convexity, contact switching, or longer horizons (Huang et al., 2021, Huang et al., 2023).
Sampling-Based Optimization: Evolution strategies like CMA-ES can substantially outperform RL and trajectory optimization on high-dimensional, non-smooth tasks (UBSoft: best results in 3/4 tasks), despite greater computational cost and lack of gradient information (Lin et al., 2024).
Contact Point Discovery: CPDeform integrates optimal-transport–based prioritization for selecting end-effector contacts, significantly improving success in multi-stage, complex shape manipulation benchmarks such as PlasticineLab-M (Li et al., 2022).
Design + Control Co-Optimization: Evolution Gym jointly evolves both soft-robot morphology (via genetic algorithms, BO, or CPPNs) and neural control policies, demonstrating emergent manipulation strategies such as shaped “cups” and parallel-jaw “grippers” (Bhatia et al., 2022).
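
As a rough illustration of gradient-free, sampling-based planning over open-loop action sequences, the sketch below implements a simplified elite-averaging evolution strategy. It is deliberately not CMA-ES: there is no covariance adaptation, only an isotropic Gaussian with a decaying step size. The cost function `rollout_cost` is a hypothetical 1D stand-in for a simulator rollout (a tool that moves by clipped displacements toward a goal), and every name and hyperparameter here is an assumption rather than anything taken from UBSoft.

```python
import random


def rollout_cost(actions, goal=1.0, max_step=0.2):
    """Toy stand-in for a simulator rollout (hypothetical, 1D).

    A tool starts at 0 and moves by each action, clipped to +/- max_step.
    Cost = squared final distance to the goal plus a small effort penalty.
    The clipping makes the cost non-smooth, as contact often does.
    """
    x = 0.0
    for a in actions:
        x += max(-max_step, min(max_step, a))
    effort = sum(a * a for a in actions)
    return (x - goal) ** 2 + 1e-3 * effort


def evolution_strategy(cost_fn, horizon, population=32, elites=8,
                       sigma=0.3, generations=60, seed=0):
    """Search over action sequences by sampling, ranking, and averaging."""
    rng = random.Random(seed)
    mean = [0.0] * horizon
    best_plan, best_cost = list(mean), cost_fn(mean)
    for _ in range(generations):
        # Sample candidate plans around the current mean plan.
        candidates = [[m + sigma * rng.gauss(0.0, 1.0) for m in mean]
                      for _ in range(population)]
        # Rank by rollout cost; no gradient information is used.
        candidates.sort(key=cost_fn)
        top = candidates[:elites]
        # Move the mean to the average of the elite plans.
        mean = [sum(c[t] for c in top) / elites for t in range(horizon)]
        sigma *= 0.95  # shrink the search radius over time
        top_cost = cost_fn(top[0])
        if top_cost < best_cost:
            best_plan, best_cost = top[0], top_cost
    return best_plan, best_cost
```

Calling `evolution_strategy(rollout_cost, horizon=10)` returns the best plan found and its cost. Each generation needs `population` full rollouts, which illustrates the computational cost noted above: with a real soft-body simulator in place of the toy cost, rollouts dominate the runtime.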

5. Real-World Transfer, Sim-to-Real Alignment, and Evaluation

Modern benchmarks increasingly measure the sim-to-real gap and validate transfer to physical systems:

Digital Twin Construction and Material Calibration: UBSoft builds physical replicas of simulated sand tasks, tuning material parameters until free-surface behavior converges between simulation and optical flow data. Open-loop CMA-ES plans execute on real robotic arms, demonstrating successful transfer for inscribing shapes and extracting buried objects (Lin et al., 2024).
Cloth and Garment Verification: RGBench aligns simulated meshes and robot trajectories with lidar-derived ground truth. Chamfer distance and Hausdorff error reductions of 28–79% over baselines indicate improved sim-to-real fidelity for cloth grasping, folding, and flinging across a range of garments (Hu et al., 9 Nov 2025).
Physical Human–Robot Collaboration: Benchmarks quantify collaborative manipulation performance (e.g., completion time, path efficiency, workload, trust/safety ratings) between passive-compliant soft robots and human operators, finding performance gaps of ~12–15% relative to human–human teams, with robust qualitative acceptance (Cordon et al., 11 Apr 2025).
Generalization and Domain Adaptation: Techniques such as point-cloud affine/semantic alignment, noisy observation augmentation, and keypoint embedding are essential for bridging visual and dynamics differences, especially in garment manipulation (Lu et al., 2024).
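
Sim-to-real matching errors of the kind mentioned above are usually reported as a distance between simulated and measured geometry, together with a relative reduction against a baseline. The sketch below shows a generic symmetric Hausdorff distance and the arithmetic behind a "percent reduction" figure. It is not RGBench's evaluation code; the function names are assumptions, and the brute-force search is only suitable for small point sets.

```python
import math


def hausdorff_distance(points_a, points_b):
    """Symmetric (bidirectional) Hausdorff distance between two point sets.

    This is the worst-case nearest-neighbour distance in either direction,
    so a single badly placed vertex dominates the result.
    """
    def directed(src, dst):
        return max(min(math.dist(p, q) for q in dst) for p in src)

    return max(directed(points_a, points_b), directed(points_b, points_a))


def relative_reduction(baseline_error, new_error):
    """Fractional error reduction versus a baseline (0.28 means 28% lower)."""
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return (baseline_error - new_error) / baseline_error
```

Hausdorff distance complements Chamfer distance: the former is sensitive to outliers such as a single mis-simulated fold, while the latter averages over the whole surface, so reporting both gives a fuller picture of how well a simulated garment tracks the real one.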

6. Open Challenges and Future Directions

Despite significant progress, several open problems remain central in soft-body manipulation benchmarking:

Scaling to Long-horizon, Multi-contact Tasks: Pure trajectory optimization struggles with contact switching and multi-stage planning; RL fails to explore sufficiently in high-DoF settings (Li et al., 2022, Huang et al., 2023, Huang et al., 2021).
Robustness under Partial Observability: Single-view image-based RL exhibits large generalization gaps due to occlusion and the absence of velocity information. Multi-modal (point cloud, tactile) and history-augmented agents are required (Lin et al., 2020, Lu et al., 2024).
Sim-to-Real Gap: Nonlinear material behavior, unmodeled physics (wrinkles, friction, air/fluid–cloth coupling), and visual complexity make direct transfer challenging. Further work on domain randomization, real-to-sim parameter fitting, and hardware-in-the-loop learning is needed (Hu et al., 9 Nov 2025, Lu et al., 2024).
Benchmark Diversity and Standardization: Continued expansion of public benchmarks (e.g., garment types, granular tasks, collaborative settings), open APIs, and standardized protocols (train/val/test splits, reporting, reproducibility) remains a community priority.

Together, these benchmarks form core infrastructure for advancing deformable object manipulation, establishing rigorous data-driven evaluation and enabling quantifiable progress in algorithm and system development for the next generation of robotic agents.