Cloth Dynamics Splatting (CloDS)
- Cloth Dynamics Splatting is a framework that infers 3D cloth mesh states and dynamics from synchronized RGB videos without direct supervision.
- It uses a three-stage pipeline—video-to-geometry grounding, mesh tracking with differentiable inversion, and graph neural network-based dynamics learning.
- Experimental evaluations show high simulation accuracy and robust performance under self-occlusion and non-linear deformations.
Cloth Dynamics Splatting (CloDS) is a paradigm for learning and estimating the physical state and temporal evolution of cloth under unknown conditions, using only visual input from multi-view RGB sequences. CloDS methods leverage mesh-constrained, differentiable 3D Gaussian splatting for video-to-geometry grounding and state tracking, and employ graph-based neural simulators for unsupervised dynamics learning. This approach establishes a framework where no direct supervision of mesh states, material parameters, or environmental conditions is required—enabling robust and generalizable cloth simulation, state estimation, and future prediction from vision alone (Zhan et al., 2 Feb 2026, Longhini et al., 3 Jan 2025).
1. Problem Formulation and Mathematical Foundations
The central challenge addressed by Cloth Dynamics Splatting is Cloth Dynamics Grounding (CDG): inferring the sequence of 3D cloth mesh states $\{M_t\}$ and modeling the cloth dynamics $p(M_{t+1} \mid M_t)$, given only multi-view synchronized RGB videos of cloth subject to unknown material and environmental conditions. The cloth at each timestep $t$ is represented as a mesh $M_t = (V_t, U, E)$, where $V_t \in \mathbb{R}^{N \times 3}$ encodes the node positions in world coordinates, $U \in \mathbb{R}^{N \times 2}$ are mesh-space (UV) coordinates, and $E$ specifies edge connectivity.
Unsupervised training proceeds by jointly learning:
- a differentiable rendering likelihood $p(I_t \mid M_t)$ that “grounds” the 3D mesh in the observed pixels
- a mesh-based transition model $p(M_{t+1} \mid M_t)$ for temporal dynamics
The Bayesian filtering formulation underpins the methodology:

$$p(M_t \mid I_{1:t}) \;\propto\; p(I_t \mid M_t) \int p(M_t \mid M_{t-1})\, p(M_{t-1} \mid I_{1:t-1})\, dM_{t-1},$$

where $p(M_t \mid I_{1:t})$ is the filtering posterior, $p(M_t \mid M_{t-1})$ is the learned transition model, and $p(I_t \mid M_t)$ is the rendering likelihood (Zhan et al., 2 Feb 2026).
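As a concrete sketch, the cloth mesh state described above (world-space positions, fixed UV coordinates, edge connectivity) can be held in a small container; the class and field names here are illustrative, not taken from a reference implementation:

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class ClothMesh:
    """Cloth state at one timestep. Names are illustrative."""
    V: np.ndarray  # (N, 3) node positions in world coordinates (varies per frame)
    U: np.ndarray  # (N, 2) mesh-space (UV) coordinates (constant over time)
    E: np.ndarray  # (M, 2) edge connectivity as vertex-index pairs (constant)

    def edge_lengths(self) -> np.ndarray:
        # Current world-space length of every mesh edge.
        return np.linalg.norm(self.V[self.E[:, 0]] - self.V[self.E[:, 1]], axis=1)
```

Only the world positions `V` change over time; `U` and `E` are fixed, which is what allows mesh-space conditioning (e.g., of opacity) to remain stable across frames.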
2. Three-Stage CloDS Workflow
CloDS frameworks are characterized by a three-stage unsupervised pipeline:
- Video-to-Geometry Grounding: A mesh is anchored with a set of anisotropic 3D Gaussians per frame, positioned using barycentric coordinates on mesh faces. Each Gaussian has a center $\mu_i$, covariance $\Sigma_i$ (aligned via face normals), color $c_i$ (possibly parameterized by spherical harmonics), and an adaptive opacity $\alpha_i$.
- Geometry Refinement and Mesh Tracking: To recover the 3D cloth shape at each frame, differentiable inversion is performed: offsets are optimized (via backpropagation) so that rendered projections of the deformed mesh match the observed RGB images, minimizing geometry losses that combine rendered-vs-observed pixel distances and mesh isometry terms.
- Dynamics Model Training: The resulting mesh trajectories are used as pseudo-ground truth to supervise a mesh-based Graph Neural Network (GNN) simulator (e.g., Mesh Graphormer Network) to learn time-evolution , using a rollout loss over a temporal horizon (Zhan et al., 2 Feb 2026, Longhini et al., 3 Jan 2025).
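The three stages above compose into a simple orchestration loop. The sketch below is a high-level skeleton only: `fit_gaussians`, `invert_geometry`, and `train_dynamics` are hypothetical callables standing in for the splatting fit, the per-frame differentiable inversion, and the GNN training, respectively.

```python
def clods_pipeline(frames, mesh0, fit_gaussians, invert_geometry, train_dynamics):
    """Three-stage CloDS skeleton (function names are hypothetical stand-ins).

    frames: list of multi-view RGB observations, one entry per timestep.
    mesh0:  initial cloth mesh estimate.
    """
    # Stage 1: anchor Gaussians on the initial mesh and fit them to frame 0.
    gaussians = fit_gaussians(mesh0, frames[0])

    # Stage 2: per-frame differentiable inversion, warm-started from the
    # previous frame, yields a pseudo-ground-truth mesh trajectory.
    trajectory = [mesh0]
    for obs in frames[1:]:
        trajectory.append(invert_geometry(trajectory[-1], gaussians, obs))

    # Stage 3: supervise a mesh-based GNN simulator on the recovered trajectory.
    simulator = train_dynamics(trajectory)
    return trajectory, simulator
```

Warm-starting each inversion from the previous frame's mesh is what makes frame-to-frame tracking tractable despite large deformations.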
3. Mesh-Anchored 3D Gaussian Splatting and Rendering
Gaussian splatting for cloth imposes a mesh-anchored parameterization, enabling geometry-aware and fully-differentiable rendering. Key procedural details:
- Gaussian Anchoring: For each face, Gaussians are placed at barycentric interpolations of the current vertex positions: $\mu_i = \sum_{k=1}^{3} b_{ik}\, v_{ik}$, where $b_{ik}$ are fixed barycentric weights and $v_{ik}$ are the vertices of the face containing Gaussian $i$.
- Density and Compositing: The per-Gaussian density is $G_i(x) = \exp\!\big(-\tfrac{1}{2}(x-\mu_i)^\top \Sigma_i^{-1}(x-\mu_i)\big)$. Pixel color along a camera ray is composited using the standard front-to-back alpha-blending procedure.
- Dual-Position Opacity Modulation: Opacity for each Gaussian is adaptively determined by both its world-space and mesh-space coordinates: $\alpha_i = f_\theta(\mu_i, u_i)$, where $f_\theta$ is an MLP. This mechanism addresses issues of self-occlusion and ensures stable inversion by avoiding both erroneous transparency and perspective distortion (Zhan et al., 2 Feb 2026).
- Differentiable Mapping: Rendering gradients are backpropagated through the entire pipeline, allowing vertex and Gaussian parameters to be updated based on photometric loss with no mesh supervision requirement.
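The anchoring and density steps can be sketched directly. All shapes and names below are illustrative assumptions (per-face barycentric weights `bary`, `K` Gaussians per face), not the reference implementation:

```python
import numpy as np

def anchor_gaussians(V, faces, bary):
    """Place Gaussian centers at barycentric interpolations of face vertices.

    V:     (N, 3) vertex positions.
    faces: (F, 3) vertex indices per triangular face.
    bary:  (F, K, 3) barycentric weights (each row sums to 1), K Gaussians/face.
    Returns centers of shape (F, K, 3).
    """
    tri = V[faces]  # (F, 3, 3): the three vertices of every face
    # mu_i = sum_k b_ik * v_ik, batched over faces and Gaussians.
    return np.einsum('fkb,fbc->fkc', bary, tri)

def gaussian_density(x, mu, Sigma_inv):
    """Unnormalized anisotropic Gaussian density G_i(x)."""
    d = x - mu
    return np.exp(-0.5 * d @ Sigma_inv @ d)
```

Because the centers are differentiable functions of the vertex positions, photometric gradients at the Gaussians flow straight back to the mesh vertices.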
This approach is distinct from prior pixel-supervised or fixed-opacity splatting methods (e.g., GaMeS), yielding undistorted renderings and stable error accumulation even under large non-linear deformations and occlusions.
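A minimal sketch of the dual-position opacity network, assuming a tiny MLP over concatenated world and UV coordinates with a sigmoid output (sizes and initialization are arbitrary placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)

class OpacityMLP:
    """Tiny MLP mapping (world position, UV position) -> opacity in (0, 1).

    A stand-in for the dual-position opacity network f_theta; layer sizes
    and initialization here are arbitrary assumptions.
    """
    def __init__(self, hidden=16):
        self.W1 = rng.normal(scale=0.1, size=(5, hidden))  # 3 world + 2 UV inputs
        self.b1 = np.zeros(hidden)
        self.W2 = rng.normal(scale=0.1, size=(hidden, 1))
        self.b2 = np.zeros(1)

    def __call__(self, mu, u):
        x = np.concatenate([mu, u], axis=-1)                    # (..., 5)
        h = np.tanh(x @ self.W1 + self.b1)
        return 1.0 / (1.0 + np.exp(-(h @ self.W2 + self.b2)))   # sigmoid
```

Conditioning on the UV coordinate lets the network keep a Gaussian opaque even when its world position is occluded, which is the behavior a fixed per-Gaussian opacity cannot express.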
4. Unsupervised Geometric Tracking and Dynamics Learning
Unsupervised geometry recovery and dynamics modeling are achieved via the following losses and optimization procedure:
- Rendering Loss ($\mathcal{L}_{\mathrm{render}}$): Combines a pixel loss and Differentiable SSIM (D-SSIM) between rendered and observed images, with a weighting term balancing the two.
- Edge Loss ($\mathcal{L}_{\mathrm{edge}}$): Penalizes deviations in mesh edge lengths to preserve isometry and discourage physically implausible collapses (weight on the order of $0.05$).
- Rollout Dynamics Loss ($\mathcal{L}_{\mathrm{roll}}$): Sums node-wise position errors between predicted and extracted world coordinates over a rollout horizon $H$.
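The edge and rollout losses are simple enough to write out directly; this is a minimal sketch with assumed array shapes (the rendering loss is omitted since it depends on the differentiable rasterizer):

```python
import numpy as np

def edge_loss(V, E, rest_lengths):
    """Isometry penalty: squared deviation of current edge lengths from rest.

    V: (N, 3) vertex positions; E: (M, 2) edge index pairs;
    rest_lengths: (M,) edge lengths of the rest-state mesh.
    """
    cur = np.linalg.norm(V[E[:, 0]] - V[E[:, 1]], axis=1)
    return np.mean((cur - rest_lengths) ** 2)

def rollout_loss(pred_traj, target_traj):
    """Mean per-node position error, summed over the rollout horizon.

    Each trajectory is a sequence of (N, 3) vertex arrays.
    """
    return sum(np.mean(np.linalg.norm(p - t, axis=-1))
               for p, t in zip(pred_traj, target_traj))
```

Both losses vanish exactly when the predicted mesh is isometric to the rest state and the rollout matches the extracted trajectory, so they can be summed with the rendering loss without rescaling the optimum.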
Optimization proceeds in stages:
- Gaussian Fitting: Gaussian parameters are optimized on the initial frame (200 epochs).
- Mesh Extraction: At each timestep, vertex positions are updated to minimize the combined rendering and edge losses, until rendered views match observations.
- GNN Training: The mesh-based simulator is trained on the full sequence with rollout loss (1000 epochs, Adam optimizer). All procedures are fully unsupervised, requiring no ground-truth mesh data at any point (Zhan et al., 2 Feb 2026).
5. Experimental Evaluation and Benchmarking
CloDS frameworks have been evaluated on large-scale synthetic and real-world datasets:
- Synthetic Cloth Simulations: FLAGSIMPLE (ArcSim) with 1000 trajectories of up to 400 steps each, regular triangular meshes, rendered from up to 30 cameras.
- Quantitative Metrics:
- CDG Rollout RMSE (average per-node position error): CloDS (unsupervised, trained on all videos) nearly matches the Mesh Graphormer Network (MGN) trained with full supervision, and remains competitive with MGN on unseen trajectories.
- Novel-View Synthesis: SMGS (CloDS) achieves PSNR 36.24 dB, SSIM 0.995.
- Forward Video Prediction: CloDS outperforms state-of-the-art video predictors such as SimVP in PSNR ($26.62$ vs $25.47$ dB) and video-RMSE ($0.0478$ vs $0.0557$).
- Qualitative Findings: CloDS maintains sharp cloth edges and plausible dynamic evolution, robustly handling self-occlusion and ambiguous correspondences.
- Ablation Results: Ablations demonstrate that conditioning opacity on both world-space and mesh-space positions is essential for stable inversion and for avoiding perspective and transparency artifacts. Performance is robust to changes in camera count and Gaussian count (Zhan et al., 2 Feb 2026).
A comparable framework, Cloth-Splatting (Longhini et al., 3 Jan 2025), corroborates these findings and benchmarks favorably against alternative mesh tracking and estimation baselines (e.g., RAFT-Oracle, DynaGS).
6. Generalization, Applications, and Limitations
CloDS methods demonstrate strong generalization across geometric and textural variability, as well as extension to challenging scenarios:
- Shape and Texture Generalization: Models predict cylinder-shaped and mechanically simulated real-fashion cloth motions accurately, and performance is stable under unseen UV patterns.
- Complex Environments: Object–cloth collision dynamics can be learned by assigning rigid attributes to object Gaussians. Multi-body interactions under unknown forces are captured.
- Real-World Video: Preliminary applications to multi-view real videos (with SAM-based cloth segmentation) are successful despite lighting and sensor noise.
Limitations include:
- Initial Mesh Estimation: CloDS requires a plausible starting mesh, constructed via 2D Gaussian Splatting plus TSDF fusion, though small perturbations in this estimate introduce only minor degradation in RMSE.
- Lighting and Clutter: Complex illumination can reduce inversion fidelity; robust extension to highly cluttered scenes and joint segmentation remain open research avenues.
A plausible implication is that integration of learned BRDF representations or shadow-aware rendering could further enhance applicability in real-world captures (Zhan et al., 2 Feb 2026).
7. Relation to Prior Work and Outlook
CloDS extends prior mesh-based and pixel-supervised approaches, notably surpassing methods that use only 2D or fixed-parameter state trackers in capturing the spatiotemporal complexity of cloth. The mesh-anchored, differentiable splatting formulation forms a bidirectional bridge between 2D observations and inferred 3D geometry, supporting generalization and physical plausibility even in the absence of mesh or parameter supervision.
The methodology’s modularity has been demonstrated by substituting different GNN backbones (e.g., HCMT, DHMP, BSMS-GNN) without loss of accuracy, suggesting extensibility to other non-rigid or multi-body dynamic scenarios. CloDS establishes a general-purpose, robust framework for visual-only, unsupervised simulation and understanding of cloth and potentially other articulated or highly deformable objects (Zhan et al., 2 Feb 2026, Longhini et al., 3 Jan 2025).