Cloth Dynamics Splatting (CloDS)
- Cloth Dynamics Splatting is a framework that infers 3D cloth mesh states and dynamics from synchronized RGB videos without direct supervision.
- It uses a three-stage pipeline—video-to-geometry grounding, mesh tracking with differentiable inversion, and graph neural network-based dynamics learning.
- Experimental evaluations show high simulation accuracy and robust performance under self-occlusion and non-linear deformations.
Cloth Dynamics Splatting (CloDS) is a paradigm for learning and estimating the physical state and temporal evolution of cloth under unknown conditions, using only visual input from multi-view RGB sequences. CloDS methods leverage mesh-constrained, differentiable 3D Gaussian splatting for video-to-geometry grounding and state tracking, and employ graph-based neural simulators for unsupervised dynamics learning. This approach establishes a framework where no direct supervision of mesh states, material parameters, or environmental conditions is required—enabling robust and generalizable cloth simulation, state estimation, and future prediction from vision alone (Zhan et al., 2 Feb 2026, Longhini et al., 3 Jan 2025).
1. Problem Formulation and Mathematical Foundations
The central challenge addressed by Cloth Dynamics Splatting is Cloth Dynamics Grounding (CDG): inferring the sequence of 3D cloth mesh states $\{M_t\}$ and modeling the cloth dynamics $p(M_{t+1} \mid M_t)$, given only multi-view synchronized RGB videos of cloth subject to unknown material and environmental conditions. The cloth at each timestep $t$ is represented as a mesh $M_t = (V_t, U, E)$, where $V_t \in \mathbb{R}^{N \times 3}$ encodes the node positions in world coordinates, $U \in \mathbb{R}^{N \times 2}$ are mesh-space (UV) coordinates, and $E$ specifies edge connectivity.
Unsupervised training proceeds by jointly learning:
- a differentiable rendering likelihood $p(I_t \mid M_t)$ that “grounds” the 3D mesh in the observed pixels
- a mesh-based transition model $p(M_{t+1} \mid M_t)$ for temporal dynamics
The Bayesian filtering formulation underpins the methodology:

$$p(M_t \mid I_{1:t}) \;\propto\; p(I_t \mid M_t) \int p(M_t \mid M_{t-1})\, p(M_{t-1} \mid I_{1:t-1})\, dM_{t-1},$$

where $p(M_t \mid I_{1:t})$ is the filtering posterior, $p(M_t \mid M_{t-1})$ is the learned transition model, and $p(I_t \mid M_t)$ is the rendering likelihood (Zhan et al., 2 Feb 2026).
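As a concrete sketch, the cloth mesh state described above (world-space positions, fixed UV coordinates, edge connectivity) can be held in a small container; the class and field names here are illustrative, not taken from a reference implementation:

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class ClothMesh:
    """Cloth state at one timestep. Names are illustrative."""
    V: np.ndarray  # (N, 3) node positions in world coordinates (varies per frame)
    U: np.ndarray  # (N, 2) mesh-space (UV) coordinates (constant over time)
    E: np.ndarray  # (M, 2) edge connectivity as vertex-index pairs (constant)

    def edge_lengths(self) -> np.ndarray:
        # Current world-space length of every mesh edge.
        return np.linalg.norm(self.V[self.E[:, 0]] - self.V[self.E[:, 1]], axis=1)
```

Only the world positions `V` change over time; `U` and `E` are fixed, which is what allows mesh-space conditioning (e.g., of opacity) to remain stable across frames.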
2. Three-Stage CloDS Workflow
CloDS frameworks are characterized by a three-stage unsupervised pipeline:
- Video-to-Geometry Grounding: A mesh is anchored with a set of anisotropic 3D Gaussians per frame, positioned using barycentric coordinates on mesh faces. Each Gaussian has a center $\mu_i$, covariance $\Sigma_i$ (aligned via face normals), color $c_i$ (possibly parameterized by spherical harmonics), and an adaptive opacity $\alpha_i$.
- Geometry Refinement and Mesh Tracking: To recover the 3D cloth shape at each frame, differentiable inversion is performed: offsets are optimized (via backpropagation) so that rendered projections of the deformed mesh match the observed RGB images, minimizing geometry losses that combine rendered-vs-observed pixel distances and mesh isometry terms.
- Dynamics Model Training: The resulting mesh trajectories are used as pseudo-ground truth to supervise a mesh-based Graph Neural Network (GNN) simulator (e.g., Mesh Graphormer Network) to learn time-evolution , using a rollout loss over a temporal horizon (Zhan et al., 2 Feb 2026, Longhini et al., 3 Jan 2025).
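The three stages above compose into a simple orchestration loop. The sketch below is a high-level skeleton only: `fit_gaussians`, `invert_geometry`, and `train_dynamics` are hypothetical callables standing in for the splatting fit, the per-frame differentiable inversion, and the GNN training, respectively.

```python
def clods_pipeline(frames, mesh0, fit_gaussians, invert_geometry, train_dynamics):
    """Three-stage CloDS skeleton (function names are hypothetical stand-ins).

    frames: list of multi-view RGB observations, one entry per timestep.
    mesh0:  initial cloth mesh estimate.
    """
    # Stage 1: anchor Gaussians on the initial mesh and fit them to frame 0.
    gaussians = fit_gaussians(mesh0, frames[0])

    # Stage 2: per-frame differentiable inversion, warm-started from the
    # previous frame, yields a pseudo-ground-truth mesh trajectory.
    trajectory = [mesh0]
    for obs in frames[1:]:
        trajectory.append(invert_geometry(trajectory[-1], gaussians, obs))

    # Stage 3: supervise a mesh-based GNN simulator on the recovered trajectory.
    simulator = train_dynamics(trajectory)
    return trajectory, simulator
```

Warm-starting each inversion from the previous frame's mesh is what makes frame-to-frame tracking tractable despite large deformations.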
3. Mesh-Anchored 3D Gaussian Splatting and Rendering
Gaussian splatting for cloth imposes a mesh-anchored parameterization, enabling geometry-aware and fully-differentiable rendering. Key procedural details:
- Gaussian Anchoring: For each face, Gaussians are placed at barycentric interpolations of the current vertex positions: $\mu_i = \sum_{k=1}^{3} b_{ik}\, v_{ik}$, where $b_{ik}$ are fixed barycentric weights and $v_{ik}$ are the vertices of the face containing Gaussian $i$.
- Density and Compositing: The per-Gaussian density is $G_i(x) = \exp\!\big(-\tfrac{1}{2}(x-\mu_i)^\top \Sigma_i^{-1}(x-\mu_i)\big)$. Pixel color along a camera ray is composited using the standard front-to-back alpha-blending procedure.
- Dual-Position Opacity Modulation: Opacity for each Gaussian is adaptively determined by both its world-space and mesh-space coordinates: $\alpha_i = f_\theta(\mu_i, u_i)$, where $f_\theta$ is an MLP. This mechanism addresses issues of self-occlusion and ensures stable inversion by avoiding both erroneous transparency and perspective distortion (Zhan et al., 2 Feb 2026).
- Differentiable Mapping: Rendering gradients are backpropagated through the entire pipeline, allowing vertex and Gaussian parameters to be updated based on photometric loss with no mesh supervision requirement.
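The anchoring and density steps can be sketched directly. All shapes and names below are illustrative assumptions (per-face barycentric weights `bary`, `K` Gaussians per face), not the reference implementation:

```python
import numpy as np

def anchor_gaussians(V, faces, bary):
    """Place Gaussian centers at barycentric interpolations of face vertices.

    V:     (N, 3) vertex positions.
    faces: (F, 3) vertex indices per triangular face.
    bary:  (F, K, 3) barycentric weights (each row sums to 1), K Gaussians/face.
    Returns centers of shape (F, K, 3).
    """
    tri = V[faces]  # (F, 3, 3): the three vertices of every face
    # mu_i = sum_k b_ik * v_ik, batched over faces and Gaussians.
    return np.einsum('fkb,fbc->fkc', bary, tri)

def gaussian_density(x, mu, Sigma_inv):
    """Unnormalized anisotropic Gaussian density G_i(x)."""
    d = x - mu
    return np.exp(-0.5 * d @ Sigma_inv @ d)
```

Because the centers are differentiable functions of the vertex positions, photometric gradients at the Gaussians flow straight back to the mesh vertices.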
This approach is distinct from prior pixel-supervised or fixed-opacity splatting methods (e.g., GaMeS), yielding undistorted renderings and stable error accumulation even under large non-linear deformations and occlusions.
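A minimal sketch of the dual-position opacity network, assuming a tiny MLP over concatenated world and UV coordinates with a sigmoid output (sizes and initialization are arbitrary placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)

class OpacityMLP:
    """Tiny MLP mapping (world position, UV position) -> opacity in (0, 1).

    A stand-in for the dual-position opacity network f_theta; layer sizes
    and initialization here are arbitrary assumptions.
    """
    def __init__(self, hidden=16):
        self.W1 = rng.normal(scale=0.1, size=(5, hidden))  # 3 world + 2 UV inputs
        self.b1 = np.zeros(hidden)
        self.W2 = rng.normal(scale=0.1, size=(hidden, 1))
        self.b2 = np.zeros(1)

    def __call__(self, mu, u):
        x = np.concatenate([mu, u], axis=-1)                    # (..., 5)
        h = np.tanh(x @ self.W1 + self.b1)
        return 1.0 / (1.0 + np.exp(-(h @ self.W2 + self.b2)))   # sigmoid
```

Conditioning on the UV coordinate lets the network keep a Gaussian opaque even when its world position is occluded, which is the behavior a fixed per-Gaussian opacity cannot express.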
4. Unsupervised Geometric Tracking and Dynamics Learning
Unsupervised geometry recovery and dynamics modeling are achieved via the following losses and optimization procedure:
- Rendering Loss ($\mathcal{L}_{\mathrm{render}}$): Combines a pixel loss and Differentiable SSIM (D-SSIM) between rendered and observed images, with a weighting term balancing the two.
- Edge Loss ($\mathcal{L}_{\mathrm{edge}}$): Penalizes deviations in mesh edge lengths to preserve isometry and discourage physically implausible collapses (weight on the order of $0.05$).
- Rollout Dynamics Loss ($\mathcal{L}_{\mathrm{roll}}$): Sums node-wise position errors between predicted and extracted world coordinates over a rollout horizon $H$.
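The edge and rollout losses are simple enough to write out directly; this is a minimal sketch with assumed array shapes (the rendering loss is omitted since it depends on the differentiable rasterizer):

```python
import numpy as np

def edge_loss(V, E, rest_lengths):
    """Isometry penalty: squared deviation of current edge lengths from rest.

    V: (N, 3) vertex positions; E: (M, 2) edge index pairs;
    rest_lengths: (M,) edge lengths of the rest-state mesh.
    """
    cur = np.linalg.norm(V[E[:, 0]] - V[E[:, 1]], axis=1)
    return np.mean((cur - rest_lengths) ** 2)

def rollout_loss(pred_traj, target_traj):
    """Mean per-node position error, summed over the rollout horizon.

    Each trajectory is a sequence of (N, 3) vertex arrays.
    """
    return sum(np.mean(np.linalg.norm(p - t, axis=-1))
               for p, t in zip(pred_traj, target_traj))
```

Both losses vanish exactly when the predicted mesh is isometric to the rest state and the rollout matches the extracted trajectory, so they can be summed with the rendering loss without rescaling the optimum.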
Optimization proceeds in stages:
- Gaussian Fitting: Gaussian parameters are optimized on the initial frame (200 epochs).
- Mesh Extraction: At each timestep, vertex positions are updated to minimize the combined rendering and edge losses, until rendered views match observations.
- GNN Training: The mesh-based simulator is trained on the full sequence with rollout loss (1000 epochs, Adam optimizer). All procedures are fully unsupervised, requiring no ground-truth mesh data at any point (Zhan et al., 2 Feb 2026).
5. Experimental Evaluation and Benchmarking
CloDS frameworks have been evaluated on large-scale synthetic and real-world datasets:
- Synthetic Cloth Simulations: FLAGSIMPLE (ArcSim) with 1000 trajectories of up to 400 steps each, regular triangular meshes, rendered from up to 30 cameras.
- Quantitative Metrics:
- CDG Rollout RMSE (average per-node position error): CloDS (unsupervised, trained on all videos) nearly matches the Mesh Graphormer Network (MGN) trained with full supervision, and remains competitive with MGN on unseen trajectories.
- Novel-View Synthesis: SMGS (CloDS) achieves PSNR 36.24 dB, SSIM 0.995.
- Forward Video Prediction: CloDS outperforms state-of-the-art video predictors such as SimVP in PSNR ($26.62$ vs $25.47$ dB) and video-RMSE ($0.0478$ vs $0.0557$).
- Qualitative Findings: CloDS maintains sharp cloth edges and plausible dynamic evolution, robustly handling self-occlusion and ambiguous correspondences.
- Ablation Results: Ablations demonstrate that conditioning opacity on both world-space and mesh-space positions is essential for stable inversion and for avoiding perspective and transparency artifacts. Performance is robust to changes in camera count and Gaussian count (Zhan et al., 2 Feb 2026).
A comparable framework, Cloth-Splatting (Longhini et al., 3 Jan 2025), corroborates these findings and benchmarks favorably against alternative mesh tracking and estimation baselines (e.g., RAFT-Oracle, DynaGS).
6. Generalization, Applications, and Limitations
CloDS methods demonstrate strong generalization across geometric and textural variability, as well as extension to challenging scenarios:
- Shape and Texture Generalization: Models predict cylinder-shaped and mechanically simulated real-fashion cloth motions accurately, and performance is stable under unseen UV patterns.
- Complex Environments: Object–cloth collision dynamics can be learned by assigning rigid attributes to object Gaussians. Multi-body interactions under unknown forces are captured.
- Real-World Video: Preliminary applications to multi-view real videos (with SAM-based cloth segmentation) are successful despite lighting and sensor noise.
Limitations include:
- Initial Mesh Estimation: CloDS requires a plausible starting mesh, constructed via 2D Gaussian Splatting plus TSDF fusion, though small perturbations in this estimate introduce only minor degradation in RMSE.
- Lighting and Clutter: Complex illumination can reduce inversion fidelity; robust extension to highly cluttered scenes and joint segmentation remain open research avenues.
A plausible implication is that integration of learned BRDF representations or shadow-aware rendering could further enhance applicability in real-world captures (Zhan et al., 2 Feb 2026).
7. Relation to Prior Work and Outlook
CloDS extends prior mesh-based and pixel-supervised approaches, notably surpassing methods that use only 2D or fixed-parameter state trackers in capturing the spatiotemporal complexity of cloth. The mesh-anchored, differentiable splatting formulation forms a bidirectional bridge between 2D observations and inferred 3D geometry, supporting generalization and physical plausibility even in the absence of mesh or parameter supervision.
The methodology’s modularity has been demonstrated by substituting different GNN backbones (e.g., HCMT, DHMP, BSMS-GNN) without loss of accuracy, suggesting extensibility to other non-rigid or multi-body dynamic scenarios. CloDS establishes a general-purpose, robust framework for visual-only, unsupervised simulation and understanding of cloth and potentially other articulated or highly deformable objects (Zhan et al., 2 Feb 2026, Longhini et al., 3 Jan 2025).