Cloth Dynamics Splatting (CloDS)

Updated 9 February 2026
  • Cloth Dynamics Splatting is a framework that infers 3D cloth mesh states and dynamics from synchronized RGB videos without direct supervision.
  • It uses a three-stage pipeline—video-to-geometry grounding, mesh tracking with differentiable inversion, and graph neural network-based dynamics learning.
  • Experimental evaluations show high simulation accuracy and robust performance under self-occlusion and non-linear deformations.

Cloth Dynamics Splatting (CloDS) is a paradigm for learning and estimating the physical state and temporal evolution of cloth under unknown conditions, using only visual input from multi-view RGB sequences. CloDS methods leverage mesh-constrained, differentiable 3D Gaussian splatting for video-to-geometry grounding and state tracking, and employ graph-based neural simulators for unsupervised dynamics learning. This approach establishes a framework where no direct supervision of mesh states, material parameters, or environmental conditions is required—enabling robust and generalizable cloth simulation, state estimation, and future prediction from vision alone (Zhan et al., 2 Feb 2026, Longhini et al., 3 Jan 2025).

1. Problem Formulation and Mathematical Foundations

The central challenge addressed by Cloth Dynamics Splatting is Cloth Dynamics Grounding (CDG): inferring the sequence of 3D cloth mesh states $\{M_t\}$ and modeling cloth dynamics $p(M_{t+1} \mid M_t)$, given only multi-view synchronized RGB videos $Y_t = \{I^i_t\}_{i=1}^N$ of cloth subject to unknown material and environmental conditions. The cloth at each timestep is represented as a mesh $M_t = (x^W_t, x^M_t, E)$, where $x^W_t \in \mathbb{R}^{K \times 3}$ encodes the node positions in world coordinates, $x^M_t \in \mathbb{R}^{K \times 2}$ are mesh-space (UV) coordinates, and $E$ specifies edge connectivity.
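As a concrete reading of this state representation, a minimal container might look as follows (field names and the helper are illustrative, not taken from a released CloDS implementation):

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class MeshState:
    """Cloth mesh state M_t = (x^W_t, x^M_t, E) as described above."""
    x_world: np.ndarray  # (K, 3) node positions in world coordinates
    x_mesh: np.ndarray   # (K, 2) fixed mesh-space (UV) coordinates
    edges: np.ndarray    # (|E|, 2) vertex-index pairs giving connectivity

    def edge_lengths(self) -> np.ndarray:
        """Current world-space length of every mesh edge."""
        a, b = self.edges[:, 0], self.edges[:, 1]
        return np.linalg.norm(self.x_world[a] - self.x_world[b], axis=1)
```

Only `x_world` changes over time; the UV coordinates and connectivity are fixed per garment, which is what later stages exploit.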

Unsupervised training proceeds by jointly learning:

  • a differentiable rendering likelihood $p(Y_t \mid M_t)$ that “grounds” the 3D mesh in the observed pixels
  • a mesh-based transition model $p(M_{t+1} \mid M_t)$ for temporal dynamics

The Bayesian filtering formulation underpins the methodology:

$$p(Y_{t+1} \mid Y_{1:t}) = \int p(Y_{t+1} \mid M_{t+1})\, p(M_{t+1} \mid M_t)\, p(M_t \mid Y_{1:t})\, dM_t$$

where $p(M_t \mid Y_{1:t})$ is the filtering posterior, and $p(Y_{t+1} \mid M_{t+1})$ is the rendering likelihood (Zhan et al., 2 Feb 2026).
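The recursion becomes concrete on a discretized state space, where the integral over $M_t$ turns into a sum. The sketch below is a generic illustration of that prediction–update step, not part of CloDS itself:

```python
import numpy as np

def filter_step(prior, transition, likelihood):
    """One step of the filtering recursion above, over S discrete states.
    prior:      p(M_t | Y_{1:t}),       shape (S,)
    transition: p(M_{t+1} | M_t),       shape (S, S), rows indexed by M_t
    likelihood: p(Y_{t+1} | M_{t+1}),   shape (S,)
    Returns the evidence p(Y_{t+1} | Y_{1:t}) and the updated posterior."""
    predicted = prior @ transition   # prediction: p(M_{t+1} | Y_{1:t})
    joint = likelihood * predicted   # unnormalized posterior update
    evidence = joint.sum()           # p(Y_{t+1} | Y_{1:t})
    return evidence, joint / evidence
```

In CloDS the two factors are realized by the differentiable renderer (likelihood) and the learned GNN simulator (transition), rather than by explicit probability tables.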

2. Three-Stage CloDS Workflow

CloDS frameworks are characterized by a three-stage unsupervised pipeline:

  1. Video-to-Geometry Grounding: A mesh is anchored with a set of $K$ anisotropic 3D Gaussians per frame, positioned using barycentric coordinates on mesh faces. Each Gaussian has a center $\mu_{i,t}\in\mathbb{R}^3$, a covariance $\Sigma_{i,t}$ (aligned via face normals), a color $c_i$ (possibly parameterized by spherical harmonics), and an adaptive opacity $\alpha_{i,t}$.
  2. Geometry Refinement and Mesh Tracking: To recover the 3D cloth shape at each frame, differentiable inversion is performed: offsets $\Delta x^W_t$ are optimized via backpropagation so that rendered projections of the deformed mesh match the observed RGB images, minimizing geometry losses that combine rendered-vs-observed pixel distances and mesh isometry terms.
  3. Dynamics Model Training: The resulting mesh trajectories serve as pseudo-ground truth to supervise a mesh-based Graph Neural Network (GNN) simulator (e.g., a Mesh Graphormer Network) that learns the time-evolution $f_\phi(M_t)$, using a rollout loss over a temporal horizon (Zhan et al., 2 Feb 2026, Longhini et al., 3 Jan 2025).
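The three stages compose into a simple driver loop. In this sketch, `fit_gaussians`, `invert_geometry`, and `train_gnn` are hypothetical placeholders passed in as callables, standing for the procedures described above:

```python
def clods_pipeline(videos, initial_mesh, fit_gaussians, invert_geometry, train_gnn):
    """Schematic driver for the three-stage CloDS workflow.
    videos: per-timestep multi-view frame sets; the callables stand in
    for the grounding, tracking, and dynamics-learning procedures."""
    # Stage 1: anchor Gaussians on the initial mesh and fit them to frame 0.
    gaussians = fit_gaussians(initial_mesh, videos[0])
    # Stage 2: per-frame differentiable inversion yields a mesh trajectory.
    meshes = [initial_mesh]
    for frames in videos[1:]:
        meshes.append(invert_geometry(meshes[-1], gaussians, frames))
    # Stage 3: the trajectory is pseudo-ground truth for a GNN simulator.
    simulator = train_gnn(meshes)
    return meshes, simulator
```

Note that stage 2 warm-starts each frame from the previous frame's solution, which is what makes per-frame inversion tractable under large deformations.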

3. Mesh-Anchored 3D Gaussian Splatting and Rendering

Gaussian splatting for cloth imposes a mesh-anchored parameterization, enabling geometry-aware and fully-differentiable rendering. Key procedural details:

  • Gaussian Anchoring: For each face, Gaussians are placed at barycentric interpolations of the current vertex positions: $\mu_{i,t} = \sum_{u=1}^3 \beta_u x^W_{t,u}$.
  • Density and Compositing: The per-Gaussian density is $G_i(x) = \alpha_{i,t} \exp\left(-\frac{1}{2}(x-\mu_{i,t})^T \Sigma_{i,t}^{-1} (x-\mu_{i,t})\right)$. Pixel color along a camera ray is composited using the standard front-to-back alpha-blending procedure.
  • Dual-Position Opacity Modulation: Opacity for each Gaussian is adaptively determined by both its world-space and mesh-space coordinates: $\alpha_{i,t} = f_\theta(\mu^W_{i,t}, \mu^M_i)$, where $f_\theta$ is an MLP. This mechanism addresses issues of self-occlusion and ensures stable inversion by avoiding both erroneous transparency and perspective distortion (Zhan et al., 2 Feb 2026).
  • Differentiable Mapping: Rendering gradients are backpropagated through the entire pipeline, allowing vertex and Gaussian parameters to be updated based on photometric loss with no mesh supervision requirement.
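The anchoring and density formulas above can be sketched directly; this is a minimal illustration with hypothetical helper names, using NumPy in place of a differentiable rendering framework:

```python
import numpy as np

def anchor_centers(x_world, faces, barys):
    """Gaussian anchoring: one center per face at the barycentric point
    mu_i = sum_u beta_u * x^W_u of that face's current vertex positions.
    x_world: (K, 3) vertices; faces: (F, 3) indices; barys: (F, 3) weights."""
    tri = x_world[faces]                       # (F, 3, 3) vertices per face
    return np.einsum('fu,fud->fd', barys, tri)

def gaussian_density(x, mu, cov, alpha):
    """Evaluate alpha * exp(-0.5 (x - mu)^T cov^{-1} (x - mu))."""
    d = x - mu
    return alpha * np.exp(-0.5 * d @ np.linalg.solve(cov, d))
```

Because the centers are linear in the vertex positions, gradients of any photometric loss flow straight back to $x^W_t$, which is the key to the differentiable inversion in stage 2.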

This approach is distinct from prior pixel-supervised or fixed-opacity splatting methods (e.g., GaMeS), yielding undistorted renderings and avoiding accumulated tracking error even under large non-linear deformations and occlusions.

4. Unsupervised Geometric Tracking and Dynamics Learning

Unsupervised geometry recovery and dynamics modeling are achieved via the following losses and optimization procedure:

  • Rendering Loss ($L_\text{render}$): Combines an $L_1$ pixel loss and Differentiable SSIM (D-SSIM) between rendered and observed images. Typical weighting: $\lambda = 0.2$.
  • Edge Loss ($L_\text{edge}$): Penalizes deviations in mesh edge lengths to preserve isometry and discourage physically implausible collapses ($\gamma \approx 0.01$–$0.05$).
  • Rollout Dynamics Loss ($L_\text{dyn}$): Sums over a rollout horizon $T_\text{rollout}$ (typically $T_\text{rollout} = 8$), enforcing node-wise $L_2$ consistency between predicted and extracted world coordinates.
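Assuming the conventional $(1-\lambda)\,L_1 + \lambda\,\text{D-SSIM}$ combination from the Gaussian-splatting literature (the exact combination rule is not spelled out here), the first two losses might be sketched as:

```python
import numpy as np

def edge_loss(x_world, edges, rest_lengths):
    """L_edge: mean squared deviation of current edge lengths from their
    rest lengths, discouraging stretch and collapse (isometry prior)."""
    a, b = edges[:, 0], edges[:, 1]
    lengths = np.linalg.norm(x_world[a] - x_world[b], axis=1)
    return np.mean((lengths - rest_lengths) ** 2)

def render_loss(rendered, observed, dssim, lam=0.2):
    """L_render: (1 - lam) * L1 + lam * D-SSIM with lam = 0.2; the D-SSIM
    term is assumed computed externally and passed in."""
    l1 = np.mean(np.abs(rendered - observed))
    return (1 - lam) * l1 + lam * dssim
```

The rest lengths are taken from the initial mesh, so $L_\text{edge}$ needs no supervision beyond frame 0.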

Optimization proceeds in stages:

  1. Gaussian Fitting: Gaussian parameters are optimized on the initial frame ($\sim$200 epochs; learning rate $\approx 5 \times 10^{-3}$).
  2. Mesh Extraction: At each timestep, vertex positions are updated to minimize $L_\text{geometry} = L_\text{render} + \gamma L_\text{edge}$, until rendered views match observations.
  3. GNN Training: The mesh-based simulator is trained on the full sequence with the rollout loss ($\sim$1000 epochs; learning rate $10^{-4}$, Adam optimizer). All procedures are fully unsupervised, requiring no ground-truth mesh data at any point (Zhan et al., 2 Feb 2026).
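The rollout loss driving stage 3 can be illustrated as follows; `simulator` stands in for the learned GNN step $f_\phi$, and the sketch omits all gradient bookkeeping:

```python
import numpy as np

def rollout_loss(simulator, meshes, horizon=8):
    """L_dyn: node-wise squared L2 error accumulated over a rollout of up
    to `horizon` steps from the first extracted mesh state.
    meshes: list of (K, 3) node-position arrays extracted in stage 2."""
    loss, state = 0.0, meshes[0]
    for t in range(1, min(horizon + 1, len(meshes))):
        state = simulator(state)              # predicted next positions
        loss += np.mean(np.sum((state - meshes[t]) ** 2, axis=-1))
    return loss
```

Feeding the model's own predictions back in (rather than teacher forcing) is what makes the horizon-8 rollout penalize drift accumulation.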

5. Experimental Evaluation and Benchmarking

CloDS frameworks have been evaluated on large-scale synthetic and real-world datasets:

  • Synthetic Cloth Simulations: FLAGSIMPLE (ArcSim) with 1000 trajectories of up to 400 steps each, regular triangular meshes, rendered at $800\times800$ resolution from up to 30 cameras.
  • Quantitative Metrics:
    • CDG Rollout RMSE (average per-node $L_2$ position error): CloDS (unsupervised, all videos) achieves $0.1285 \pm 0.038$ units, nearly matching the Mesh Graphormer Network (MGN) in fully supervised settings ($0.1279 \pm 0.026$). On unseen trajectories, CloDS yields $0.1381 \pm 0.044$ units vs. $0.1359 \pm 0.029$ for MGN.
    • Novel-View Synthesis: SMGS (CloDS) achieves PSNR $\approx 36.24$ dB, SSIM $\approx 0.995$.
    • Forward Video Prediction: CloDS outperforms state-of-the-art video predictors such as SimVP in PSNR ($26.62$ vs. $25.47$ dB) and video-RMSE ($0.0478$ vs. $0.0557$).
  • Qualitative Findings: CloDS maintains sharp cloth edges and plausible dynamic evolution, robustly handling self-occlusion and ambiguous correspondences.
  • Ablation Results: Ablations demonstrate that both absolute ($\mu^M$) and relative ($\mu^W$) conditioning of opacity are essential for stable inversion and avoidance of perspective or transparency artifacts. Performance is robust to changes in camera and Gaussian count (Zhan et al., 2 Feb 2026).
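One plausible reading of the "average per-node $L_2$ position error" metric quoted above (per-node Euclidean distance, averaged over nodes and then over rollout timesteps) can be computed as:

```python
import numpy as np

def rollout_rmse(predicted, ground_truth):
    """Average per-node L2 position error over a rollout.
    predicted, ground_truth: lists of (K, 3) node-position arrays."""
    errors = [np.mean(np.linalg.norm(p - g, axis=-1))
              for p, g in zip(predicted, ground_truth)]
    return float(np.mean(errors))
```

The exact averaging order (over nodes, timesteps, and trajectories) is an assumption here; the cited papers should be consulted for the precise definition.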

A comparable framework, Cloth-Splatting (Longhini et al., 3 Jan 2025), corroborates these findings and benchmarks favorably against alternative mesh tracking and estimation baselines (e.g., RAFT-Oracle, DynaGS).

6. Generalization, Applications, and Limitations

CloDS methods demonstrate strong generalization across geometric and textural variability, as well as extension to challenging scenarios:

  • Shape and Texture Generalization: Models accurately predict the motion of cylinder-shaped cloth and of mechanically simulated real-fashion garments, and performance is stable under unseen UV patterns.
  • Complex Environments: Object–cloth collision dynamics can be learned by assigning rigid attributes to object Gaussians. Multi-body interactions under unknown forces are captured.
  • Real-World Video: Preliminary applications to multi-view real videos (with SAM-based cloth segmentation) are successful despite lighting and sensor noise.

Limitations include:

  • Initial Mesh Estimation: CloDS requires a plausible starting mesh, constructed via 2D Gaussian Splatting plus TSDF fusion, but small perturbations in these estimates introduce only minor degradation in RMSE.
  • Lighting and Clutter: Complex illumination can reduce inversion fidelity; robust extension to highly cluttered scenes and joint segmentation remain open research avenues.

A plausible implication is that integration of learned BRDF representations or shadow-aware rendering could further enhance applicability in real-world captures (Zhan et al., 2 Feb 2026).

7. Relation to Prior Work and Outlook

CloDS extends prior mesh-based and pixel-supervised approaches, notably surpassing methods that use only 2D or fixed-parameter state trackers in capturing the spatiotemporal complexity of cloth. The mesh-anchored, differentiable splatting formulation forms a bidirectional bridge between 2D observations and inferred 3D geometry, supporting generalization and physical plausibility even in the absence of mesh or parameter supervision.

The methodology’s modularity has been demonstrated by substituting different GNN backbones (e.g., HCMT, DHMP, BSMS-GNN) without loss of accuracy, suggesting extensibility to other non-rigid or multi-body dynamic scenarios. CloDS establishes a general-purpose, robust framework for visual-only, unsupervised simulation and understanding of cloth and potentially other articulated or highly deformable objects (Zhan et al., 2 Feb 2026, Longhini et al., 3 Jan 2025).
