Mesh4D: 4D Mesh Reconstruction & Tracking

Updated 4 July 2026

The paper introduces a feed-forward model that recovers a canonical 3D mesh from the first video frame and captures the full sequence as a dense deformation field.
It presents a deformation VAE with skeleton-guided privileged training that compresses sequence-level motion into a compact latent representation, reducing tracking jitter.
Mesh4D employs a conditional deformation diffusion model to predict per-vertex displacements, outperforming the HY3D, L4GM, and GVFD baselines on the reported geometry metrics and on most novel-view synthesis metrics.

Mesh4D is a feed-forward model for monocular 4D mesh reconstruction and tracking that, given a single RGB video of a moving object, reconstructs a complete 3D mesh of the object and its time-varying motion as a deformation field. The method factors the problem into a canonical mesh reconstructed from the first frame and a dense deformation field predicted for the full sequence in one shot; this yields dense temporal correspondence because the same vertex indices and surface parameterization are preserved over time (Jiang et al., 8 Jan 2026).

1. Formulation and output representation

Mesh4D takes as input a monocular video

$I = \{I_t\}_{t=1}^{T}, \qquad I_t \in \mathbb{R}^{H \times W \times 3},$

and recovers the object’s shape in the first frame as a mesh

$M_1 = \langle V_1, F_1 \rangle,$

where $V_1 \in \mathbb{R}^{N_v \times 3}$ are vertices and $F_1 \in \mathbb{N}^{N_f \times 3}$ are triangular faces, together with a dense deformation field $T_{1 \rightarrow t}$ that maps the canonical mesh at time $1$ to each later time $t$. The deformed mesh at time $t$ is

$M_t = \left\langle V_1 + T_{1 \rightarrow t}(V_1), F_1 \right\rangle.$

The overall target mapping is

$\Phi: I \mapsto \left( M_1, \{ T_{1 \rightarrow t} \}_{t=1}^{T} \right).$

This formulation is central to the identity of Mesh4D. It does not predict an unrelated 3D shape for each frame. Instead, it predicts one canonical mesh plus a time-indexed displacement of all surface points. Dense tracking is therefore maintained by construction: the $i$-th vertex at any time $t$ corresponds to the same canonical point, the $i$-th row of $V_1$. The paper positions this as stronger than pairwise tracking or nearest-neighbor matching between independently reconstructed frames, and as a way to address the harder monocular problem of reconstructing complete geometry and motion rather than only visible geometry (Jiang et al., 8 Jan 2026).
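The representation can be illustrated with a minimal sketch in plain Python. The function names and the toy triangle are illustrative assumptions, not the paper's code; the sketch only shows that every frame reuses the canonical face list and that a vertex track is an index lookup.

```python
from typing import List, Tuple

Vec3 = Tuple[float, float, float]
Face = Tuple[int, int, int]


def deform_mesh(canonical_vertices: List[Vec3], displacements: List[Vec3]) -> List[Vec3]:
    """Return V_t = V_1 + T_{1->t}(V_1) for one time step."""
    if len(canonical_vertices) != len(displacements):
        raise ValueError("one displacement per canonical vertex is required")
    return [
        (vx + dx, vy + dy, vz + dz)
        for (vx, vy, vz), (dx, dy, dz) in zip(canonical_vertices, displacements)
    ]


def deform_sequence(canonical_vertices: List[Vec3], faces: List[Face],
                    displacement_sequence: List[List[Vec3]]):
    """Build M_t = <V_t, F_1> for every frame; the face list is shared, never re-predicted."""
    return [(deform_mesh(canonical_vertices, d), faces) for d in displacement_sequence]


# Toy data (hypothetical): one triangle translating along x over three frames.
V1 = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
F1 = [(0, 1, 2)]
T = [
    [(0.0, 0.0, 0.0)] * 3,  # t = 1: zero displacement in this toy example
    [(0.5, 0.0, 0.0)] * 3,
    [(1.0, 0.0, 0.0)] * 3,
]
meshes = deform_sequence(V1, F1, T)

# Vertex i in any frame corresponds to canonical vertex i, so tracking is indexing.
track_of_vertex_1 = [vertices[1] for vertices, _ in meshes]
```

Because the faces are shared, no matching step between frames is needed to obtain correspondences.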

A common misconception is that Mesh4D is primarily a rendering method. The paper instead frames it as structured 4D reconstruction: complete object shape, explicit motion, persistent topology, and dense correspondences across time. This representation also means that the quality of the first-frame mesh is structurally important, because later frames inherit its topology and parameterization.

2. Canonical shape recovery and the sequence-level deformation latent

The pipeline has three stages. First, Mesh4D reconstructs a canonical mesh from the first frame $I_1$ using the pretrained static image-to-3D model Hunyuan3D 2.1. Second, it learns a deformation VAE that compresses the motion of an entire mesh sequence into a compact latent code. Third, it trains a conditional latent diffusion / flow-matching model that predicts this deformation latent from the input video and the canonical mesh, then decodes it into the full per-vertex deformation field (Jiang et al., 8 Jan 2026).

The deformation VAE is the paper’s principal representational contribution. Rather than encoding frame pairs or independent per-frame states, it learns a latent for the motion of the entire sequence in a single pass. To build the encoder input, the method samples a point cloud from the canonical mesh $M_1$, then uses barycentric coordinates to transfer each sampled point across the sequence, creating a corresponding point cloud for every time step. The per-time input feature is built from three ingredients: a positional embedding, channel-wise concatenation, and a linear layer.
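The barycentric transfer step can be sketched as follows. This is a simplified illustration under stated assumptions: faces are chosen uniformly rather than by area, and the helper names `sample_barycentric` and `transfer_points` are hypothetical. Each sample stores a face index and barycentric weights once, then is re-evaluated on every frame's vertex positions.

```python
import random
from typing import List, Tuple

Vec3 = Tuple[float, float, float]
Face = Tuple[int, int, int]
Sample = Tuple[int, Tuple[float, float, float]]


def sample_barycentric(faces: List[Face], num_points: int, seed: int = 0) -> List[Sample]:
    """Draw (face index, barycentric weights) pairs on the canonical mesh.

    Simplification: faces are picked uniformly, not proportionally to area.
    """
    rng = random.Random(seed)
    samples = []
    for _ in range(num_points):
        face_index = rng.randrange(len(faces))
        r1, r2 = rng.random(), rng.random()
        s = r1 ** 0.5  # square-root trick gives a uniform point inside the triangle
        samples.append((face_index, (1.0 - s, s * (1.0 - r2), s * r2)))
    return samples


def transfer_points(samples: List[Sample], vertices: List[Vec3], faces: List[Face]) -> List[Vec3]:
    """Evaluate the stored samples on one frame's vertex positions."""
    points = []
    for face_index, (wa, wb, wc) in samples:
        a, b, c = (vertices[i] for i in faces[face_index])
        points.append(tuple(wa * a[k] + wb * b[k] + wc * c[k] for k in range(3)))
    return points


# Toy sequence (hypothetical): the triangle moves up by one unit in frame 2.
faces = [(0, 1, 2)]
frame_1 = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
frame_2 = [(x, y, z + 1.0) for x, y, z in frame_1]

samples = sample_barycentric(faces, num_points=4, seed=0)
points_1 = transfer_points(samples, frame_1, faces)
points_2 = transfer_points(samples, frame_2, faces)  # same samples, so point k matches point k
```

Because the weights are fixed on the canonical mesh, the k-th point in every frame's cloud refers to the same surface location.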

After initial embedding, Mesh4D compresses the spatial dimension with Farthest Point Sampling (FPS), reducing the densely sampled points to a smaller set of latent tokens. The sparse latent tokens are obtained by cross-attention, with the FPS-sampled points as queries and the denser point features as keys and values. The encoder then applies a stack of transformer blocks that alternate temporal attention, global attention, and spatial attention, with 1D RoPE positional encoding on the temporal dimension for temporal and global attention. Two projections produce the mean and variance of the latent distribution, from which the latent code is sampled.
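The Farthest Point Sampling step used to select the sparse token locations can be sketched with the textbook greedy algorithm. This is a generic illustration, not the paper's implementation; the starting index and the toy points are assumptions.

```python
import math
from typing import List, Sequence


def farthest_point_sampling(points: Sequence[Sequence[float]], num_samples: int,
                            start_index: int = 0) -> List[int]:
    """Greedy FPS: repeatedly pick the point farthest from the already selected set."""
    if not 0 < num_samples <= len(points):
        raise ValueError("num_samples must be between 1 and len(points)")
    selected = [start_index]
    # Distance from every point to its nearest selected point so far.
    min_dist = [math.dist(p, points[start_index]) for p in points]
    while len(selected) < num_samples:
        next_index = max(range(len(points)), key=min_dist.__getitem__)
        selected.append(next_index)
        for i, p in enumerate(points):
            min_dist[i] = min(min_dist[i], math.dist(p, points[next_index]))
    return selected


# Toy point set (hypothetical): two clusters on a line.
cloud = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (10.0, 0.0, 0.0), (11.0, 0.0, 0.0)]
token_indices = farthest_point_sampling(cloud, num_samples=3)
```

The selected indices spread across the shape, which is why FPS is a common choice for picking query locations before a cross-attention compression stage.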

The decoder lifts the latent back to the decoder's feature dimension, applies 16 spatio-temporal attention blocks, and uses the canonical vertices $V_1$ as query points in a cross-attention stage to recover the per-vertex deformation field. The VAE is trained with a dedicated objective, for which the paper also specifies the setting used in training. The paper argues that this sequence-level latent yields a more stable representation of the object’s overall deformation and reduces jitter because the whole animation is encoded jointly rather than framewise (Jiang et al., 8 Jan 2026).

3. Skeleton-guided privileged training

A distinctive aspect of Mesh4D is that the deformation latent is learned with skeleton-guided privileged information during training. Skeleton information is used not because it is required at test time, but because it provides strong priors about plausible articulated deformation. The paper injects two forms of skeletal cues into the VAE encoder: skinning weights and bones (Jiang et al., 8 Jan 2026).

The skinning weights describe how strongly each bone influences each sampled point. The encoder applies self-attention to the point features but masks or biases it according to skinning similarity, so that points influenced by similar bones attend to one another.
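One way to picture skinning-guided attention is an additive bias on the attention logits. The sketch below is an assumption for illustration only: it uses cosine similarity between skinning-weight rows and a hypothetical `scale` parameter, and the paper's exact mask or bias may differ.

```python
import math
from typing import List


def cosine_similarity(u: List[float], v: List[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def skinning_bias(skin_weights: List[List[float]], scale: float = 4.0) -> List[List[float]]:
    """Pairwise bias: large when two points are driven by similar bones (assumed form)."""
    return [[scale * cosine_similarity(wi, wj) for wj in skin_weights] for wi in skin_weights]


def softmax(row: List[float]) -> List[float]:
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def biased_attention_weights(logits: List[List[float]], bias: List[List[float]]) -> List[List[float]]:
    """Add the bias to the raw attention logits, then normalize each row."""
    return [softmax([l + b for l, b in zip(l_row, b_row)]) for l_row, b_row in zip(logits, bias)]


# Toy data (hypothetical): 3 points, 2 bones. Points 0 and 1 follow bone 0, point 2 follows bone 1.
weights = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
raw_logits = [[0.0] * 3 for _ in range(3)]  # uniform content scores, to isolate the bias
attention = biased_attention_weights(raw_logits, skinning_bias(weights))
```

With uniform content scores, point 0 places most of its attention on itself and point 1, and very little on point 2, which belongs to a different bone.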

Bone geometry is also injected by cross-attention. Each bone is represented by its head and tail positions, which are embedded into bone features. Point features then attend to these bone features.

The paper is explicit that the skeleton is only used while training the deformation VAE. At inference time, the deployed system only needs the monocular video and the first-frame reconstructed mesh. A common misconception is therefore that Mesh4D requires skeletons at inference time; the paper states the opposite. The role of skeletal information is to shape the latent space during learning, especially for rigid or articulated parts, not to serve as a required conditioning signal during deployment. This training design also explains the ablation results: without skeleton information, rigid parts can twist incorrectly, and the full model improves IoU, P2S, Chamfer, and the correspondence error (Corr) relative to the no-skeleton variant (Jiang et al., 8 Jan 2026).

4. Conditional deformation diffusion and inference

Once the deformation latent space has been learned, Mesh4D trains a deformation diffusion model to infer the deformation latent from observations. The model predicts this latent conditioned on the canonical mesh $M_1$ and the full input video $I$. It is built by extending the HY3D (Hunyuan3D 2.1) shape diffusion model with additional conditioning streams (Jiang et al., 8 Jan 2026).

The deformation diffusion learns a velocity field. Its inputs include a temporal embedding and a spatial embedding sampled from the canonical mesh, and each DiT block gains additional attention layers that incorporate temporal information from the video and shape information from the canonical mesh. The video is encoded framewise using DINOv2 Giant features, and the latent tokens cross-attend to the corresponding frame features. The model also conditions on the high-dimensional canonical shape feature from the static shape VAE.

The sampling process mirrors the shape model: the method starts from Gaussian noise and solves a first-order Euler ODE for 50 steps to reach the final latent. The supplementary inference details state that the model first segments the foreground moving object, reconstructs the canonical shape from the first frame using Hunyuan3D 2.1 with one input view, then performs 50 Euler ODE steps conditioned on the canonical shape latent and image features from all frames, and finally decodes the deformation latent into the per-vertex deformation field. Inference is therefore feed-forward apart from the diffusion-style denoising trajectory; there is no test-time optimization over geometry or motion.
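The 50-step first-order Euler integration can be sketched with a toy velocity field. Everything except the step count is an assumption for illustration: the real velocity comes from the conditioned DiT, whereas `toy_velocity` is a hypothetical stand-in that moves the state toward a fixed target, and the time convention (noise at t = 0, latent at t = 1) is assumed.

```python
import random
from typing import Callable, List

Velocity = Callable[[List[float], float], List[float]]


def euler_sample(velocity: Velocity, initial_state: List[float], num_steps: int = 50) -> List[float]:
    """Integrate dx/dt = v(x, t) from t = 0 to t = 1 with fixed-step forward Euler."""
    state = list(initial_state)
    dt = 1.0 / num_steps
    for step in range(num_steps):
        t = step * dt
        v = velocity(state, t)
        state = [x + dt * vx for x, vx in zip(state, v)]
    return state


def toy_velocity(target: List[float]) -> Velocity:
    """Stand-in for the learned network: points straight at a fixed target latent."""
    def velocity(state: List[float], t: float) -> List[float]:
        # t never reaches 1.0 inside euler_sample, so the division is safe.
        return [(y - x) / (1.0 - t) for x, y in zip(state, target)]
    return velocity


rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(4)]  # start from Gaussian noise
target_latent = [0.5, -1.0, 2.0, 0.0]            # hypothetical "clean" latent
latent = euler_sample(toy_velocity(target_latent), noise, num_steps=50)
```

In the actual model, the call to `velocity` would be one forward pass of the diffusion network conditioned on the canonical shape latent and the frame features, so inference costs 50 network evaluations and involves no optimization over geometry.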

The paper’s practical inference pipeline is:

segment the moving foreground object with a pretrained segmentation model;
resize and crop similarly to training;
use the first frame $I_1$ as input to Hunyuan3D 2.1 to reconstruct the canonical textured mesh $M_1$;
extract the canonical shape latent and the FPS-based spatial embedding from $M_1$;
extract framewise video features from all input frames using DINOv2 Giant;
run the deformation diffusion for 50 Euler ODE steps from Gaussian noise to obtain the deformation latent;
decode the deformation latent with the deformation decoder to recover the per-vertex deformation field $\{ T_{1 \rightarrow t} \}_{t=1}^{T}$;
form all deformed meshes $M_t$ by adding the decoded displacements to $V_1$.

This design is closely related to the broader class of canonical-mesh deformation methods. Motion 3-to-4 likewise anchors dynamic reconstruction to a canonical reference mesh and predicts per-frame trajectories rather than independent shapes, but frames the problem as 3D motion reconstruction for 4D synthesis from a monocular video and an optional 3D reference mesh (Chen et al., 20 Jan 2026). Mesh4D’s specific distinction is the compact latent space that encodes the entire animation sequence in a single pass and the privileged skeleton-guided training of that latent (Jiang et al., 8 Jan 2026).

5. Benchmark design, quantitative results, and ablations

Mesh4D is trained and evaluated on a new dynamic-object benchmark derived from Objaverse-1.0. Starting from the curated animated subset released by Diffusion4D, the authors extract skeletons, skinning weights, and corresponding mesh sequences, then filter by vertex and bone complexity, leaving about 9k instances. Each instance is rendered as a frontal video of up to 100 frames. For testing, they create a disjoint benchmark of 50 animated mesh sequences with significant motion and high-quality textures. Each test sequence is rendered from four fixed azimuths: one view is used as input and the other three for novel-view synthesis evaluation (Jiang et al., 8 Jan 2026).

Geometry is evaluated with volumetric IoU, point-to-surface distance (P2S), and Chamfer distance; tracking is evaluated with a correspondence error (Corr). Novel-view synthesis is evaluated with PSNR, SSIM, LPIPS, CLIP similarity, and FVD. The baselines are Hunyuan3D 2.1 (HY3D) run independently per frame with shared sampled noise, L4GM, and GVFD.
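The difference between a geometry metric and a tracking metric can be made concrete with a small sketch. The definitions below are common textbook variants (mean nearest-neighbor distance and mean per-vertex Euclidean error) and are assumptions; the paper's exact normalization and sampling may differ.

```python
import math
from typing import List, Sequence

Point = Sequence[float]


def chamfer_distance(points_a: List[Point], points_b: List[Point]) -> float:
    """Symmetric mean nearest-neighbor distance; ignores which point is which."""
    def one_sided(src: List[Point], dst: List[Point]) -> float:
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)
    return 0.5 * (one_sided(points_a, points_b) + one_sided(points_b, points_a))


def correspondence_error(predicted: List[Point], ground_truth: List[Point]) -> float:
    """Mean distance between vertices that share an index; sensitive to tracking."""
    if len(predicted) != len(ground_truth):
        raise ValueError("both vertex lists must share the same indexing")
    return sum(math.dist(p, g) for p, g in zip(predicted, ground_truth)) / len(predicted)


# Toy case (hypothetical): the prediction covers the right surface but swaps two vertices.
ground_truth = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
predicted = [(1.0, 0.0, 0.0), (0.0, 0.0, 0.0)]

geometry_score = chamfer_distance(predicted, ground_truth)
tracking_score = correspondence_error(predicted, ground_truth)
```

Here the Chamfer distance is zero while the correspondence error is one unit, which is why a tracking metric is reported alongside the geometry metrics.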

Method	IoU	P2S	Chamfer	Corr
HY3D	0.3071	0.0376	0.0370	n/a
L4GM	n/a	0.0459	0.0505	n/a
GVFD	n/a	0.0345	0.0378	0.0514
Mesh4D	0.3731	0.0287	0.0273	0.0384
Mesh4D (Aligned)	0.3949	0.0261	0.0243	0.0338

For novel-view synthesis, the reported results are:

Method	PSNR	SSIM	LPIPS	CLIP	FVD
HY3D	19.14	0.8976	0.1195	0.9174	692.2
L4GM	18.07	0.8939	0.1453	0.8954	747.3
GVFD	17.31	0.8912	0.1459	0.8802	905.0
Mesh4D	19.67	0.9018	0.1087	0.9141	601.9
Mesh4D (Aligned)	19.88	0.9030	0.1052	0.9141	572.7
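To put the lower-is-better numbers in proportion, the relative reduction of Mesh4D against the best baseline for each metric can be computed directly from the reported values. The helper name `relative_reduction` is illustrative, and the choice of "best baseline" per metric is read off the tables above.

```python
def relative_reduction(baseline: float, ours: float) -> float:
    """Fractional reduction of a lower-is-better metric relative to a baseline."""
    return (baseline - ours) / baseline


# (best baseline value, Mesh4D value), copied from the reported tables.
reported = {
    "P2S": (0.0345, 0.0287),      # best baseline: GVFD
    "Chamfer": (0.0370, 0.0273),  # best baseline: HY3D
    "Corr": (0.0514, 0.0384),     # only GVFD reports tracking
    "FVD": (692.2, 601.9),        # best baseline: HY3D
}

reductions = {name: relative_reduction(b, o) for name, (b, o) in reported.items()}
for name, value in reductions.items():
    print(f"{name}: {100 * value:.1f}% lower than the best baseline")
```

The unaligned Mesh4D row is used here, so these figures do not include the additional gain reported for the aligned variant.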

The ablations are particularly informative. For the deformation VAE, using the ground-truth first-frame mesh as canonical input, the full model improves over both the “without temporal/global attention” and “without skeleton information” variants. The reported results are:

without temporal/global attention: IoU 0.6328, P2S 0.0153, Chamfer 0.0113, Corr 0.0160
without skeleton information: IoU 0.6704, P2S 0.0148, Chamfer 0.0107, Corr 0.0138
full model: IoU 0.7039, P2S 0.0144, Chamfer 0.0099, Corr 0.0117

For the deformation diffusion, pretraining matters strongly. Training without HY3D initialization gives IoU 0.0819, P2S 0.0954, Chamfer 0.0854, and Corr 0.2063; with pretrained weights, the same metrics become 0.3433, 0.0327, 0.0308, and 0.0601. The paper also reports that classifier-free guidance does not help: without CFG, the model is slightly better than with CFG. These results support two of the paper’s broader claims: that a large pretrained 3D reconstruction prior is crucial because high-quality 4D datasets are limited, and that sequence-level motion modeling improves both geometry and rendering consistency (Jiang et al., 8 Jan 2026).

6. Position in the 4D mesh literature, misconceptions, and limitations

Mesh4D belongs to the family of canonical-mesh deformation methods. It shares with V2M4 the goal of producing usable mesh-centric 4D assets from monocular input, but the two methods differ sharply in formulation. V2M4 is an inference-time, optimization-based system built on TRELLIS framewise generation, camera search and mesh reposing, pairwise registration, and global texture optimization; Mesh4D instead is a feed-forward latent model that predicts the full animation in one shot from the video and the first-frame mesh (Chen et al., 11 Mar 2025). This places Mesh4D closer to learned 4D reconstruction and tracking than to post-hoc asset consolidation.

A second comparison is with methods such as DreamMesh4D and Motion 3-to-4. DreamMesh4D begins with a coarse mesh obtained through an image-to-3D generation procedure and drives motion through sparse control points and a geometric skinning algorithm in a Gaussian-mesh hybrid representation (Li et al., 2024). Motion 3-to-4 likewise decomposes 4D synthesis into static 3D shape generation and dynamic motion reconstruction from a canonical reference mesh (Chen et al., 20 Jan 2026). Mesh4D’s distinctive contribution within this lineage is the compact latent that encodes the entire animation sequence in a single pass, learned by a deformation VAE whose latent space is shaped by skeleton-guided privileged information (Jiang et al., 8 Jan 2026).

The main limitations follow directly from the representation. Mesh4D assumes a single moving object in monocular video, a fixed mesh topology across time, training data with corresponding animated meshes, and skeleton annotations available during training for the VAE. Skeletons are not needed at inference, but they are needed to learn the motion prior. The method also relies heavily on a good canonical mesh from the first frame. If the canonical reconstruction is wrong, the whole 4D sequence inherits that error. The model cannot represent topological changes, and the paper states that it struggles on extremely non-rigid objects (Jiang et al., 8 Jan 2026).

These limitations are not merely implementation details; they define the boundary of the method’s applicability. Helix4D, which explicitly positions itself against deformation-based methods such as Mesh4D, argues that canonical-mesh tracking approaches cannot naturally model topology changes, suffer from vertex sticking when later frames require fused regions to separate, and are limited on emerging objects and inner surfaces (Yenphraphai et al., 25 May 2026). This does not negate Mesh4D’s contribution; it clarifies that Mesh4D is strongest when topology is stable and dense correspondence over time is the main objective, rather than topology-changing dynamic mesh generation.

In that sense, Mesh4D marks a specific point in the evolution of 4D mesh methods. It shifts monocular dynamic reconstruction away from framewise geometry guessing and toward structured 4D reconstruction with explicit correspondences, while still inheriting the classical fixed-topology trade-off of canonical deformation models (Jiang et al., 8 Jan 2026).