InverseCrafter: Inverse Synthesis for Video & CAD

Updated 12 December 2025

InverseCrafter is a framework that casts content synthesis as an inverse problem, recovering structured and editable digital artifacts from partial observations.
It leverages latent-space encoding, pretrained VAEs, and a decomposed diffusion sampler to balance generative priors with measurement consistency.
The method achieves high-fidelity video recapture, inpainting, and CAD reverse engineering, demonstrated by competitive metrics in PSNR, LPIPS, and SSIM.

InverseCrafter is a class of computational methods that formulate content synthesis or recovery as an inverse problem—specifically, the recovery of structured, editable or semantically coherent digital artifacts from incomplete, modified, or externally specified observations. While the term spans diverse problem domains, recent usages focus on video recapture, controllable video inpainting, and 3D or CAD assembly inference, typically leveraging learned generative models or geometric priors to address underdetermined inverse mappings. Notable contemporary variants include the latent-domain video inpainting solver "InverseCrafter" (Hong et al., 5 Dec 2025), fabrication-aware CAD reverse engineering (Noeckel et al., 2021), and feedforward avatar parameter inversion (Wang et al., 3 Mar 2025). The unifying attribute is inversion from outputs or measurements (images, partial observations, degraded data) back to editable, controllable, or parameterized latent sources.

1. The Latent-Space Inverse Problem Formulation

The foundational paradigm is to treat synthesis as the solution to an inverse problem, wherein available measurements (such as a masked view of video frames, partial CAD object geometry, or images of an avatar) are related to latent variables or controllable parameters through a known or learned forward process. For instance, in "InverseCrafter: Efficient Video ReCapture as a Latent Domain Inverse Problem" (Hong et al., 5 Dec 2025), 4D video generation (novel-view synthesis or video recapture) is cast as inferring the underlying latent representation $z$ of the full video $x$ given observed projections $y = x \odot M_{\mathrm{pix}}$ , where $M_{\mathrm{pix}}$ is a spatiotemporal binary mask indicating measured pixels. The primary challenge is reconstructing $z$ under constraints induced by $y$ —i.e., estimating $p(z\,|\,z_{y},m)$ , with $z_{y} = z\odot m$ , where $m$ is a continuous multi-channel mask defined in latent space.

The general workflow is:

Encode observed measurements into latent space via a pretrained VAE $\mathcal{E}$ : $z = \mathcal{E}(x)$ , $z_{y} = \mathcal{E}(y) = z \odot m$ .
Reformulate the inverse recovery as a constrained diffusion process or energy minimization in the latent space.
Alternate between steps favoring the learned generative prior and those enforcing measurement consistency, converging to a solution compatible with both the prior and the observed data.

This latent-domain approach minimizes pixel-space encoder/decoder overhead, allows for efficient iterative optimization, and enables seamless integration of additional semantic guidance (e.g., text prompts, editable parameters).

2. Optimization Methodology and Data-Consistency Mechanisms

A central contribution of InverseCrafter (Hong et al., 5 Dec 2025) is its adaptation of the Decomposed Diffusion Sampler (DDS) to balance generative prior adherence with measurement fidelity at each diffusion step. The optimization step at each timestep $t$ consists of minimizing

$\min_{z}\;\frac{\gamma}{2}\|z - z_{y}\odot m\|^{2}_{2} + \frac{1}{2}\|z - \hat{z}_{0|t}\|_{2}^{2},$

where $\hat{z}_{0|t}$ is the denoised Tweedie estimate from the conditional flow model $\mathcal{F}^{\theta}$ . The closed-form solution can be achieved via conjugate-gradient methods due to the simple diagonal structure of the masking operator $H = \mathrm{diag}(m)$ .

This formulation ensures that the updated latent $z$ at each step remains close to both the denoised generative output and the masked latent observations, with the trade-off controlled by the hyperparameter $\gamma$ . The data-consistency step is performed only for early diffusion timesteps, as scheduled by a set $\Gamma$ (typically the initial $60\%$ of steps, with $\alpha = 0.6$ ).

An essential innovation is the construction of a continuous, multi-channel latent mask $m$ via a dedicated encoder $\mathcal{P}_{\phi} : \mathbb{R}^{FHW} \to \mathbb{R}^{Cfhw}$ , resulting in superior measurement transfer compared to naïve channel-agnostic downsampling. This encoder can be trained using ground-truth mask targets derived by differencing clean and masked latent encodings or approximated on the fly at inference time to enable a fully training-free workflow.

3. End-to-End Pipeline and Algorithmic Structure

The full InverseCrafter pipeline encompasses three interactive processes: encoding, inpainting/sampling in latent space, and decoding:

Mask Generation: Either a learned $\mathcal{P}_\phi$ or a training-free construction based on VAE encoder difference is used to create the continuous latent mask $m$ from the pixel mask $M_{\mathrm{pix}}$ .
Measurement Encoding: The observed measurement $y$ is encoded into latent space $z_{y} = \mathcal{E}(y)$ ; this, together with the mask $m$ , defines the measurement constraint.
Iterative Inverse Solving: Starting from an initial latent $z_{T}$ $z_{T}$ (typically $z_{y}$ $z_{y}$ ), the algorithm alternates:
- Tweedie denoising (posterior mean estimation via $\mathcal{F}^\theta$ ),
- Data-consistency proximal update (conjugate gradient step enforcing $z \approx z_y$ in observed regions),
- ODE integration of the generative flow.
Decoding: After all diffusion steps, the final latent $z_0$ is decoded via $\mathcal{D}$ to produce the edited or synthesized output video.

Notably, the process avoids backpropagation through the generative model and repeated VAE decoding, yielding both computational efficiency and minimal memory overhead. Sampling runtime is competitive with unconditional generation (71s vs 70s in benchmarks reported on the UltraVideo dataset).

4. Application Domains: Video Recapture and Beyond

InverseCrafter's core architecture is highly generalizable across video-based tasks requiring the imposition of measurement or user constraints in the generative process. Principal applications include:

Video Camera Control and Recapture: Given a source video $x_{\mathrm{src}}$ and a new camera trajectory $\{T_i\}$ , InverseCrafter synthesizes a novel-view sequence $x_{\mathrm{tgt}}$ by projecting and inpainting in latent space.
Text-Guided Video Inpainting: By incorporating text prompts or free-form editing, the system can synthesize temporally coherent inpainted content while preserving consistency with user-specified regions.
General-Purpose Video Editing: The underlying architecture is applicable to a broad range of video inpainting tasks, including object removal, motion editing, and novel object insertion.

The algorithm is compatible with both learned and training-free mask inference, supports multiple VAE backbones (with retraining of $\mathcal{P}_\phi$ as needed), and can flexibly incorporate additional modalities or constraints.

5. Quantitative Results and Qualitative Analysis

Extensive experiments reported in (Hong et al., 5 Dec 2025) establish InverseCrafter's efficacy in both measurement consistency and perceptual quality when compared to contemporaneous video generation and editing systems:

UltraVideo Camera Control: Achieves PSNR of 28.95, LPIPS of 0.0529, SSIM of 0.867, FVD of 102.5, and VBench of 0.897 in the learned-mask mode (n=1,000 clips). The training-free variant yields marginally lower but still state-of-the-art results, with near-zero computational overhead.
Text-Guided Inpainting (DAVIS): Delivers FVD of 1423 and VLM of 0.651, outperforming VideoPainter [SIGGRAPH’25], TrajCrafter [CVPR’25], and NVS-Solver [arXiv’24] in measurement preservation and semantic relevance.
Qualitative Observations: Maintains temporal and spatial coherence even in occluded or novel regions, preserves background consistency, and demonstrates semantically plausible content insertions under text or measurement control prompts.

The following table summarizes key camera-control results on UltraVideo:

Method	PSNR	LPIPS	SSIM	FVD	Runtime (s)
TrajCrafter	28.37	0.0573	0.894	120.0	134
NVS-Solver	26.68	0.0816	0.831	96.9	696
Ours (learned)	28.95	0.0529	0.867	102.5	71
Ours (train-free)	28.81	0.0534	0.865	100.2	71

6. Strengths, Limitations, and Extensions

Strengths:

Zero-shot deployment: no model fine-tuning on new videos or target domains.
Measurement consistency and high-fidelity recapture without compromising generative prior quality.
Low computational overhead; supports both learned and inference-time mask estimation.

Limitations:

Overall runtime dictated by the multi-step diffusion process, slower than feedforward generators.
Potential quality degradation if monocular depth estimation used for warping is inaccurate.
Mask encoder $\mathcal{P}_{\phi}$ must be retrained for each new VAE backbone.
Output inherits any spectral or semantic biases of the base generative video model.

Proposed Extensions:

Replacing iterative diffusion with distilled or consistency models to accelerate inference.
Integration of more accurate, multi-view geometric priors at the warping stage.
Joint optimization of depth maps and latent masks for improved robustness in challenging scenes.

7. Architectural and Implementation Details

InverseCrafter employs a modified Wan2.1-based 3D VAE (1.5M params, 240×416), a lightweight CNN-based mask encoder $\mathcal{P}_\phi$ , and a flow-based video diffusion model (Wan2.1-Fun-V1.1-1.3B-InP) at 480×832 resolution. Mask encoder training utilizes 46,500 samples from VidSTG with AdamW (lr= $10^{-4}$ , wd= $3\times10^{-2}$ ), batch size 16 (4×A6000 GPUs, 1 day). Conjugate gradient in the data-consistency step uses $K=5$ iterations; schedule $\Gamma$ covers initial $60\%$ timesteps.

Inference leverages DPM-Solver for ODE-based latent flow integration. No backpropagation or pixel-space decoding/encoding is performed after measurement encoding, enabling computational efficiency and memory savings (Hong et al., 5 Dec 2025).

InverseCrafter represents an overview of latent-space inverse problem methodology, instantiated in high-fidelity video generation and editing systems that emphasize measurement consistency, semantic controllability, and computational tractability. Its methodological innovations—including the use of per-channel latent masks and efficient data-consistency proximal steps—are directly applicable to a broad swath of inverse tasks in computer vision and graphics.