Geometry-Guided Generative Inbetweening
- Geometry-guided generative inbetweening is a paradigm that leverages explicit geometric signals (e.g., depth, normals, segmentation) to generate intermediate frames with high spatial fidelity.
- It employs diverse encoding strategies—from CNN-based latent injection to Transformer-driven graph matching and biomechanical constraints—to ensure precise motion and structural coherence.
- Applications span video interpolation, 3D motion transitions, and face animation, demonstrating enhanced stability and improved metrics such as PSNR, SSIM, and Chamfer Distance.
Geometry-guided Generative Inbetweening (GGI) is a methodological paradigm in image, video, motion, and 3D scene synthesis that leverages explicit geometric signals—such as depth, normal maps, structural graphs, or flow fields—to guide the generative process of creating intermediate states between keyframes. This approach enforces geometric consistency, provides precise motion or structural control, and can be implemented in diverse frameworks ranging from convolutional neural networks to transformer pipelines and generative autoencoders. GGI contrasts with appearance-only or naive interpolation methods by ensuring that intermediate content precisely respects the underlying spatial layout and physical constraints, leading to temporally stable and visually coherent inbetween outputs across scenes, objects, and structured data.
1. Foundational Principles of GGI
Geometry-guided generative inbetweening relies on the integration of geometric priors at multiple levels of generative architectures. Inputs may include 2D or 3D geometry proxies (normals, depth, segmentation, material labels), motion flow fields, or topological encodings derived from scene representations. A canonical example is the Geometry to Image Synthesis (GIS) framework, which builds a mapping

$$G: g \mapsto I$$

from geometry channels $g$ to an RGB image $I$, where the generator is conditioned on upsampled and rescaled geometry channels at each convolutional layer, incrementally refining the image and enforcing geometric consistency. Losses combine perceptual, adversarial, and diversity terms, e.g., a feature-space term of the form

$$\mathcal{L}_{\text{perc}} = \sum_{l} \lambda_l \,\big\| \phi_l(G(g)) - \phi_l(I) \big\|_1$$

for perceptual alignment (with $\phi_l$ the features of layer $l$ of a pretrained network), plus multi-choice penalties to produce diverse outputs. The GIS approach is agnostic to detailed appearance parameters, relying primarily on geometric cues (Alhaija et al., 2018).
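The multi-scale conditioning idea can be illustrated with a short sketch. The following is a minimal PyTorch toy, not the GIS implementation: module names, channel counts, and the geometry encoding (normals, depth, and labels stacked as channels) are illustrative assumptions.

```python
# Minimal sketch (PyTorch) of GIS-style multi-scale geometry conditioning.
# Names and channel counts are illustrative, not the authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GeometryConditionedGenerator(nn.Module):
    """Decoder that re-injects rescaled geometry maps at every scale."""
    def __init__(self, geom_ch=7, feat_ch=64, num_scales=3):
        super().__init__()
        self.blocks = nn.ModuleList()
        in_ch = geom_ch  # the coarsest level sees geometry only
        for _ in range(num_scales):
            self.blocks.append(nn.Sequential(
                nn.Conv2d(in_ch, feat_ch, 3, padding=1),
                nn.ReLU(inplace=True),
                nn.Conv2d(feat_ch, feat_ch, 3, padding=1),
                nn.ReLU(inplace=True),
            ))
            in_ch = feat_ch + geom_ch  # next level: features + geometry again
        self.to_rgb = nn.Conv2d(feat_ch, 3, 1)

    def forward(self, geom):  # geom: (B, geom_ch, H, W) normals/depth/labels
        x = None
        for i, block in enumerate(self.blocks):
            scale = 2 ** (len(self.blocks) - 1 - i)
            g = F.avg_pool2d(geom, scale) if scale > 1 else geom
            x = g if x is None else torch.cat(
                [F.interpolate(x, scale_factor=2, mode="bilinear",
                               align_corners=False), g], dim=1)
            x = block(x)
        return torch.sigmoid(self.to_rgb(x))

gen = GeometryConditionedGenerator()
geometry = torch.randn(1, 7, 64, 64)   # e.g. 3 normals + 1 depth + 3 labels
image = gen(geometry)                  # (1, 3, 64, 64)
```

The key design choice is that geometry re-enters at every resolution, so both coarse layout and fine structure stay pinned to the input geometry rather than drifting with the generated features.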
2. Geometric Encoding Strategies
GGI systems encode geometry through varied means:
- End-to-end latent and feature injection: Video diffusion and GAN-based models concatenate encoded geometric maps and latent motion codes at each step, allowing the generative process to adapt to precise structural cues.
- Graph and topological fusion: In cartoon and line art animation (AnimeInbet), raster drawings are vectorized as endpoint graphs, and inbetweening is formulated as a graph matching and fusion problem. Intermediate vertex embeddings and correspondences are computed via Transformer modules with self- and cross-attention operations, with fusion by interpolating matched vertices and propagating shifts to unmatched vertices (Siyao et al., 2023).
- Biomechanical constraints: In motion tweening for 3D bodies, explicit range-constrained forward kinematics (RC-FK) maps rotational outputs into physically valid joint positions, preserving constant bone lengths and anatomical plausibility. The system first predicts joint rotations, passes them through RC-FK, and then estimates the root trajectory via global velocity prediction (Zhou et al., 2020); a minimal forward-kinematics sketch follows this list.
- Flow and camera trajectory conditioning: Flow-aware video interpolation models (MoG, VideoFrom3D) concatenate intermediate flow fields and warped noise maps with latent representations. Camera motion is encoded via latent warping operations to synchronize generative outputs with input trajectories. These are used both for artifact correction and motion stabilization in the denoising process (Zhang et al., 7 Jan 2025, Kim et al., 22 Sep 2025).
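As referenced in the biomechanical-constraints item above, range-constrained forward kinematics can be sketched in a few lines. This NumPy toy uses a hypothetical 2D three-joint chain with made-up joint limits and bone lengths; the actual RC-FK layer in Zhou et al. (2020) is a differentiable module over full 3D skeletons.

```python
# Toy sketch of range-constrained forward kinematics (RC-FK) on a 2D chain.
# Joint limits and bone lengths are illustrative, not the paper's values.
import numpy as np

BONE_LENGTHS = [0.5, 0.4, 0.3]                 # constant: preserved by FK
ANGLE_LIMITS = [(-np.pi / 2, np.pi / 2),       # per-joint valid ranges
                (0.0, 2.5),
                (-1.0, 1.0)]

def rc_fk(raw_angles, root=np.zeros(2)):
    """Map unconstrained angle predictions to valid joint positions."""
    positions = [root]
    total_angle = 0.0
    for raw, (lo, hi), length in zip(raw_angles, ANGLE_LIMITS, BONE_LENGTHS):
        # squash the raw prediction into its anatomical range (tanh remap)
        angle = lo + (hi - lo) * (np.tanh(raw) + 1.0) / 2.0
        total_angle += angle                    # rotations accumulate down the chain
        direction = np.array([np.cos(total_angle), np.sin(total_angle)])
        positions.append(positions[-1] + length * direction)
    return np.stack(positions)                  # (num_joints + 1, 2)

joints = rc_fk(np.array([3.0, -0.2, 10.0]))     # extreme inputs stay valid
segments = np.linalg.norm(np.diff(joints, axis=0), axis=1)
assert np.allclose(segments, BONE_LENGTHS)      # bone lengths are constant
```

However extreme the raw predictions, the remapped angles stay in range and the bone lengths never change, which is exactly the plausibility guarantee the paragraph describes.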
3. Model Architectures and Conditioning
GGI architectures range from convolutional encoder–decoders and U-Net style diffusion models to Transformer-based correspondence solvers:
| Framework | Geometry Input | Conditioning Mechanism |
|---|---|---|
| GIS (CNN) | Normals, depth, material | Concatenation at multi-scale layers |
| AnimeInbet (Graph+Transformer) | Vertex graphs, topology | Geometric embedding + Transformer |
| MoG (VFI+Generative) | Intermediate flow, occlusion masks | Dual (latent & feature) injection |
| Face Animation GAN | Depth/normal, landmarks | Ensemble discriminators, rendering |
| VideoFrom3D (Diffusion) | Anchor-view latents, HED edge maps, flow | Channel concatenation in latent space |
Attention mechanisms (extended self-attention, cross-attention) and hybrid loss functions (perceptual, adversarial, "best-of-K," RL-inspired fusion) are commonly used to enforce geometric adherence during sampling and synthesis; a minimal "best-of-K" sketch follows.
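As a concrete instance of the "best-of-K" idea, the sketch below penalizes only the candidate closest to the ground truth, the standard multiple-choice-learning formulation; the tensor shapes and the MSE distance are assumptions, not any specific paper's loss.

```python
# Minimal sketch of a "best-of-K" (multiple-choice) diversity loss, assuming
# a generator that draws K candidate samples per input; only the closest
# sample is penalized, so the remaining samples are free to diversify.
import torch

def best_of_k_loss(samples, target):
    """samples: (B, K, C, H, W) candidate generations; target: (B, C, H, W)."""
    per_sample = ((samples - target.unsqueeze(1)) ** 2).flatten(2).mean(-1)
    return per_sample.min(dim=1).values.mean()  # (B, K) -> min over K -> mean

samples = torch.randn(2, 4, 3, 32, 32, requires_grad=True)
target = torch.randn(2, 3, 32, 32)
loss = best_of_k_loss(samples, target)
loss.backward()  # gradients flow only to each batch element's best candidate
```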
4. Applications and Performance Impact
GGI finds applications in:
- Video interpolation and animation: Synthesis of temporally and geometrically stable frames in video interpolation, especially in cases with large or ambiguous motion gaps. FCVG demonstrates this with frame-wise line correspondences driving explicit conditions for each intermediate frame, yielding stable and coherent output even across 23-frame gaps (Zhu et al., 16 Dec 2024); a toy construction of such per-frame conditions appears after this list.
- 3D motion and human pose transitions: Long-term inbetweening in complex human motions, preserving biomechanical plausibility and style via Motion DNA, yielding low root translation errors (0.7 cm/frame, 5.5 cm over 128 frames) (Zhou et al., 2020). Geometry distribution models for 3D human synthesis deliver substantial improvements in Chamfer and FID scores over SOTA baselines (Tang et al., 3 Mar 2025).
- Cartoon and line art animation: Geometric graph-based inbetweening outperforms pixel-based interpolation methods (e.g., FILM, RIFE) in scenarios with large vertex displacements, robustly preserving line structures as measured by Chamfer Distance (Siyao et al., 2023).
- Face animation: G3FA leverages 3D facial geometry cues from inverse rendering to produce high-fidelity talking head reenactment, improving resilience to large head rotations and preserving identity under adversarial, perceptual, and geometric losses (Javanmardi et al., 23 Aug 2024).
- 3D scene video generation: VideoFrom3D fuses anchor-view appearance with geometry-guided video inbetweening using flow-aware noise maps and HED-based edge structural guidance, surpassing prior methods in PSNR and SSIM across challenging camera trajectories (Kim et al., 22 Sep 2025).
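To make the FCVG-style per-frame conditioning concrete, the toy below linearly interpolates matched line endpoints to produce one explicit condition per intermediate frame; the data layout and the purely linear schedule are simplifying assumptions, not the paper's matching pipeline.

```python
# Toy sketch of building explicit per-frame conditions from matched lines,
# in the spirit of FCVG's frame-wise correspondences. Only the linear
# schedule between keyframes is illustrated; matching itself is assumed done.
import numpy as np

def inbetween_line_conditions(lines_a, lines_b, num_inbetween):
    """lines_a, lines_b: (N, 2, 2) matched line endpoints in frames 0 and T.
    Returns (num_inbetween, N, 2, 2): one line set per intermediate frame."""
    ts = np.linspace(0.0, 1.0, num_inbetween + 2)[1:-1]  # exclude keyframes
    return np.stack([(1 - t) * lines_a + t * lines_b for t in ts])

frame0 = np.array([[[10.0, 10.0], [50.0, 12.0]]])   # one matched line
frameT = np.array([[[30.0, 40.0], [70.0, 44.0]]])
conds = inbetween_line_conditions(frame0, frameT, num_inbetween=23)
print(conds.shape)  # (23, 1, 2, 2): an explicit condition for each frame
```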
5. Mathematical and Optimization Foundations
GGI frameworks often formulate the generative objective as constrained optimization over geometry-conditioned paths:
- Manifold and metric learning: Geometry-aware autoencoders (GAGA) learn warped Riemannian metrics via pullback from latent space to faithfully sample points and compute geodesics. With a decoder $f$, the latent metric is a pullback of the form

$$g_z = J_f(z)^{\top} J_f(z),$$

where $J_f$ is the Jacobian of $f$, and trajectories are computed by geodesic energy minimization,

$$E(\gamma) = \int_0^1 \dot{\gamma}(t)^{\top}\, g_{\gamma(t)}\, \dot{\gamma}(t)\, dt,$$

for one-point and population-level trajectories, with flow matching used to transport distributions (Sun et al., 16 Oct 2024); a discretized sketch of this energy minimization follows this list.
- Diffusion sampling and guidance: In keyframe video interpolation, temporal self-attention maps are rotated to produce coherent backward motion, and predictions from the two temporal directions are fused, e.g., by averaging the backward branch after time reversal,

$$\hat{x} = \tfrac{1}{2}\big(\hat{x}^{\rightarrow} + \operatorname{reverse}_t(\hat{x}^{\leftarrow})\big),$$

ensuring forward-backward motion consistency (Wang et al., 27 Aug 2024).
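The geodesic-energy objective above admits a simple discretization: fix the endpoint latents, free the interior path points, and minimize the summed squared decoder displacements. The sketch below uses a random untrained MLP as a stand-in decoder, so it demonstrates only the optimization structure, not GAGA itself.

```python
# Minimal sketch of discretized geodesic-energy minimization under a decoder
# pullback metric. The MLP decoder and endpoints are placeholders; minimizing
# sum ||f(z_{i+1}) - f(z_i)||^2 approximates the continuous energy E(gamma)
# under the pullback metric of f (up to a constant factor).
import torch
import torch.nn as nn

decoder = nn.Sequential(nn.Linear(2, 64), nn.Tanh(), nn.Linear(64, 3))
for p in decoder.parameters():
    p.requires_grad_(False)                    # only the path is optimized

z_start, z_end = torch.tensor([-1.0, 0.0]), torch.tensor([1.0, 0.5])
inner = torch.linspace(0, 1, 8)[1:-1, None] * (z_end - z_start) + z_start
inner = inner.clone().requires_grad_(True)     # free interior path points

opt = torch.optim.Adam([inner], lr=1e-2)
for step in range(200):
    path = torch.cat([z_start[None], inner, z_end[None]])  # fixed endpoints
    segs = decoder(path[1:]) - decoder(path[:-1])          # ambient steps
    energy = (segs ** 2).sum()                 # discrete geodesic energy
    opt.zero_grad(); energy.backward(); opt.step()

print(f"final energy: {energy.item():.4f}")    # path relaxes toward a geodesic
```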
6. Challenges, Limitations, and Open Questions
While GGI methods deliver high-fidelity and temporally stable inbetweening, several challenges remain:
- Domain gap in geometric guidance: Many models train on simulated or synthetic geometry, which may not exactly match inference-time scene structures, necessitating robust encoding strategies (VideoFrom3D uses simulated HED maps in training to mitigate this gap) (Kim et al., 22 Sep 2025).
- Ambiguity in geometric correspondence: In line art animation and video synthesis, sparse or misaligned geometric correspondences can result in incomplete or distorted outputs. The reliability of feature and correspondence extraction (e.g., line matching robustness in FCVG and AnimeInbet) directly impacts performance (Siyao et al., 2023, Zhu et al., 16 Dec 2024).
- Temporal and population-level consistency: Ensuring that manifold-based interpolations respect not only geometric feasibility but also coherent population transport (e.g., in cell differentiation trajectories) requires effective metric warping and flow matching, but increases training complexity and sensitivity to hyperparameters (Sun et al., 16 Oct 2024).
- Artifact correction in complex motion regions: Even with explicit geometry guidance (MoG), corrections for artifact-prone regions depend on the adaptive refinement capacity of the generative pipeline and may face limitations as motion complexity scales (Zhang et al., 7 Jan 2025).
7. Future Directions
Advances in geometry-guided generative inbetweening are expected in several areas:
- Multi-modal geometric encoding: Integrating depth, normals, flow, segmentation, and graph-based features within unified generative architectures to robustly handle multimodal input cues.
- Adaptive guidance and self-supervised correction: Models that dynamically adjust geometric conditioning based on local reliability, possibly using adversarial or contrastive objectives to enforce alignment in ambiguous regions.
- Higher-dimensional and population modeling: Extensions to 4D (space-time) synthesis where geometry must be tracked and evolved across independent fragments—e.g., piecewise Gaussian spline modeling and hierarchical motion fragmentation (Nag et al., 11 Apr 2025).
- Scalability to real-world workflows: Methods that minimize reliance on paired datasets, as in VideoFrom3D, or that harmonize 3D graphics pipelines with generative models (I2V3D) for production-ready applications (Kim et al., 22 Sep 2025, Zhang et al., 12 Mar 2025).
Geometry-guided generative inbetweening thus establishes a rigorous foundation for controllable, physically plausible, and high-fidelity synthesis of intermediate states across a broad spectrum of domains, leveraging explicit geometric cues and advanced generative architectures.