- The paper introduces SAGE and demonstrates that integrating structural anchoring, motion continuity, and layered blending enables zero-shot video transitions with high motion consistency and perceptual quality.
- SAGE follows a three-stage process: feature extraction, hierarchical B-spline interpolation, and diffusion-based synthesis conditioned on explicit edge maps.
- Quantitative evaluations and user studies confirm that SAGE outperforms classical and generative baselines in motion coherence and visual fidelity.
SAGE: Structure-Aware Generative Video Transitions between Diverse Clips
Introduction and Motivation
The SAGE framework addresses the challenge of synthesizing visually coherent and temporally smooth video transitions between diverse clips, where conventional methods such as cross-fades, morphing, and frame interpolation often fail because of semantic and structural disparities. The method is motivated by three artistic heuristics (structural anchoring, motion continuity, and layered blending) distilled from manual workflows in professional video editing. These principles guide the design of SAGE, which fuses explicit structural and motion cues with generative synthesis, enabling zero-shot transitions without task-specific fine-tuning or curated training data.
Figure 1: Artist-designed transitions illustrate the heuristics that inspire SAGE: structural anchoring, motion continuity, and layered blending.
Methodology
SAGE operates in three stages: feature extraction, structure matching and interpolation, and conditional generative synthesis. In the first stage, it extracts structural features (line segments via GlueStick), motion features (optical flow via SEA-RAFT), and layer features (foreground masks via SAM) from the boundary frames of the input clips. Line segments encode dominant contours and silhouettes, while optical flow captures local motion cues aligned with these structures. Foreground masks isolate salient regions, so that structural guidance focuses on perceptually relevant content.
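To make the role of the three feature types concrete, the following is a minimal sketch of how they might be combined once extracted. It does not call GlueStick, SEA-RAFT, or SAM; it assumes their outputs are already available as plain Python lists. The data layouts and the names `select_foreground_lines` and `mean_foreground_flow` are hypothetical, and the midpoint test for "foreground" is an illustrative choice, not something the source specifies.

```python
import math
from typing import List, Tuple

# Assumed layouts (hypothetical, for illustration only):
Line = Tuple[float, float, float, float]   # (x1, y1, x2, y2) in pixels
Mask = List[List[int]]                     # mask[y][x] == 1 marks foreground
Flow = List[List[Tuple[float, float]]]     # flow[y][x] == (dx, dy)


def select_foreground_lines(lines: List[Line], mask: Mask) -> List[Line]:
    """Keep the lines whose midpoint lies on a foreground pixel."""
    height, width = len(mask), len(mask[0])
    kept = []
    for x1, y1, x2, y2 in lines:
        mx = math.floor((x1 + x2) / 2.0)
        my = math.floor((y1 + y2) / 2.0)
        if 0 <= mx < width and 0 <= my < height and mask[my][mx]:
            kept.append((x1, y1, x2, y2))
    return kept


def mean_foreground_flow(flow: Flow, mask: Mask) -> Tuple[float, float]:
    """Average the flow vectors over foreground pixels; (0, 0) if there are none."""
    sum_x = sum_y = 0.0
    count = 0
    for y, row in enumerate(mask):
        for x, is_fg in enumerate(row):
            if is_fg:
                dx, dy = flow[y][x]
                sum_x += dx
                sum_y += dy
                count += 1
    if count == 0:
        return (0.0, 0.0)
    return (sum_x / count, sum_y / count)
```

The averaged flow vector produced here is the kind of quantity that could later steer a global foreground trajectory, while the filtered lines feed the matching step.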
Layer-aware Line Matching and Motion-aware Interpolation
Foreground lines are selected and normalized within canonical bounding boxes to facilitate robust matching. Hungarian matching is employed to establish one-to-one correspondences between foreground lines, using a cost matrix based on line centers (with potential extensions to orientation and length). This matching avoids background dominance and ensures semantic relevance.
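The matching step can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: lines are `(x1, y1, x2, y2)` tuples, the canonical space is the unit square spanned by each clip's foreground bounding box, and the cost is the Euclidean distance between normalized line centers. The helper names are hypothetical. The assignment solver is a plain-Python version of the Hungarian algorithm to keep the example dependency-free; in practice `scipy.optimize.linear_sum_assignment` solves the same problem.

```python
import math
from typing import List, Tuple

Line = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def normalize(lines: List[Line]) -> List[Line]:
    """Map lines into the unit square spanned by their joint bounding box."""
    xs = [v for l in lines for v in (l[0], l[2])]
    ys = [v for l in lines for v in (l[1], l[3])]
    xmin, ymin = min(xs), min(ys)
    w = (max(xs) - xmin) or 1.0  # guard against degenerate boxes
    h = (max(ys) - ymin) or 1.0
    return [((x1 - xmin) / w, (y1 - ymin) / h, (x2 - xmin) / w, (y2 - ymin) / h)
            for x1, y1, x2, y2 in lines]


def center_cost(a: List[Line], b: List[Line]) -> List[List[float]]:
    """cost[i][j] = distance between the centers of a[i] and b[j]."""
    def center(l: Line) -> Tuple[float, float]:
        return ((l[0] + l[2]) / 2.0, (l[1] + l[3]) / 2.0)
    return [[math.dist(center(la), center(lb)) for lb in b] for la in a]


def hungarian(cost: List[List[float]]) -> List[int]:
    """Minimum-cost assignment; requires rows <= columns.
    Returns the column assigned to each row."""
    n, m = len(cost), len(cost[0])
    inf = float("inf")
    u = [0.0] * (n + 1)          # row potentials
    v = [0.0] * (m + 1)          # column potentials
    p = [0] * (m + 1)            # p[j] = row matched to column j (1-based)
    way = [0] * (m + 1)
    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [inf] * (m + 1)
        used = [False] * (m + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], inf, 0
            for j in range(1, m + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(m + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:
                break
        while j0:                 # augment along the alternating path
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    assignment = [-1] * n
    for j in range(1, m + 1):
        if p[j]:
            assignment[p[j] - 1] = j - 1
    return assignment


def match_lines(lines_a: List[Line], lines_b: List[Line]) -> List[Tuple[int, int]]:
    """One-to-one (index_a, index_b) pairs; extra lines on the larger side stay unmatched."""
    if not lines_a or not lines_b:
        return []
    na, nb = normalize(lines_a), normalize(lines_b)
    if len(na) <= len(nb):
        return list(enumerate(hungarian(center_cost(na, nb))))
    return sorted((i, j) for j, i in enumerate(hungarian(center_cost(nb, na))))
```

Because both line sets are normalized before the cost is computed, a foreground object that is translated or rescaled between clips still matches line-for-line. Extending the cost with orientation or length terms would only require changing `center_cost`.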
To interpolate between matched structures, SAGE introduces a hierarchical scheme: global foreground trajectories are computed using cubic B-splines, guided by average flow vectors, while local line blending is performed in the canonical space and mapped back via the evolving bounding box. This two-scale approach enforces both global motion smoothness and local structural consistency, mitigating artifacts such as line crossings and structural collapse that arise in naïve linear interpolation.
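The two-scale scheme can be illustrated with a short sketch. Several details here are assumptions, since the summary does not give them: the global trajectory is a single clamped cubic B-spline segment with four control points (which coincides with a cubic Bézier curve), the two inner control points are obtained by offsetting the endpoints along the average flow vectors scaled by a hypothetical `flow_scale`, the bounding-box size is interpolated linearly, and local blending is a linear interpolation of matched endpoints in canonical coordinates. All function names are illustrative.

```python
from typing import List, Tuple

Point = Tuple[float, float]
Line = Tuple[float, float, float, float]   # canonical (x1, y1, x2, y2) in [0, 1]
Box = Tuple[float, float, float, float]    # (center_x, center_y, width, height)


def bspline_point(ctrl: List[Point], t: float) -> Point:
    """Clamped cubic B-spline with four control points, evaluated at t in [0, 1].
    For knots [0,0,0,0,1,1,1,1] the basis reduces to the cubic Bernstein polynomials."""
    s = 1.0 - t
    basis = (s ** 3, 3 * s * s * t, 3 * s * t * t, t ** 3)
    x = sum(b * p[0] for b, p in zip(basis, ctrl))
    y = sum(b * p[1] for b, p in zip(basis, ctrl))
    return (x, y)


def global_trajectory(start: Point, end: Point, flow_start: Point, flow_end: Point,
                      steps: int, flow_scale: float = 1.0) -> List[Point]:
    """Foreground-center path that leaves along flow_start and arrives along flow_end."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    ctrl = [
        start,
        (start[0] + flow_scale * flow_start[0], start[1] + flow_scale * flow_start[1]),
        (end[0] - flow_scale * flow_end[0], end[1] - flow_scale * flow_end[1]),
        end,
    ]
    return [bspline_point(ctrl, k / (steps - 1)) for k in range(steps)]


def lerp(a: float, b: float, t: float) -> float:
    return a + (b - a) * t


def interpolate_lines(canon_a: List[Line], canon_b: List[Line], box_a: Box, box_b: Box,
                      flow_a: Point, flow_b: Point, steps: int) -> List[List[Line]]:
    """canon_a[i] is assumed to be matched to canon_b[i].
    Returns, per step, the blended lines mapped back to pixel coordinates."""
    centers = global_trajectory(box_a[:2], box_b[:2], flow_a, flow_b, steps)
    frames = []
    for k, (cx, cy) in enumerate(centers):
        t = k / (steps - 1)
        w = lerp(box_a[2], box_b[2], t)     # evolving bounding-box size
        h = lerp(box_a[3], box_b[3], t)
        frame = []
        for la, lb in zip(canon_a, canon_b):
            x1, y1, x2, y2 = (lerp(a, b, t) for a, b in zip(la, lb))  # local blend
            frame.append((cx + (x1 - 0.5) * w, cy + (y1 - 0.5) * h,
                          cx + (x2 - 0.5) * w, cy + (y2 - 0.5) * h))
        frames.append(frame)
    return frames
```

Blending in canonical space keeps each line's position relative to the object fixed while the object itself moves and rescales, which is what prevents lines from crossing or collapsing toward a point the way they can when pixel-space endpoints are interpolated directly.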
Conditional Generative Synthesis
The interpolated line sets are rasterized into edge maps and used to condition a pretrained diffusion-based inbetweening model (e.g., Generative Inbetweening). ControlNet-style conditioning injects these edge maps alongside the boundary frames, enabling the synthesis of temporally smooth and semantically adaptive transitions in a zero-shot setting.
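Rasterizing a line set into an edge map is straightforward; the sketch below shows one way to do it with Bresenham's line algorithm in plain Python. The binary, one-pixel-wide output and the function names are assumptions made for illustration. The summary does not state the line thickness, anti-aliasing, or image format used for conditioning, and a real pipeline would more likely draw with an image library.

```python
from typing import List, Tuple

Line = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


def draw_line(img: List[List[int]], x0: int, y0: int, x1: int, y1: int) -> None:
    """Bresenham's algorithm; pixels outside the image are skipped."""
    height, width = len(img), len(img[0])
    dx = abs(x1 - x0)
    dy = -abs(y1 - y0)
    sx = 1 if x0 < x1 else -1
    sy = 1 if y0 < y1 else -1
    err = dx + dy
    while True:
        if 0 <= x0 < width and 0 <= y0 < height:
            img[y0][x0] = 1
        if x0 == x1 and y0 == y1:
            break
        e2 = 2 * err
        if e2 >= dy:      # step in x
            err += dy
            x0 += sx
        if e2 <= dx:      # step in y
            err += dx
            y0 += sy


def rasterize(lines: List[Line], width: int, height: int) -> List[List[int]]:
    """Return a height x width binary edge map with every line drawn."""
    img = [[0] * width for _ in range(height)]
    for x1, y1, x2, y2 in lines:
        draw_line(img, int(round(x1)), int(round(y1)), int(round(x2)), int(round(y2)))
    return img


def rasterize_sequence(frames: List[List[Line]], width: int, height: int) -> List[List[List[int]]]:
    """One edge map per interpolated frame, in order."""
    return [rasterize(lines, width, height) for lines in frames]
```

Each map in the resulting sequence would correspond to one intermediate frame, giving the generative model a per-frame structural target between the two boundary frames.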
Evaluation
Qualitative Results
SAGE demonstrates superior performance on a diverse set of video transitions, including complex changes in scene scale, object category, and motion direction. Compared to classical and generative baselines (FILM, TVG, DiffMorpher, Generative Inbetweening, VACE), SAGE consistently maintains motion coherence, foreground object integrity, and background consistency, even in challenging scenarios with large semantic gaps.
Figure 2: Qualitative results on diverse video clips, showcasing SAGE's performance on complex transitions in scene scale, object category, and motion direction.
Figure 3: Qualitative comparison with baseline methods, demonstrating SAGE's superior consistency in motion, foreground objects, and background scenery.
Quantitative Results
SAGE achieves the highest flow similarity to ground-truth, artist-designed transitions, validating its motion consistency. It also achieves competitive FID and FVD scores, indicating strong perceptual quality. Notably, some baselines (e.g., Generative Inbetweening and TVG) optimize for image and video metrics at the expense of motion adherence, resulting in abrupt or implausible transitions. SAGE balances both aspects, producing transitions that are both visually and temporally coherent.
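The summary does not define how flow similarity is computed. Purely as an illustration of what such a metric could look like, the sketch below averages the cosine similarity between corresponding optical-flow vectors of a generated transition and a reference transition. The function name, the per-pixel cosine formulation, and the handling of near-zero vectors are all assumptions and should not be read as the paper's definition.

```python
import math
from typing import List, Tuple

Flow = List[List[Tuple[float, float]]]  # flow[y][x] == (dx, dy)


def flow_cosine_similarity(flow_a: Flow, flow_b: Flow, eps: float = 1e-8) -> float:
    """Mean cosine similarity over pixels where both flow vectors are non-zero.
    Returns 0.0 when no pixel qualifies."""
    total = 0.0
    count = 0
    for row_a, row_b in zip(flow_a, flow_b):
        for (ax, ay), (bx, by) in zip(row_a, row_b):
            norm_a = math.hypot(ax, ay)
            norm_b = math.hypot(bx, by)
            if norm_a < eps or norm_b < eps:
                continue  # direction is undefined for (near-)static pixels
            total += (ax * bx + ay * by) / (norm_a * norm_b)
            count += 1
    return total / count if count else 0.0
```

Under this formulation a score near 1 means the generated motion points the same way as the reference, which is a property that image-level metrics such as FID do not capture.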
User Study
A user study with 26 participants confirms a strong preference for SAGE over all baselines across criteria including temporal consistency, plausibility, motion complexity, and overall preference. SAGE is preferred in over 80% of cases, with particularly high scores against DiffMorpher and Generative Inbetweening.
Ablations and Limitations
Ablation studies reveal that removing either structural or motion guidance significantly degrades transition quality, underscoring the complementarity of these cues. B-spline interpolation is especially beneficial for diverse clip pairs with misaligned motion trajectories. Failure cases arise when the generative backbone is biased (e.g., pretrained on human poses), leading to hallucinated content, or when clips lack salient linear features, resulting in unreliable correspondences. The method also does not explicitly model appearance blending, which can cause discontinuities in texture-rich regions.
Practical Implications and Future Directions
SAGE provides a practical tool for video editing, enabling content-aware transitions in creative workflows where collecting training data is infeasible. Its zero-shot nature and reliance on pretrained generative models make it readily deployable in professional and consumer applications. Conceptually, SAGE demonstrates the efficacy of combining explicit structural priors with generative synthesis, opening avenues for structure-aware video generation.
Future work may incorporate semantic cues (e.g., DINO features), higher-order correspondence costs, and appearance-aware generation to further enhance robustness and perceptual quality. Integrating local smoothness priors and adaptive blending strategies could extend SAGE to richer generative workflows, supporting even more complex transitions.
Conclusion
SAGE introduces a principled framework for structure-aware generative video transitions between diverse clips, leveraging artist-inspired heuristics and hierarchical structural guidance. The method achieves strong quantitative and qualitative results, outperforming both classical and generative baselines in motion consistency and perceptual quality. SAGE's design and evaluation establish a foundation for future research in content-aware video synthesis and editing, with promising directions for semantic integration and appearance modeling.