TrajectoryCrafter: Redirecting Camera Trajectory for Monocular Videos via Diffusion Models (2503.05638v1)

Published 7 Mar 2025 in cs.CV, cs.AI, and cs.GR

Abstract: We present TrajectoryCrafter, a novel approach to redirect camera trajectories for monocular videos. By disentangling deterministic view transformations from stochastic content generation, our method achieves precise control over user-specified camera trajectories. We propose a novel dual-stream conditional video diffusion model that concurrently integrates point cloud renders and source videos as conditions, ensuring accurate view transformations and coherent 4D content generation. Instead of leveraging scarce multi-view videos, we curate a hybrid training dataset combining web-scale monocular videos with static multi-view datasets, by our innovative double-reprojection strategy, significantly fostering robust generalization across diverse scenes. Extensive evaluations on multi-view and large-scale monocular videos demonstrate the superior performance of our method.

Summary

TrajectoryCrafter: Redirecting Camera Trajectory for Monocular Videos via Diffusion Models

This paper presents TrajectoryCrafter, a framework for redirecting the camera trajectory of a monocular video. The method uses video diffusion models to generate coherent 4D content while precisely following a user-specified camera trajectory. It addresses the limitations of reconstruction-based approaches, which struggle with occluded regions and depend on synchronized multi-view captures that are impractical for ordinary users.

Key Contributions

  1. Dual-Stream Conditional Video Diffusion Model: The paper introduces a dual-stream video diffusion model that handles deterministic view transformation and stochastic content generation concurrently. By taking point cloud renders and the source video as conditional inputs, the model achieves accurate view synthesis and spatially and temporally coherent 4D content (a toy conditioning sketch follows this list).
  2. Data Curation Strategy: TrajectoryCrafter sidesteps the need for synchronized multi-view video, which is scarce, by combining web-scale monocular videos with static multi-view datasets. A double-reprojection technique synthesizes aligned training pairs from this data, fostering robust generalization across a wide variety of scenes (see the reprojection sketch after this list).
  3. Point Cloud Based View Transformation: The source video is lifted into a dynamic point cloud via depth estimation, and novel views along the user-specified trajectory are rendered from it. This yields precise trajectory control, while occluded or missing regions are left to the diffusion model to generate.

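The view transformation at the heart of contributions 2 and 3 is classical depth-based reprojection. Below is a minimal NumPy sketch, not the authors' implementation: it lifts a frame into 3D with an estimated depth map, splats it into a target camera (leaving holes where content is occluded), and chains two such warps, out to a pseudo-novel view and back, to fabricate an aligned (degraded render, ground-truth frame) training pair in the spirit of the double-reprojection strategy. Function names, camera conventions, and the reuse of a single depth map per warp are simplifying assumptions.

```python
import numpy as np

def reproject(frame, depth, K, T_rel):
    """Warp `frame` (H, W, 3) with per-pixel `depth` (H, W) from the source
    camera to a target camera, given intrinsics K (3, 3) and the relative
    source-to-target pose T_rel (4, 4). Nearest-neighbour splatting, no
    z-buffering, to keep the sketch short."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], -1).reshape(-1, 3).T        # (3, H*W)
    d = depth.reshape(-1)
    pts = np.linalg.inv(K) @ pix * d                                    # back-project to 3D
    pts = T_rel @ np.vstack([pts, np.ones((1, pts.shape[1]))])          # move to target frame
    proj = K @ pts[:3]
    z = proj[2]
    uu = np.round(proj[0] / np.maximum(z, 1e-6)).astype(int)
    vv = np.round(proj[1] / np.maximum(z, 1e-6)).astype(int)
    ok = (d > 0) & (z > 0) & (uu >= 0) & (uu < W) & (vv >= 0) & (vv < H)
    render = np.zeros_like(frame)
    new_depth = np.zeros((H, W))
    mask = np.zeros((H, W), dtype=bool)                                 # True where covered
    render[vv[ok], uu[ok]] = frame.reshape(-1, 3)[ok]
    new_depth[vv[ok], uu[ok]] = z[ok]
    mask[vv[ok], uu[ok]] = True
    return render, new_depth, mask

def double_reprojection_pair(frame, depth, K, T_novel):
    """Warp to a pseudo-novel view and back to the source view: the result is
    pixel-aligned with `frame` but carries occlusion holes and resampling
    artifacts, yielding a (condition, mask, ground truth) training triple
    without any multi-view footage."""
    novel, novel_depth, _ = reproject(frame, depth, K, T_novel)
    degraded, _, mask = reproject(novel, novel_depth, K, np.linalg.inv(T_novel))
    return degraded, mask, frame
```

The holes marked in the returned mask are exactly the regions the diffusion model is asked to hallucinate, both when training on such fabricated pairs and when rendering a genuinely new trajectory at inference time.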
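Contribution 1 can then be pictured as a denoiser that consumes two condition streams. The following is a hypothetical toy module, not the paper's architecture: point-cloud-render latents are concatenated channel-wise with the noisy latent (anchoring the deterministic view transformation), while tokens from the source video are injected via cross-attention (guiding content generation); every module name, shape, and fusion choice here is illustrative.

```python
import torch
import torch.nn as nn

class DualStreamDenoiser(nn.Module):
    """Toy dual-stream conditioning: render latents enter by channel
    concatenation, source-video tokens enter by cross-attention."""

    def __init__(self, latent_ch=4, hidden=64, ctx_dim=64, heads=4):
        super().__init__()
        self.in_proj = nn.Conv3d(latent_ch * 2, hidden, kernel_size=3, padding=1)
        self.ctx_proj = nn.Linear(ctx_dim, hidden)
        self.cross_attn = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.out_proj = nn.Conv3d(hidden, latent_ch, kernel_size=3, padding=1)

    def forward(self, noisy_latent, render_latent, source_tokens):
        # noisy_latent / render_latent: (B, C, T, H, W); source_tokens: (B, N, ctx_dim)
        x = self.in_proj(torch.cat([noisy_latent, render_latent], dim=1))
        b, c, t, h, w = x.shape
        q = x.flatten(2).transpose(1, 2)                    # (B, T*H*W, hidden)
        kv = self.ctx_proj(source_tokens)                   # (B, N, hidden)
        attn, _ = self.cross_attn(q, kv, kv)
        x = x + attn.transpose(1, 2).reshape(b, c, t, h, w)
        return self.out_proj(x)                             # predicted noise

model = DualStreamDenoiser()
eps = model(torch.randn(1, 4, 8, 16, 16),   # noisy video latent
            torch.randn(1, 4, 8, 16, 16),   # point-cloud render latent
            torch.randn(1, 77, 64))         # tokens from the source video
print(eps.shape)                            # torch.Size([1, 4, 8, 16, 16])
```

In this framing, the pixel-aligned render stream carries the deterministic geometry of the target view, while the attended source-video stream supplies appearance for regions the render cannot cover.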
Evaluation and Results

The research includes comprehensive evaluations using both synchronized multi-view datasets and large-scale monocular video datasets. Quantitative metrics such as PSNR, SSIM, and LPIPS are used alongside qualitative analyses to demonstrate the superior performance of TrajectoryCrafter over existing methods.

  • On synchronized multi-view datasets, TrajectoryCrafter outperformed baselines such as GCD, ViewCrafter, and Shape of Motion, producing high-fidelity videos that follow the specified trajectories closely.
  • Its ability to generalize across diverse scenes was further validated on a large-scale in-the-wild monocular video benchmark, where it showed marked improvements in aesthetic and consistency metrics under the VBench protocol.
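For completeness, the frame-level metrics referenced above can be computed with off-the-shelf packages. The sketch below uses scikit-image for PSNR/SSIM and the lpips package for LPIPS; the random arrays merely stand in for predicted and ground-truth frames and are not the paper's evaluation code.

```python
import numpy as np
import torch
import lpips
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate_frame(pred, gt, lpips_fn):
    """pred, gt: (H, W, 3) uint8 frames; returns (PSNR, SSIM, LPIPS)."""
    psnr = peak_signal_noise_ratio(gt, pred, data_range=255)
    ssim = structural_similarity(gt, pred, channel_axis=-1, data_range=255)
    # LPIPS expects (1, 3, H, W) float tensors in [-1, 1].
    to_t = lambda x: torch.from_numpy(x).permute(2, 0, 1)[None].float() / 127.5 - 1.0
    lp = lpips_fn(to_t(pred), to_t(gt)).item()
    return psnr, ssim, lp

lpips_fn = lpips.LPIPS(net="alex")
pred = np.random.randint(0, 256, (256, 256, 3), dtype=np.uint8)
gt = np.random.randint(0, 256, (256, 256, 3), dtype=np.uint8)
print(evaluate_frame(pred, gt, lpips_fn))
```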

Implications and Future Directions

TrajectoryCrafter marks a significant advance in view synthesis for monocular video, particularly in how it disentangles deterministic view transformation from stochastic content generation. The work points to further directions in AI-driven video synthesis, notably refining depth estimation and improving inference speed to reduce the current computational cost.

Future research may focus on extending the framework to support larger-scale trajectory modifications, with improvements in handling depth inaccuracies to further enhance realism and applicability. Additionally, since diffusion models are computationally expensive, future iterations could explore optimization techniques or alternative models that retain quality while reducing computational overhead.

In conclusion, TrajectoryCrafter represents a pivotal step in democratizing high-fidelity video editing, making realistic camera trajectory modifications accessible without the constraints of traditional multi-view requirements. This capability underscores the broadening applicability of AI to creative and technical domains by leveraging sophisticated generative models like diffusion models.
