GenDoP: Auto-regressive Camera Trajectory Generation as a Director of Photography

Published 9 Apr 2025 in cs.CV | arXiv:2504.07083v2

Abstract: Camera trajectory design plays a crucial role in video production, serving as a fundamental tool for conveying directorial intent and enhancing visual storytelling. In cinematography, Directors of Photography meticulously craft camera movements to achieve expressive and intentional framing. However, existing methods for camera trajectory generation remain limited: Traditional approaches rely on geometric optimization or handcrafted procedural systems, while recent learning-based methods often inherit structural biases or lack textual alignment, constraining creative synthesis. In this work, we introduce an auto-regressive model inspired by the expertise of Directors of Photography to generate artistic and expressive camera trajectories. We first introduce DataDoP, a large-scale multi-modal dataset containing 29K real-world shots with free-moving camera trajectories, depth maps, and detailed captions in specific movements, interaction with the scene, and directorial intent. Thanks to the comprehensive and diverse database, we further train an auto-regressive, decoder-only Transformer for high-quality, context-aware camera movement generation based on text guidance and RGBD inputs, named GenDoP. Extensive experiments demonstrate that compared to existing methods, GenDoP offers better controllability, finer-grained trajectory adjustments, and higher motion stability. We believe our approach establishes a new standard for learning-based cinematography, paving the way for future advancements in camera control and filmmaking. Our project website: https://kszpxxzmc.github.io/GenDoP/.

Summary

GenDoP: Auto-Regressive Camera Trajectory Generation as a Director of Photography

The paper "GenDoP: Auto-regressive Camera Trajectory Generation as a Director of Photography" introduces GenDoP, an innovative approach that enhances camera trajectory design for video production using an auto-regressive model. The research defines GenDoP as a tool inspired by the expertise of Directors of Photography, focusing on generating artistic trajectories to align closely with directorial intent. The introduction of the DataDoP dataset, a large-scale, multi-modal repository, supports the training of GenDoP and serves as a novel resource for advancing learning-based cinematography.

Contribution and Methodology

In addressing existing limitations, the authors highlight insufficiencies in traditional and contemporary trajectory generation methods, which often suffer from procedural rigidity or structural biases. GenDoP leverages an auto-regressive, transformer-based model that departs from geometric optimizations and procedural constraints, allowing for more expressive and creatively aligned outputs.

DataDoP Dataset: The authors introduce DataDoP, comprising 29K real-world shots tagged with motion categories and paired with RGBD inputs and directorial captions. These captions describe intricate camera movements, scene interactions, and directorial intent, providing a comprehensive, context-rich dataset for training the GenDoP model.
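To make the dataset's structure concrete, the sketch below models one DataDoP-style sample as a plain Python record. The field names and the 7-DoF pose format (translation plus quaternion) are illustrative assumptions, not the paper's actual schema:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class DataDoPSample:
    """One shot from a DataDoP-style dataset (field names are illustrative)."""
    trajectory: List[List[float]]   # per-frame camera pose, e.g. [tx, ty, tz, qx, qy, qz, qw]
    depth_maps: List[str]           # paths to per-frame depth maps
    motion_caption: str             # describes the camera movement itself
    directorial_caption: str        # describes scene interaction and intent

# A minimal two-frame example: the camera translates along -z (a push-in).
sample = DataDoPSample(
    trajectory=[[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],
                [0.0, 0.0, -0.1, 0.0, 0.0, 0.0, 1.0]],
    depth_maps=["frame_0000_depth.png", "frame_0001_depth.png"],
    motion_caption="The camera dollies forward slowly.",
    directorial_caption="A slow push-in builds tension toward the subject.",
)
```

Separating the movement caption from the directorial-intent caption mirrors the paper's distinction between describing *what* the camera does and *why*.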

GenDoP Model: The core of the method is an auto-regressive, decoder-only transformer that models camera movements as discrete tokens and aligns trajectory generation with textual guidance. Experiments show GenDoP's proficiency in refining trajectory detail, enhancing motion stability, and offering improved controllability over prior models such as CCD, E.T., and Director3D, particularly when trained on the DataDoP dataset.
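Modeling continuous camera poses as discrete tokens requires a quantization step before the transformer can predict them auto-regressively. The paper does not specify its bin count or value range here, so the sketch below shows only the generic quantize/dequantize mechanics with assumed parameters (256 uniform bins over [-1, 1]):

```python
import numpy as np

def quantize_trajectory(poses, n_bins=256, lo=-1.0, hi=1.0):
    """Map continuous pose parameters to discrete token ids via uniform binning.
    Bin count and value range are illustrative, not taken from the paper."""
    poses = np.clip(np.asarray(poses, dtype=float), lo, hi)
    return np.round((poses - lo) / (hi - lo) * (n_bins - 1)).astype(int)

def dequantize_trajectory(tokens, n_bins=256, lo=-1.0, hi=1.0):
    """Invert quantization, recovering poses up to half a bin of error."""
    return np.asarray(tokens, dtype=float) / (n_bins - 1) * (hi - lo) + lo

# Two frames of a hypothetical 3-parameter pose; real poses have more DoF.
poses = [[0.0, 0.5, -0.25], [0.1, 0.5, -0.2]]
tokens = quantize_trajectory(poses)          # discrete ids the transformer predicts
reconstructed = dequantize_trajectory(tokens)  # decoded back to continuous poses
```

At inference time, a decoder-only transformer would emit such token ids one at a time, conditioned on the text caption (and RGBD context), and the dequantizer would turn the sampled sequence back into a camera path.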

Evaluation and Results

Evaluation focused on text-trajectory alignment and trajectory quality. GenDoP demonstrated superior control and precision, evidenced by the following metrics:

  • Improved CLaTr-CLIP scores, indicating enhanced alignment between textual instructions and the generated trajectories.
  • Lower CLaTr-FID scores compared to pre-existing diffusion models, reflecting better quality in trajectory synthesis.

The research further emphasized GenDoP's robustness in generating consistent, stable paths, reducing the trajectory-level noise and jitter often observed in non-autoregressive models.
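The two metrics above follow familiar templates: a CLIP-style cosine similarity between text and trajectory embeddings, and a Fréchet distance between feature distributions of real and generated trajectories. The actual CLaTr encoder is learned and not reproduced here; this numpy-only sketch uses stand-in feature vectors purely to show the scoring mechanics:

```python
import numpy as np

def clip_style_alignment(text_emb, traj_emb):
    """Cosine similarity between a text and a trajectory embedding
    (the CLaTr-CLIP idea; real embeddings come from a learned encoder)."""
    t = text_emb / np.linalg.norm(text_emb)
    j = traj_emb / np.linalg.norm(traj_emb)
    return float(t @ j)

def frechet_distance(real_feats, gen_feats):
    """FID-style distance between two feature sets (the CLaTr-FID idea).
    Matrix square root via eigendecomposition; a minimal sketch, not a
    reference implementation."""
    mu_r, mu_g = real_feats.mean(0), gen_feats.mean(0)
    cov_r = np.cov(real_feats, rowvar=False)
    cov_g = np.cov(gen_feats, rowvar=False)
    # sqrt of cov_r @ cov_g; eigenvalues may pick up tiny imaginary parts
    w, v = np.linalg.eig(cov_r @ cov_g)
    covmean = (v @ np.diag(np.sqrt(w.astype(complex))) @ np.linalg.inv(v)).real
    return float(np.sum((mu_r - mu_g) ** 2)
                 + np.trace(cov_r + cov_g - 2.0 * covmean))

# Stand-in features: "generated" trajectories slightly offset from "real" ones.
rng = np.random.default_rng(0)
real = rng.normal(size=(200, 4))
gen = rng.normal(loc=0.1, size=(200, 4))
fid_like = frechet_distance(real, gen)
```

Higher alignment and lower Fréchet distance are better, which matches the direction of the improvements reported for GenDoP.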

Implications and Future Directions

Practical Implications: GenDoP sets a new benchmark for integrating advanced generative models into cinematography, facilitating automation in trajectory design with applications in both text-to-video and image-to-video contexts. Its ability to generate stable, complex trajectories that unambiguously convey artistic intent makes it a valuable tool for filmmakers.

Theoretical Implications: The work highlights the flexibility of auto-regressive models in tasks traditionally dominated by procedural approaches. This serves as a proof of concept for exploring similar methods in other generative tasks, suggesting a shift towards leveraging adaptive, context-aware models.

Speculative Future Developments: One potential area of expansion could involve further integrating multi-modal data, like 4D point clouds, into GenDoP, enriching the generative process with deeper spatial understanding. Additionally, the evolution of unified pipelines for trajectory and video content generation could expedite film production, ushering in more sophisticated AI-driven filmmaking tools.

The researchers' exploration of an under-addressed area in AI cinematography positions GenDoP as a significant step in bridging technical advancement with artistic expression. As these techniques mature, they are poised to redefine workflows in digital media creation.
