BulletTime: Decoupled Time & Pose Control

Updated 6 December 2025
  • The paper introduces a novel video diffusion framework that decouples world time and camera pose to allow independent control of scene dynamics and viewpoints.
  • It leverages 4D conditioning with rotary positional encoding and adaptive layer normalization to deliver precise spatiotemporal control and high-fidelity synthesis.
  • Experiments demonstrate BulletTime's superior performance on metrics such as PSNR, SSIM, and LPIPS compared to prior camera-controlled video generation methods.

BulletTime is a 4D-controllable video diffusion framework that enables explicit and independent control of both scene dynamics ("world time") and camera viewpoint (6-DOF pose), thereby overcoming the fundamental coupling between temporal evolution and camera motion inherent in conventional video generation models. Through a combination of novel 4D conditioning mechanisms and a bespoke disentangled dataset, BulletTime achieves robust synthesis of videos whose scene content and camera trajectory can be manipulated independently and continuously, a property critical for applications such as visual effects, dynamic scene understanding, and novel view and time synthesis (Wang et al., 4 Dec 2025).

1. Motivation and Problem Definition

Traditional video diffusion models entangle time progression (scene dynamics) and 3D camera motion. Such coupling inhibits precise spatial and temporal control, hindering use cases that require their independent manipulation, such as bullet-time effects, where an identical event is viewed simultaneously from multiple perspectives, or dynamic scene inspection with arbitrary viewpoint or retimed playback. BulletTime is introduced to address this constraint by explicitly disentangling world-time evolution and view trajectory in video synthesis (Wang et al., 4 Dec 2025).

Prior approaches for view and time synthesis, such as dynamic neural radiance fields (NeRFs) (Zhang et al., 2022), have incorporated explicit time and viewpoint arguments to enable arbitrary-time, arbitrary-view rendering, reinforcing the necessity of decoupled control in applications demanding continuous spatiotemporal manipulation.

2. Architectural Overview

BulletTime's architecture is centered on a latent-space diffusion transformer (DiT) denoising framework, underpinned by a VAE encoder that transforms an input video $X \in \mathbb{R}^{F \times H \times W \times 3}$ into latent tokens $\tilde{z}_t$. The DiT model iteratively denoises the latent tensor, minimizing the objective

$$L = \mathbb{E}_{z_t,\epsilon,t}\left[\|\epsilon - \epsilon_\theta(z_t, t, \tau, c)\|^2\right],$$

where $\epsilon_\theta(\cdot)$ conditions explicitly on both the world-time sequence $\tau$ and the camera pose sequence $c$.
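
As a concrete reading of this objective, the following PyTorch sketch computes the loss for a single training step, assuming a hypothetical epsilon-prediction network `eps_theta(z_t, t, tau, cam)` and a toy linear noise schedule (the paper's actual schedule and interfaces are not specified here).

```python
import torch

def denoising_loss(eps_theta, z0, tau, cam, num_train_timesteps=1000):
    """One step of the epsilon-prediction objective, conditioned on time and pose.

    eps_theta : callable (z_t, t, tau, cam) -> predicted noise (hypothetical interface)
    z0        : clean video latents, shape (B, F, C, H, W)
    tau       : world-time sequence, shape (B, F)
    cam       : camera-pose conditioning, shape (B, F, d_c)
    """
    B = z0.shape[0]
    t = torch.randint(0, num_train_timesteps, (B,), device=z0.device)  # diffusion step
    eps = torch.randn_like(z0)                                         # target noise
    # Toy linear schedule for alpha_bar; swap in the base model's schedule in practice.
    alpha_bar = 1.0 - (t.float() + 1.0) / num_train_timesteps
    a = alpha_bar.view(B, 1, 1, 1, 1)
    z_t = a.sqrt() * z0 + (1.0 - a).sqrt() * eps                       # forward diffusion
    eps_pred = eps_theta(z_t, t, tau, cam)                             # conditioned on tau and c
    return torch.mean((eps_pred - eps) ** 2)                           # L2 on predicted noise

# Toy usage with a dummy predictor that ignores its conditioning:
dummy = lambda z_t, t, tau, cam: torch.zeros_like(z_t)
loss = denoising_loss(dummy, torch.randn(2, 81, 4, 8, 8), torch.rand(2, 81), torch.rand(2, 81, 6))
```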

Each DiT block integrates:

  • 3D self-attention over spatiotemporal tokens,
  • Parallel MLPs,
  • Two adaptive layer normalization branches (Time-AdaLN and Camera-AdaLN), independently modulating feature activations with respect to world-time and camera pose.

At inference, the user supplies:

  • A continuous world-time sequence $\tau = (\tau_0, \ldots, \tau_{F-1})$, determining the physical temporal dynamics of the output frames (e.g., enabling slow motion, reversal, or non-uniform time steps).
  • A corresponding camera trajectory sequence $c = (c_0, \ldots, c_{F-1})$, each $c_i$ a 6-DOF pose, encoded either as Plücker-ray embeddings $[u_i, v_i] \in \mathbb{R}^{2d_c}$ or as look-at parameters (center point, radius, azimuth, elevation), and subsequently summarized via a 2D convolutional encoder (a usage sketch follows this list).
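
As an illustration of the interface, the sketch below assembles a bullet-time request in NumPy: world time is frozen while the camera sweeps an azimuth range within the training distribution, using the look-at parameterization. All names and the dictionary layout are illustrative assumptions; the paper does not prescribe an API.

```python
import numpy as np

F = 81                                          # number of output frames, matching the training clips

# World-time sequence: a frozen instant yields the classic bullet-time effect;
# np.linspace(0.0, 1.0, F) would instead give ordinary forward playback.
tau = np.full(F, 0.5, dtype=np.float32)

# Camera trajectory in look-at parameters (center, radius, azimuth, elevation),
# orbiting a fixed target point.
center = np.zeros(3, dtype=np.float32)
radius = 6.0                                    # metres, within the dataset's 4-12 m range
azimuth = np.linspace(np.deg2rad(-60.0), np.deg2rad(60.0), F)
elevation = np.full(F, np.deg2rad(15.0))

cams = [
    {"center": center, "radius": radius, "azimuth": float(a), "elevation": float(e)}
    for a, e in zip(azimuth, elevation)
]
# (tau, cams) would then be supplied as the per-frame conditioning (tau_i, c_i).
```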

3. 4D Conditioning: Positional Encoding and Adaptive Normalization

BulletTime introduces two complementary mechanisms for 4D spatiotemporal conditioning:

3.1. 4D Rotary Positional Encoding (4D-RoPE)

Traditional rotary position embeddings (RoPE) are generalized to jointly encode continuous time and camera pose within the attention layer. Two block-diagonal operators are defined:

  • Time-RoPE: $D^{\mathrm{Time}}(\tau) = \mathrm{diag}\big(R(\theta_1 \tau), \ldots, R(\theta_{d'} \tau)\big)$ injects world time as a continuous positional signal, with $\theta_k = b^{-2(k-1)/d'}$.
  • Cam-RoPE: $D^{\mathrm{Cam}}(c_i, c_j) = \mathrm{diag}\big(R(\phi_1(c_i, c_j)), \ldots, R(\phi_{d'}(c_i, c_j))\big)$ injects camera geometry using learned linear-frequency embeddings of relative pose or ray angles.

The fused 4D operator is $D^{4D}_{ij} = D^{\mathrm{Time}}(\tau_i - \tau_j)\, D^{\mathrm{Cam}}(c_i, c_j)$. These operators modulate half of each query and key vector in the attention computation, yielding a time- and camera-aware attention mechanism that allows for explicit and differentiable control over both temporal and spatial dimensions.
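
A minimal PyTorch sketch of the time branch under these definitions: channel pairs of the queries and keys are rotated by the continuous angles $\theta_k \tau$, so their inner product depends on $\tau_i - \tau_j$ exactly as $D^{\mathrm{Time}}$ prescribes; the camera branch would rotate further channels by the learned angles $\phi_k(c_i, c_j)$ and is only indicated in a comment. Function names and shapes are assumptions.

```python
import torch

def rope_angles(tau, d_half, base=10000.0):
    """Continuous Time-RoPE angles theta_k * tau, with theta_k = base^(-2(k-1)/d')."""
    k = torch.arange(d_half // 2, dtype=tau.dtype, device=tau.device)
    theta = base ** (-2.0 * k / d_half)          # one frequency per rotated channel pair
    return tau[..., None] * theta                # (B, N, d_half // 2)

def apply_rope(x, angles):
    """Rotate consecutive channel pairs of x by the given angles (block-diagonal R)."""
    x1, x2 = x[..., 0::2], x[..., 1::2]
    cos, sin = angles.cos(), angles.sin()
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

# Toy usage: rotate the first half of the query/key channels by continuous world time.
B, N, d = 2, 81, 64                              # batch, tokens (frames), head dimension
q, k = torch.randn(B, N, d), torch.randn(B, N, d)
tau = torch.linspace(0.0, 1.0, N).expand(B, N)   # continuous world time per token
ang = rope_angles(tau, d // 2)
q = torch.cat([apply_rope(q[..., : d // 2], ang), q[..., d // 2 :]], dim=-1)
k = torch.cat([apply_rope(k[..., : d // 2], ang), k[..., d // 2 :]], dim=-1)
# q @ k.transpose(-1, -2) now depends on tau_i - tau_j through D^Time; a Cam-RoPE
# branch would analogously rotate channels by learned angles phi_k(c_i, c_j).
```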

3.2. Parallel Adaptive Layer Normalization (AdaLN)

Feature activations within each transformer block are modulated via two AdaLN pathways:

  • Time-AdaLN: Embeddings of $\tau_i$ are projected by two MLPs to produce scale ($\gamma^t_i$) and shift ($\beta^t_i$) vectors that modulate the layer-normalized activations.
  • Camera-AdaLN: Camera embeddings $c_{\mathrm{embed},i}$, produced by the 2D convolutional encoder, are similarly projected to ($\gamma^c_i$, $\beta^c_i$).

The two modulations are applied either sequentially or additively, depending on empirical stability, ensuring distinct and robust control signals along both axes; a sketch of the sequential variant follows.
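
A compact sketch of the two parallel AdaLN pathways under these definitions, shown with sequential application (the additive variant would sum the two modulations instead); module and argument names are assumptions.

```python
import torch
import torch.nn as nn

class DualAdaLN(nn.Module):
    """Time-AdaLN and Camera-AdaLN applied to the same layer-normalised activations."""

    def __init__(self, d_model, d_time, d_cam):
        super().__init__()
        self.norm = nn.LayerNorm(d_model, elementwise_affine=False)
        self.time_mlp = nn.Sequential(nn.SiLU(), nn.Linear(d_time, 2 * d_model))
        self.cam_mlp = nn.Sequential(nn.SiLU(), nn.Linear(d_cam, 2 * d_model))

    def forward(self, h, time_emb, cam_emb):
        # h: (B, N, d_model); time_emb: (B, N, d_time); cam_emb: (B, N, d_cam)
        g_t, b_t = self.time_mlp(time_emb).chunk(2, dim=-1)   # gamma^t_i, beta^t_i
        g_c, b_c = self.cam_mlp(cam_emb).chunk(2, dim=-1)     # gamma^c_i, beta^c_i
        h = self.norm(h)
        h = h * (1 + g_t) + b_t                               # Time-AdaLN modulation
        h = h * (1 + g_c) + b_c                               # Camera-AdaLN, applied sequentially
        return h

# Toy usage with per-frame time and camera embeddings:
block_norm = DualAdaLN(d_model=64, d_time=16, d_cam=32)
out = block_norm(torch.randn(2, 81, 64), torch.randn(2, 81, 16), torch.randn(2, 81, 32))
```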

4. Curated 4D-Controlled Training Dataset

The independent supervision of time and view is made possible by a bespoke dataset of approximately 20,000 synthetic videos generated with PointOdyssey and Blender. Each scene configuration encompasses:

  • 80 unique environments (60 outdoor HDR, 20 indoor layouts),
  • 100 human-like characters animated using mocap data,
  • For each scene: three camera trajectories (static, orbit, multi-waypoint) varying in radius (4–12 m), azimuth (≤ 75°), and elevation (≤ 30°), interpolated via smoothstep or linear paths,
  • Three temporal variants per scene, drawn from linear playback, slow motion, pausing, random time-warp, and monotonic spline warps.

Videos are annotated with per-frame tuples $(\tau_i, c_i)$ (81 frames at $384 \times 640$ resolution) and encoded into 3D-VAE latent tokens, supporting targeted and disentangled temporal/camera conditioning during training.
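
For concreteness, the per-frame supervision of one clip might be laid out as below; the field names and the specific trajectory are illustrative assumptions, not the paper's actual schema.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class FrameAnnotation:
    """One (tau_i, c_i) supervision tuple; camera pose given in look-at parameters."""
    tau: float              # continuous world time of the frame
    azimuth_deg: float
    elevation_deg: float
    radius_m: float

# One clip: 81 frames with linear world time and an orbiting camera kept within
# the dataset's stated ranges (azimuth <= 75 deg, elevation <= 30 deg, 4-12 m radius).
annotations: List[FrameAnnotation] = [
    FrameAnnotation(tau=i / 80.0,
                    azimuth_deg=-40.0 + i,       # sweeps 80 degrees of azimuth
                    elevation_deg=15.0,
                    radius_m=6.0)
    for i in range(81)
]
```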

5. Training Protocol and Evaluation Metrics

Training leverages CogVideoX-5B-T2V as the base, with BulletTime's 4D modules fine-tuned end-to-end. Key training characteristics include:

  • Optimizer: AdamW (learning rate $2 \times 10^{-5}$, weight decay $10^{-4}$, linear warm-up, cosine/linear decay),
  • Progressive schedule: initial lower-resolution training ($192 \times 320$), followed by full-resolution ($384 \times 640$) fine-tuning,
  • Denoising loss: $L_2$ distance on predicted noise,
  • Both time and camera conditioning modules are active from the onset of fine-tuning; a configuration sketch follows this list.
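
A minimal sketch of this optimizer and schedule setup in PyTorch (cosine-decay variant); the warm-up and total step counts and the placeholder module are assumptions.

```python
import torch

model = torch.nn.Linear(8, 8)   # placeholder for the DiT with its 4D conditioning modules
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5, weight_decay=1e-4)

warmup_steps, total_steps = 1_000, 100_000          # assumed step counts
warmup = torch.optim.lr_scheduler.LinearLR(
    optimizer, start_factor=0.01, total_iters=warmup_steps)
decay = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=total_steps - warmup_steps)
scheduler = torch.optim.lr_scheduler.SequentialLR(
    optimizer, schedulers=[warmup, decay], milestones=[warmup_steps])
```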

Evaluation on a 500-scene synthetic test set and 100 real-world ViPE videos comprises:

  • Synthetic: PSNR, SSIM, LPIPS.
  • Real-world: Rotational/translational camera error (vs. MegaSAM), VBench (Aesthetic, Imaging, Flicker, Smoothness, Consistency), FVD, KVD.

Comparative Quantitative Results

Method              PSNR ↑   SSIM ↑   LPIPS ↓
TrajectoryCrafter*  17.72    0.492    0.343
ReCamMaster*        21.86    0.585    0.185
BulletTime (Ours)   24.57    0.691    0.127

Method              RotErr ↓  TransErr ↓  Flicker ↑  Smooth ↑  FVD ↓   KVD ↓
Traj.Crafter*       5.44      3.31        0.966      0.988     2399    150.2
ReCamMaster*        2.98      1.85        0.976      0.991     2325    146.1
BulletTime          1.47      1.32        0.978      0.992     2292    139.1

Ablation experiments show that the Time-RoPE + Time-AdaLN combination outperforms alternatives (PSNR $\sim 32$ vs. $\sim 30$ without AdaLN), and that full 4D-RoPE brings a 1.5 dB gain over camera-only conditioning. Channel-additive or cross-attention based time control yields 2–4 dB lower PSNR than AdaLN.

6. Relation to Dynamic Neural Radiance Fields

Dynamic scene rendering with decoupled temporal and view control has also been explored in dynamic NeRF-based methods (Zhang et al., 2022). Such models encode spatial position, time, and camera direction as independent variables, jointly mapping $(x, y, z) \times t \times d$ to color and density, and employ explicit regularization (e.g., temporal-consistency losses, camera parameter optimization, SuperSloMo frame interpolation) to enforce separability.

BulletTime extends this disentanglement paradigm to latent video diffusion architectures, directly enabling video synthesis with continuous, arbitrary temporal and spatial conditioning at high fidelity and with improved flexibility compared to prior NeRF or video generation approaches.

7. Limitations and Directions for Future Research

BulletTime's reliance on synthetic training data introduces a distribution gap with respect to unconstrained real-world physics, lighting, and backgrounds. The non-autoregressive diffusion backbone currently limits the feasible sequence length and the ability to accommodate on-the-fly user adjustments. Notable failure modes include inaccurate hand articulation under challenging views and poor reconstruction of background regions not observed in the input trajectory.

Potential future advancements include fully autoregressive 4D diffusion, large-scale joint disentanglement training on real video corpora, and incorporation of physics-informed temporal priors (e.g., using velocity and acceleration constraints) to enhance realism and generalization (Wang et al., 4 Dec 2025).


Key references: "BulletTime: Decoupled Control of Time and Camera Pose for Video Generation" (Wang et al., 4 Dec 2025), "A Portable Multiscopic Camera for Novel View and Time Synthesis in Dynamic Scenes" (Zhang et al., 2022).
