- The paper introduces VideoJAM, a novel framework enforcing a joint appearance-motion representation to boost motion coherence in videos.
- The framework modifies the training objective to predict both pixels and optical flow, enabling dynamic Inner-Guidance during inference.
- Empirical results demonstrate state-of-the-art performance in motion generation without compromising visual quality.
The paper "VideoJAM: Joint Appearance-Motion Representations for Enhanced Motion Generation in Video Models" addresses a critical issue in generative video models—motion coherence. Despite the notable progress made in enhancing the visual appearance of generated videos, maintaining realistic and coherent motion has proven to be challenging. The authors identify a significant limitation in traditional video generation models: the bias towards pixel fidelity rather than motion coherence, largely due to the prevalent pixel reconstruction objective used during model training.
Key Contributions:
- VideoJAM Framework: The authors introduce VideoJAM, a novel framework designed to instill a robust motion prior within video generation models. This framework encourages the learning of a joint appearance-motion representation. VideoJAM is composed of two essential components:
- Training Phase: VideoJAM extends the training objective so the model predicts both pixels and their corresponding motion, yielding a joint appearance-motion representation. This requires only minimal architectural changes, chiefly two added linear layers that incorporate the motion signal (a minimal sketch appears after this list).
- Inference Phase (Inner-Guidance): During inference, the framework applies Inner-Guidance, a dynamic mechanism that leverages the model's own evolving motion predictions to steer generation toward coherent motion. In effect, it shifts the model's sampling distribution toward the learned joint appearance-motion distribution (a schematic sketch follows the training sketch below).
- Compatibility and Adaptability: Notably, VideoJAM can be applied to any video generation model, requiring no changes to the training data and no scaling of model size. This adaptability makes the framework useful across a wide range of settings.
- Performance Enhancement: Empirical evaluations show that VideoJAM achieves state-of-the-art motion coherence while also improving visual quality, rather than trading one off against the other, indicating that appearance and motion are complementary when modeled jointly.
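The following is a minimal PyTorch sketch of the training-phase idea: a pretrained denoiser is wrapped so it consumes a motion latent alongside the appearance latent and emits both predictions under one joint loss. All names (`motion_in`, `motion_out`, `backbone`, `head`) and the additive fusion are illustrative assumptions, not the paper's actual identifiers or architecture.

```python
import torch
import torch.nn as nn

class JointAppearanceMotionWrapper(nn.Module):
    """Sketch: extend a video denoiser with the two linear layers the paper
    describes, so it jointly predicts appearance and motion targets."""

    def __init__(self, base_model: nn.Module, latent_dim: int):
        super().__init__()
        self.base = base_model
        # Assumed roles: one layer folds the motion latent into the input
        # stream, one reads a motion prediction out of the shared features.
        self.motion_in = nn.Linear(latent_dim, latent_dim)
        self.motion_out = nn.Linear(latent_dim, latent_dim)

    def forward(self, x_t, flow_t, t):
        # x_t, flow_t: token-shaped latents (batch, tokens, latent_dim), assumed.
        # Additive fusion is an assumption; the paper may fuse differently.
        h = self.base.backbone(x_t + self.motion_in(flow_t), t)
        return self.base.head(h), self.motion_out(h)

def joint_loss(model, x_t, flow_t, t, x_target, flow_target, lam=1.0):
    """One objective over both modalities; `lam` balances the two terms."""
    pred_x, pred_flow = model(x_t, flow_t, t)
    appearance_term = torch.mean((pred_x - x_target) ** 2)
    motion_term = torch.mean((pred_flow - flow_target) ** 2)
    return appearance_term + lam * motion_term
```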
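And a schematic sketch of the Inner-Guidance step at inference. The combination below follows classifier-free-guidance arithmetic, treating the model's own motion prediction as an extra, dynamic condition; the exact drop pattern, guidance scales, and formulation in the paper may differ.

```python
import torch

def inner_guidance_step(model, x_t, flow_t, t, text_emb,
                        w_text=7.5, w_motion=3.0):
    """Schematic Inner-Guidance combination (scales and drop pattern assumed).

    The evolving motion prediction feeds back in as a condition; branches
    that drop a condition are steered away from, as in classifier-free
    guidance."""
    eps_full, flow_pred = model(x_t, flow_t, t, text_emb)  # all conditions
    eps_no_text, _ = model(x_t, flow_t, t, None)           # drop text
    eps_no_flow, _ = model(x_t, torch.zeros_like(flow_t), t, text_emb)  # drop motion

    eps = (eps_full
           + w_text * (eps_full - eps_no_text)
           + w_motion * (eps_full - eps_no_flow))
    # `flow_pred` becomes the motion condition for the next sampling step.
    return eps, flow_pred
```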
Experimental Insights and Methodology:
The authors support their claims with both qualitative and quantitative experiments. Notably, they find that conventional models are almost invariant to temporal perturbations (e.g., shuffled frame order) at early stages of generation, suggesting a lack of sensitivity to motion coherence. VideoJAM's joint representation increases the model's temporal sensitivity, facilitating coherent motion across generated sequences.
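A hypothetical probe in the spirit of this experiment: compare the model's predictions on an input and on a frame-shuffled copy across diffusion timesteps. Near-zero differences at high-noise (early) steps would indicate the insensitivity to temporal order described above. The model signature and latent layout are assumptions.

```python
import torch

@torch.no_grad()
def temporal_sensitivity(model, latents, timesteps):
    """How much does the prediction change when frame order is shuffled?

    latents: (batch, frames, ...) video latents; model(latents, t) is an
    assumed denoiser interface. Returns a per-timestep sensitivity score."""
    perm = torch.randperm(latents.shape[1])
    shuffled = latents[:, perm]  # same content, permuted frame order
    scores = {}
    for t in timesteps:
        diff = model(latents, t) - model(shuffled, t)
        scores[t] = diff.abs().mean().item()
    return scores
```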
- Optical Flow Representation: VideoJAM uses optical flow as its motion representation, which helps the model disentangle motion from appearance. The flow field is converted to an RGB format so it can be encoded by the same Temporal Auto-Encoder (TAE) used for video frames, integrating motion seamlessly into the model's learning process (see the conversion sketch after this list).
- Enhanced Training Objective: By extending the training objective to include motion prediction, VideoJAM encourages a balance between appearance fidelity and motion accuracy. The reformulated loss jointly penalizes errors in the predicted pixels and in the predicted motion.
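A sketch of the flow-to-RGB conversion mentioned above, using torchvision's standard flow visualization (direction maps to hue, magnitude to intensity) so an RGB-trained autoencoder can ingest the motion field. The rescaling and the `tae_encode` interface are assumptions; the paper's exact normalization may differ.

```python
import torch
from torchvision.utils import flow_to_image  # standard flow-to-RGB mapping

def encode_flow_as_rgb(flow, tae_encode):
    """Convert per-frame optical flow to RGB and encode it with a video
    autoencoder trained on RGB frames.

    flow: (T, 2, H, W) float tensor; tae_encode: assumed TAE encode function."""
    rgb = flow_to_image(flow)        # (T, 3, H, W), uint8 in [0, 255]
    rgb = rgb.float() / 127.5 - 1.0  # rescale to [-1, 1] like video frames
    return tae_encode(rgb)
```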
Conclusion:
Through VideoJAM, the authors make a compelling case for incorporating motion priors into video generation. They demonstrate that jointly modeling appearance and motion can significantly improve the quality of generated videos. The work is a meaningful step toward video generation that is faithful in both visual detail and dynamic realism, and because it avoids extensive architectural changes and additional data requirements, it is a versatile addition to existing and future video generation systems.