Unified Video Action Model (2503.00200v3)

Published 28 Feb 2025 in cs.RO and cs.CV

Abstract: A unified video and action model holds significant promise for robotics, where videos provide rich scene information for action prediction, and actions provide dynamics information for video prediction. However, effectively combining video generation and action prediction remains challenging, and current video generation-based methods struggle to match the performance of direct policy learning in action accuracy and inference speed. To bridge this gap, we introduce the Unified Video Action model (UVA), which jointly optimizes video and action predictions to achieve both high accuracy and efficient action inference. The key lies in learning a joint video-action latent representation and decoupling video-action decoding. The joint latent representation bridges the visual and action domains, effectively modeling the relationship between video and action sequences. Meanwhile, the decoupled decoding, powered by two lightweight diffusion heads, enables high-speed action inference by bypassing video generation during inference. Such a unified framework further enables versatile functionality through masked input training. By selectively masking actions or videos, a single model can tackle diverse tasks beyond policy learning, such as forward and inverse dynamics modeling and video generation. Via an extensive set of experiments, we demonstrate that UVA can serve as a general-purpose solution for a wide range of robotics tasks, such as policy learning, forward/inverse dynamics and video observation prediction, without compromising performance compared to methods tailored for specific applications. Results are best viewed on https://unified-video-action-model.github.io/.

Summary

  • The paper introduces a unified video-action latent representation that integrates visual context with dynamic action data for improved policy learning.
  • The paper proposes a decoupled diffusion approach that bypasses full video generation during inference, ensuring fast and accurate action prediction.
  • The paper employs masked training to flexibly handle various robotics tasks, including forward/inverse dynamics modeling and video prediction.

Introduction

In robotics, effectively integrating visual understanding and action execution is crucial. A "Unified Video Action Model" (UVA) (2503.00200) can significantly improve performance because videos provide rich scene information vital for action prediction, while actions reveal dynamic insights essential for video prediction. Tasks like policy learning, forward/inverse dynamics modeling, and video prediction benefit from the simultaneous modeling of these two modalities. However, current approaches often fall short, with some focusing solely on action prediction and missing valuable contextual information, while others use computationally expensive hierarchical video generation followed by action prediction, leading to slower inference and potential error propagation. UVA bridges this gap by learning a joint video-action latent representation. Its key feature is a decoupled decoding mechanism with lightweight diffusion heads for efficient action inference, bypassing video generation during inference, which is suitable for real-time robotic applications. This joint modeling creates a versatile framework applicable to diverse tasks.

Motivation and Background

The motivation behind the "Unified Video Action Model" (UVA) (2503.00200) stems from the limitations of current video generation-based methods in achieving comparable action accuracy and inference speed to direct policy learning approaches. Existing methods often struggle to combine video generation and action prediction effectively. Approaches focused solely on action prediction do not learn scene dynamics, while hierarchical video generation methods followed by action prediction suffer from slow inference and error propagation.

A unified video and action model is highly desirable because videos offer essential environmental context for action prediction. Conversely, actions reveal how interactions lead to visual changes, enabling a more accurate understanding of real-world dynamics. The model aims to capture the shared dynamics between visual and action domains. The main challenge is balancing the temporal speed needed for action modeling with the spatial resolution required for video generation.

UVA Methodology

The UVA model (2503.00200) addresses the limitations of existing approaches through three key design choices:

  1. Unified Latent Video-Action Representation: UVA learns a joint latent representation integrating visual and action data. Instead of a hierarchical video-then-action approach, UVA is trained simultaneously with supervision from both video and action data. This enables the model to capture the underlying dynamics shared between the visual and action domains while reducing computational overhead. The latent representation encodes rich scene information for precise action predictions. Historical image observations are processed through a pre-trained VAE encoder to obtain latent representations, which are flattened and projected into d-dimensional latent vectors, forming a sequence of visual tokens. Historical actions are sampled at a higher frequency and converted into action tokens with d-dimensional latent representations.

# Historical image observations (batch_size, timesteps, height, width, channels)
images = get_historical_images()
# Historical actions, sampled at a higher frequency (batch_size, action_timesteps, action_dim)
actions = get_historical_actions()

# Encode each frame with the pre-trained VAE encoder
vae_encoder = load_pretrained_vae_encoder()
latent_images = vae_encoder(images)  # (batch_size, timesteps, latent_h, latent_w, latent_c)

# Flatten the spatial latents into a token sequence and project to d dimensions
flattened_images = flatten(latent_images)    # (batch_size, timesteps * num_patches, latent_c)
image_tokens = project(flattened_images, d)  # (batch_size, timesteps * num_patches, d)

# Project actions into d-dimensional action tokens
action_tokens = project(actions, d)          # (batch_size, action_timesteps, d)

  2. Decoupled Video-Action Diffusion for Fast Inference: UVA decouples video generation from action prediction to enhance efficiency. Two lightweight diffusion heads are used during training to decode video observations and actions from the unified latent space. During inference, the model bypasses video generation and directly utilizes the latent representation for fast action prediction. This allows real-time policy deployment without sacrificing the rich representations learned during training from visual motion and robot action trajectories. The joint latent representation serves as the conditioning input for both the video and action diffusion decoders. The video diffusion decoder predicts individual patches in the video frame, which are reshaped and sent to the VAE decoder to reconstruct the full frame. The action diffusion decoder aggregates all latent tokens to produce an action latent, which encodes both visual and action-related information and serves as the condition for generating the action chunk.

# Joint video-action latent representation, conditioning both diffusion heads
unified_latent = get_unified_latent_representation(image_tokens, action_tokens)

# Video diffusion head: predicts per-patch latents, which are reshaped and passed
# to the VAE decoder to reconstruct full frames (needed only for the training loss)
if not is_inference:
  video_diffusion_decoder = load_video_diffusion_decoder()
  video_patches = video_diffusion_decoder(unified_latent)
  full_frame = reconstruct_frame(video_patches)

# Action diffusion head: aggregates the latent tokens into an action latent and
# generates the action chunk; at inference this is the only decoding step, so
# video generation is bypassed entirely
action_diffusion_decoder = load_action_diffusion_decoder()
action_chunk = action_diffusion_decoder(unified_latent)
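
To make the "lightweight diffusion head" concrete, the sketch below shows one way such a head could be realized: a small MLP denoiser for the action chunk conditioned on the joint latent, paired with a heavily simplified sampler. The architecture, the single-step update rule, and every name (DiffusionHead, sample_actions, the dimensions) are illustrative assumptions, not the paper's implementation.

import torch
import torch.nn as nn

class DiffusionHead(nn.Module):
    """Minimal conditional denoiser: predicts the noise on a flattened action chunk
    given the joint video-action latent as conditioning (illustrative only)."""
    def __init__(self, action_dim, chunk_len, cond_dim, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(action_dim * chunk_len + cond_dim + 1, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, action_dim * chunk_len),
        )

    def forward(self, noisy_actions, t, cond):
        # noisy_actions: (B, chunk_len * action_dim), t: (B, 1), cond: (B, cond_dim)
        return self.net(torch.cat([noisy_actions, t, cond], dim=-1))

@torch.no_grad()
def sample_actions(head, cond, chunk_len, action_dim, steps=50):
    """Toy reverse process: start from noise and iteratively denoise the action chunk.
    A real sampler would use the full DDPM/DDIM update; this is only a placeholder."""
    x = torch.randn(cond.shape[0], chunk_len * action_dim)
    for i in reversed(range(steps)):
        t = torch.full((cond.shape[0], 1), i / steps)
        x = x - head(x, t, cond) / steps  # simplified denoising step
    return x.view(-1, chunk_len, action_dim)

Because the head conditions only on the joint latent rather than on generated frames, sampling actions this way never touches the video decoder, which is what keeps inference fast.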

  3. Masked Training for Flexibility: UVA employs masked training to enable a diverse set of functionalities. By selectively masking actions or videos, a single model can perform various tasks beyond policy learning, including forward and inverse dynamics modeling and video generation. Unused components are masked and replaced with a learned mask token. Action and video losses are selectively applied to supervise the model depending on the specific task.
    • For forward dynamics modeling (predicting the next video frame), the action is provided, and the video is predicted. The video loss is applied.
    • For inverse dynamics modeling (predicting the action that led to a video change), both video frames are given, and the action is predicted. The action loss is applied.
    • For policy learning, the model predicts the action based on the current video frame; the action loss is applied.
    • For video prediction, the model predicts future video frames based on past frames, without any actions; the action inputs are masked and the video loss is applied.

# Masked training: choose which inputs to mask and which losses to apply per task
if task == "forward_dynamics":
  # actions and past frames provided; future frames predicted
  mask_actions = False
  mask_videos = False
  apply_action_loss = False
  apply_video_loss = True
elif task == "inverse_dynamics":
  # consecutive frames provided; connecting actions predicted
  mask_actions = False
  mask_videos = False
  apply_action_loss = True
  apply_video_loss = False
elif task == "policy_learning":
  # actions predicted from observations; video branch masked and unsupervised
  mask_actions = False
  mask_videos = True
  apply_action_loss = True
  apply_video_loss = False
elif task == "video_prediction":
  # future frames predicted from past frames; action inputs masked
  mask_actions = True
  mask_videos = False
  apply_action_loss = False
  apply_video_loss = True

# Replace masked inputs with the learned mask token
if mask_actions:
  actions = mask_input(actions)
if mask_videos:
  videos = mask_input(videos)

# Apply only the losses relevant to the current task (unused losses contribute zero)
action_loss = calculate_action_loss(predicted_actions, ground_truth_actions) if apply_action_loss else 0.0
video_loss = calculate_video_loss(predicted_videos, ground_truth_videos) if apply_video_loss else 0.0

# Optimize on the combined objective
optimize_model(action_loss, video_loss)

Performance and Advantages in Action Inference

UVA achieves performance advantages in action inference, mainly in speed and accuracy, by using a joint latent representation and decoupled decoding (2503.00200). The joint latent representation allows the model to capture the shared dynamics between visual and action domains. During inference, UVA bypasses video generation, directly using the latent representation for fast action prediction.

The efficiency stems from the decoupled decoding process, facilitated by lightweight diffusion heads. During training, two diffusion heads decode video observations and actions from the unified latent space. During inference, the action diffusion head directly predicts actions from the joint latent representation, skipping the computationally intensive video generation step.

UVA's action inference speed is comparable to action-only methods like Diffusion Policy while maintaining superior performance, especially in multi-task settings. Experiments show that UVA can achieve similar speeds to Diffusion Policy but with higher accuracy across various robotics tasks, indicating that UVA is more effective at learning and leveraging general dynamics shared across different tasks due to the unified latent representation.
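
As a rough illustration of how this plays out at deployment time, the sketch below shows a receding-horizon control loop that reuses the placeholder functions from the earlier pseudocode; task_done, execute, and the token dimension d are additional illustrative names, and the loop is a hedged sketch rather than the paper's actual runtime.

# Real-time deployment loop: only the action diffusion head runs online
while not task_done():
    images = get_historical_images()     # latest camera frames
    actions = get_historical_actions()   # recently executed actions

    image_tokens = project(flatten(vae_encoder(images)), d)
    action_tokens = project(actions, d)
    unified_latent = get_unified_latent_representation(image_tokens, action_tokens)

    # No video frames are generated here, which keeps per-step latency low
    action_chunk = action_diffusion_decoder(unified_latent)
    execute(action_chunk)                # run the chunk, then re-plan from new observations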

Versatile Functionality and Applications

The Unified Video Action model (UVA) excels in its versatile functionality, resulting from its masked training approach (2503.00200). This allows UVA to be a general-purpose solution for a wide array of robotics tasks. By selectively masking either actions or videos during training, the model can adapt to diverse challenges beyond policy learning. This adaptability is key to UVA's strength, allowing a single model to be used across different applications.

Specific applications include:

  • Policy Learning: UVA can train robot policies, effectively mapping visual observations to corresponding actions. The joint video-action latent representation ensures that the policy benefits from a rich understanding of the environment's dynamics. The fast action inference, achieved through decoupled decoding, allows for real-time policy deployment.
  • Forward Dynamics Modeling: UVA can predict future states (videos) based on current observations and actions. This predictive capability makes it a powerful tool for planning, allowing robots to anticipate the outcomes of their actions and make informed decisions. The forward dynamics are learned through the video generation component of UVA.
  • Inverse Dynamics Modeling: UVA can infer the actions that caused a change in the environment, given two consecutive video frames. This functionality is valuable for understanding the consequences of past actions and for imitation learning, where a robot learns to mimic observed behavior.
  • Video Prediction/Generation: UVA can generate future video frames conditioned on past observations and actions, allowing the model to simulate potential future scenarios, which can be useful for planning and risk assessment. Furthermore, UVA can function purely as a video model by only conditioning on past observations.
  • Combined Policy and Planner: UVA can potentially perform low-level control and high-level planning simultaneously. The policy learning component enables low-level control, while the forward dynamics modeling capability supports high-level planning by predicting future states.

The masked training approach is implemented by masking the unused components of the model and replacing them with a learned mask token. Action and video losses are selectively applied to supervise the model depending on the specific task, allowing UVA to adapt to different tasks without compromising performance, making it a versatile and efficient solution for a wide range of robotics applications.
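
As a sketch of what this looks like in practice, the snippet below routes the four task modes through a single trained model by choosing which inputs to mask; uva_model, MASK, and the output keys are illustrative assumptions rather than the paper's interface.

# One trained checkpoint, four roles, selected purely by masking (illustrative interface)
def run_uva(uva_model, past_frames, actions=None, future_frames=None, task="policy_learning"):
    MASK = uva_model.mask_token  # learned mask token standing in for absent inputs

    if task == "policy_learning":
        # observations in, action chunk out
        return uva_model(videos=past_frames, actions=MASK)["action_chunk"]
    if task == "forward_dynamics":
        # observations + actions in, predicted future frames out
        return uva_model(videos=past_frames, actions=actions)["future_frames"]
    if task == "inverse_dynamics":
        # consecutive frames in, connecting actions out
        return uva_model(videos=(past_frames, future_frames), actions=MASK)["action_chunk"]
    if task == "video_prediction":
        # past frames in, future frames out, no actions
        return uva_model(videos=past_frames, actions=MASK)["future_frames"]
    raise ValueError(f"unknown task: {task}")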

Experimental Results and Benchmarks

The "Unified Video Action Model" (UVA) (2503.00200) was evaluated on a diverse set of robotics tasks using publicly available benchmarks to assess its capabilities in policy learning, forward/inverse dynamics modeling, and video prediction.

The experimental results demonstrate that UVA matches or exceeds state-of-the-art baselines across these tasks and shows particularly strong performance in multi-task settings. This indicates that the model effectively learns and leverages general dynamics shared across different tasks, making it a versatile solution for various robotics applications. The reported results showcase UVA's ability to generalize and adapt to new situations, a critical aspect of real-world robotic systems. The abstract does not name the specific benchmarks, but the use of publicly available benchmarks points to standard datasets for robot manipulation and control.

Conclusion

The "Unified Video Action Model" (UVA) (2503.00200) represents a significant advancement in robotics by bridging the gap between video generation and policy learning. UVA achieves both high accuracy and efficient action inference through its joint video-action latent representation and decoupled decoding mechanism. Its versatility is demonstrated across various robotics tasks, including policy learning, forward/inverse dynamics modeling, and video prediction, without compromising performance compared to specialized methods.

Future directions include exploring more complex robotic environments, incorporating additional sensory modalities (e.g., tactile feedback, audio), and developing more sophisticated training techniques to improve the model's generalization capabilities further. The broader impact of unified video and action models in robotics lies in their potential to enable more intelligent, adaptable, and efficient robotic systems that can seamlessly interact with and learn from their environment. This unified approach paves the way for robots that can perform specific tasks with high precision and understand and adapt to new situations, ultimately leading to more robust and versatile robotic solutions.
