Overview of DreamForge: Motion-Aware Autoregressive Video Generation for Multi-View Driving Scenes
The paper "DreamForge: Motion-Aware Autoregressive Video Generation for Multi-View Driving Scenes" introduces a diffusion-based autoregressive video generation model designed to produce long, 3D-controllable street view videos. The goal is to enhance autonomous driving (AD) applications by synthesizing varied and realistic driving scenes that are geometrically and contextually accurate. The model, DreamForge, addresses ongoing challenges in the field, such as maintaining temporal coherence and generating long videos across multiple views.
Key Features and Methodology
DreamForge incorporates several advanced features:
- Flexible Control Conditions: DreamForge conditions generation on road layouts, 3D bounding boxes, and text descriptions, enabling customizable video generation. This control extends to weather, scene style, and the geometric layout of scene elements, improving the realism and applicability of the generated scenes.
- Perspective Guidance: DreamForge employs explicit perspective guidance by projecting road layouts and 3D bounding boxes into each camera view, which markedly improves the geometric and contextual accuracy of the generated scenes in the paper's comparative results (a projection sketch follows this list).
- Cross-View and Temporal Consistency: The architecture integrates cross-view attention and temporal coherence mechanisms. Cross-view attention keeps the surrounding camera views consistent with one another, while the autoregressive pipeline produces frames sequentially with the help of motion cues, enabling long-term video generation.
- Motion-Aware Autoregressive Generation: Motion-aware temporal attention keeps extended sequences coherent by computing temporal attention over motion features extracted from previously generated frames, conditioned on relative ego-motion cues (see the attention sketch after this list).
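To make the perspective-guidance idea concrete, the snippet below sketches how 3D bounding-box corners can be projected into a camera view with a standard pinhole model. The function name, matrix conventions, and shapes are illustrative assumptions, not DreamForge's actual code.

```python
# Minimal sketch: projecting 3D box corners into a camera view with a pinhole model.
# Assumes points lie in front of the camera (positive depth); names are illustrative.
import numpy as np

def project_box_to_image(corners_world: np.ndarray,
                         cam_from_world: np.ndarray,
                         intrinsics: np.ndarray) -> np.ndarray:
    """corners_world: (8, 3) box corners in world/ego coordinates.
    cam_from_world: (4, 4) extrinsic transform; intrinsics: (3, 3) K matrix.
    Returns (8, 2) pixel coordinates."""
    homo = np.hstack([corners_world, np.ones((8, 1))])   # (8, 4) homogeneous points
    cam = (cam_from_world @ homo.T)[:3]                   # (3, 8) points in the camera frame
    uv = intrinsics @ cam                                 # (3, 8) un-normalized pixel coords
    return (uv[:2] / uv[2:]).T                            # divide by depth -> (8, 2) pixels
```

Similarly, the following is a minimal sketch of what motion-aware temporal attention could look like: temporal attention whose keys and values include motion features from previous frames plus an embedding of relative ego-motion. The module layout, dimensions, and exact conditioning scheme are assumptions for illustration, not the paper's implementation.

```python
# Minimal sketch of motion-aware temporal attention (hypothetical shapes and names).
import torch
import torch.nn as nn

class MotionAwareTemporalAttention(nn.Module):
    """Temporal attention over frame latents, conditioned on motion features
    from previous frames and a relative ego-motion embedding."""

    def __init__(self, dim: int, ego_dim: int = 6, heads: int = 8):
        super().__init__()
        self.ego_proj = nn.Linear(ego_dim, dim)      # embed relative ego-motion (translation + rotation)
        self.motion_proj = nn.Linear(dim, dim)       # project motion features from prior frames
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x, prev_motion, ego_motion):
        # x:           (B, T, C)      latent tokens of the current clip
        # prev_motion: (B, T_prev, C) motion features from previously generated frames
        # ego_motion:  (B, ego_dim)   relative ego-motion between clips
        ego = self.ego_proj(ego_motion).unsqueeze(1)                # (B, 1, C)
        context = torch.cat([self.motion_proj(prev_motion) + ego, x], dim=1)
        out, _ = self.attn(self.norm(x), context, context)          # queries: current frames;
        return x + out                                              # keys/values include motion cues

if __name__ == "__main__":
    block = MotionAwareTemporalAttention(dim=64)
    frames, prev, ego = torch.randn(2, 8, 64), torch.randn(2, 4, 64), torch.randn(2, 6)
    print(block(frames, prev, ego).shape)  # torch.Size([2, 8, 64])
```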
Experimental Evaluation
The model's efficacy is evaluated on the nuScenes dataset, with a focus on generating multi-view driving scene videos with high fidelity and coherence. Quantitative metrics include Fréchet Inception Distance (FID) for per-frame image quality and Fréchet Video Distance (FVD) for video sequences, alongside downstream metrics such as mAP for object detection and mIoU for segmentation. DreamForge outperforms baseline models, particularly in generating geometrically and contextually accurate street views, as evidenced by improved segmentation and detection scores.
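For reference, FID compares the feature statistics of real and generated frames via the Fréchet distance. The sketch below shows that computation from pre-extracted features (e.g., Inception activations); it is generic metric code with assumed function names and shapes, not code from the paper.

```python
# Generic sketch of the Fréchet distance underlying FID, from pre-extracted features.
import numpy as np
from scipy import linalg

def frechet_distance(feats_real: np.ndarray, feats_fake: np.ndarray) -> float:
    """feats_*: (N, D) feature arrays for real and generated frames."""
    mu1, mu2 = feats_real.mean(axis=0), feats_fake.mean(axis=0)
    sigma1 = np.cov(feats_real, rowvar=False)
    sigma2 = np.cov(feats_fake, rowvar=False)
    covmean = linalg.sqrtm(sigma1 @ sigma2)       # matrix square root of the covariance product
    if np.iscomplexobj(covmean):                  # numerical noise can introduce tiny imaginary parts
        covmean = covmean.real
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))
```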
Implications and Future Work
DreamForge is a significant development in generating the realistic driving scenes that AD systems require. With its ability to model extended scenes consistently, it is a valuable tool for simulating the diverse driving environments needed for robust training and evaluation of AD models.
Future work may focus on integrating more complex environmental conditions or improving real-time performance. Expanding the approach to more diverse settings could further strengthen the adaptability of AD technology in real-world scenarios.
By generating controllable, consistent, and diverse multi-view driving scenes, DreamForge marks a notable advance in autonomous driving and synthetic environment generation. More realistic synthetic training data, in turn, supports the development of safer and more capable autonomous vehicles.