DynamicCity: Large-Scale 4D Occupancy Generation from Dynamic Scenes (2410.18084v2)
Abstract: Urban scene generation has been advancing rapidly, yet existing methods primarily focus on generating static, single-frame scenes, overlooking the inherently dynamic nature of real-world driving environments. In this work, we introduce DynamicCity, a novel 4D occupancy generation framework capable of generating large-scale, high-quality dynamic 4D scenes with semantics. DynamicCity consists of two key models. 1) A VAE model that learns a HexPlane as a compact 4D representation. Instead of naive averaging operations, DynamicCity employs a novel Projection Module to effectively compress 4D features into six 2D feature maps for HexPlane construction, significantly enhancing HexPlane fitting quality (up to a 12.56 mIoU gain). Furthermore, an Expansion & Squeeze Strategy reconstructs 3D feature volumes in parallel, which improves both training efficiency and reconstruction accuracy compared to naively querying each 3D point (up to a 7.05 mIoU gain, a 2.06x training speedup, and a 70.84% memory reduction). 2) A DiT-based diffusion model for HexPlane generation. To make the HexPlane feasible for DiT generation, a Padded Rollout Operation is proposed to reorganize all six feature planes of the HexPlane into a square 2D feature map. In particular, various conditions can be introduced in the diffusion or sampling process, supporting versatile 4D generation applications such as trajectory- and command-driven generation, inpainting, and layout-conditioned generation. Extensive experiments on the CarlaSC and Waymo datasets demonstrate that DynamicCity significantly outperforms existing state-of-the-art 4D occupancy generation methods across multiple metrics. The code and models have been released to facilitate future research.
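The Padded Rollout Operation described above can be illustrated with a small sketch: six HexPlane feature planes (three spatial, three spatio-temporal) are tiled into a single zero-padded square map so a standard 2D DiT can process them. The plane names, sizes, and packing layout below are illustrative assumptions, not the paper's exact specification.

```python
import numpy as np

# Hypothetical sizes: spatial resolution S, temporal length T, channels C.
S, T, C = 32, 8, 16

# The six HexPlane feature planes (shapes/names are assumptions for illustration).
rng = np.random.default_rng(0)
planes = {
    "xy": rng.random((C, S, S)),
    "xz": rng.random((C, S, S)),
    "yz": rng.random((C, S, S)),
    "tx": rng.random((C, T, S)),
    "ty": rng.random((C, T, S)),
    "tz": rng.random((C, T, S)),
}

def padded_rollout(planes, S, T, C):
    """Pack six planes into one square 2D map, zero-padding the gaps.

    Layout (one of many valid packings, chosen here for simplicity):
      row 0: xy | xz | yz   -> height S, width 3S
      row 1: tx | ty | tz   -> height T, width 3S
    The canvas is then a square of side max(3S, S + T).
    """
    side = max(3 * S, S + T)
    canvas = np.zeros((C, side, side), dtype=np.float64)
    for i, name in enumerate(["xy", "xz", "yz"]):
        canvas[:, :S, i * S:(i + 1) * S] = planes[name]
    for i, name in enumerate(["tx", "ty", "tz"]):
        canvas[:, S:S + T, i * S:(i + 1) * S] = planes[name]
    return canvas

rolled = padded_rollout(planes, S, T, C)
print(rolled.shape)  # (16, 96, 96) for S=32, T=8
```

The inverse operation (slicing the six regions back out of the generated square map) recovers the HexPlane after diffusion sampling; the padding regions carry no information and are simply discarded.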