
DynamicCity: Large-Scale 4D Occupancy Generation from Dynamic Scenes (2410.18084v2)

Published 23 Oct 2024 in cs.CV and cs.RO

Abstract: Urban scene generation has been developing rapidly in recent years. However, existing methods primarily focus on generating static and single-frame scenes, overlooking the inherently dynamic nature of real-world driving environments. In this work, we introduce DynamicCity, a novel 4D occupancy generation framework capable of generating large-scale, high-quality dynamic 4D scenes with semantics. DynamicCity consists of two key models. 1) A VAE model for learning HexPlane as the compact 4D representation. Instead of using naive averaging operations, DynamicCity employs a novel Projection Module to effectively compress 4D features into six 2D feature maps for HexPlane construction, which significantly enhances HexPlane fitting quality (up to 12.56 mIoU gain). Furthermore, we utilize an Expansion & Squeeze Strategy to reconstruct 3D feature volumes in parallel, which improves both network training efficiency and reconstruction accuracy compared to naively querying each 3D point (up to 7.05 mIoU gain, 2.06x training speedup, and 70.84% memory reduction). 2) A DiT-based diffusion model for HexPlane generation. To make HexPlane feasible for DiT generation, a Padded Rollout Operation is proposed to reorganize all six feature planes of the HexPlane as a square 2D feature map. In particular, various conditions can be introduced into the diffusion or sampling process, supporting versatile 4D generation applications, such as trajectory- and command-driven generation, inpainting, and layout-conditioned generation. Extensive experiments on the CarlaSC and Waymo datasets demonstrate that DynamicCity significantly outperforms existing state-of-the-art 4D occupancy generation methods across multiple metrics. The code and models have been released to facilitate future research.


Summary

  • The paper introduces a framework that encodes dynamic 4D semantic occupancy scenes into a compact HexPlane representation using a Variational Autoencoder, with a learned Projection Module (up to 12.56 mIoU gain over naive averaging) and an Expansion & Squeeze Strategy (up to 7.05 mIoU gain, a 2.06x training speedup, and a 70.84% memory reduction).
  • It employs a Diffusion Transformer whose Padded Rollout Operation packs the six HexPlane feature planes into a single square 2D map, letting the model capture complex spatial-temporal relationships during generation.
  • The approach enhances high-fidelity scene generation for autonomous driving and robotics, setting a new standard for modeling dynamic real-world environments.

Overview of "DynamicCity: Large-Scale LiDAR Generation from Dynamic Scenes"

The paper "DynamicCity: Large-Scale LiDAR Generation from Dynamic Scenes" introduces a novel framework for generating large-scale, high-quality 4D LiDAR scenes. This work primarily focuses on overcoming the limitations of existing models, which are often restricted to static or single-frame scenes, by capturing the dynamic nature and temporal evolution present in real-world driving environments.

Key Components

DynamicCity Framework: The framework's core contributions are twofold:

  1. Variational Autoencoder (VAE) for 4D Representation: A VAE encodes dynamic semantic occupancy scenes into a compact 4D representation known as HexPlane, which consists of six 2D feature maps. A Projection Module compresses the 4D features onto these planes, and an Expansion & Squeeze Strategy reconstructs the 3D feature volumes in parallel, yielding substantial improvements in training speed, reconstruction accuracy, and memory efficiency.
  2. Diffusion Transformer (DiT) for HexPlane Generation: To generate HexPlanes, a DiT-based framework is used. Its Padded Rollout Operation reorganizes the six feature planes into a single square 2D feature map, allowing the model to capture intricate spatial and temporal relationships and thereby enhancing generation quality (a minimal sketch of both the projection and the rollout follows this list).
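
To make these two operations concrete, below is a minimal, illustrative sketch (not the authors' code): an attention-weighted pooling stands in for the Projection Module that collapses a 4D feature volume onto the six HexPlane planes, and a simple zero-padded tiling stands in for the Padded Rollout Operation that packs those planes into one 2D canvas for an image-style DiT. All tensor shapes, channel counts, and the specific pooling and tiling schemes are assumptions for illustration only.

```python
# Sketch of (a) a learned projection of a 4D feature volume onto the six
# HexPlane feature maps and (b) a padded-rollout-style packing of those maps
# into one 2D canvas. Shapes, pooling scheme, and tiling layout are
# illustrative assumptions, not the paper's exact design.
import torch
import torch.nn as nn


class LearnedAxisPool(nn.Module):
    """Collapse one axis with softmax attention weights instead of naive averaging."""

    def __init__(self, channels: int):
        super().__init__()
        self.score = nn.Linear(channels, 1)

    def forward(self, feats: torch.Tensor, dim: int) -> torch.Tensor:
        # feats: channel-last tensor; `dim` is the axis to collapse.
        weights = torch.softmax(self.score(feats), dim=dim)  # per-position weights
        return (weights * feats).sum(dim=dim)                # weighted pooling


def to_hexplane(volume: torch.Tensor, pools: nn.ModuleList) -> list[torch.Tensor]:
    """volume: (C, T, X, Y, Z) -> six 2D feature maps, one per axis pair."""
    v = volume.permute(1, 2, 3, 4, 0)  # (T, X, Y, Z, C), channel-last
    axis_pairs = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]  # TX, TY, TZ, XY, XZ, YZ
    planes = []
    for keep in axis_pairs:
        x = v
        for d in sorted((d for d in range(4) if d not in keep), reverse=True):
            x = pools[d](x, dim=d)         # collapse higher axes first so indices stay valid
        planes.append(x.permute(2, 0, 1))  # back to (C, H, W)
    return planes


def padded_rollout(planes: list[torch.Tensor], tile: int) -> torch.Tensor:
    """Zero-pad each plane to a common tile size and lay the six tiles out on
    one canvas so a 2D diffusion transformer can process them jointly."""
    c = planes[0].shape[0]
    canvas = planes[0].new_zeros(c, 3 * tile, 2 * tile)  # illustrative 3x2 layout
    for idx, p in enumerate(planes):
        row, col = divmod(idx, 2)
        h, w = p.shape[1:]
        canvas[:, row * tile: row * tile + h, col * tile: col * tile + w] = p
    return canvas


# Toy usage: 8 channels, 4 frames, a 16^3 voxel grid.
pools = nn.ModuleList([LearnedAxisPool(8) for _ in range(4)])
volume = torch.randn(8, 4, 16, 16, 16)
planes = to_hexplane(volume, pools)       # six maps, e.g. (8, 4, 16) and (8, 16, 16)
canvas = padded_rollout(planes, tile=16)  # (8, 48, 32) input for a DiT-style model
```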

Numerical Results

The framework demonstrates significant advancements over state-of-the-art methods. On the CarlaSC and Waymo datasets, DynamicCity achieves superior 4D reconstruction and generation performance: the Projection Module improves HexPlane fitting by up to 12.56 mIoU over naive averaging, and the Expansion & Squeeze Strategy adds up to 7.05 mIoU while delivering a 2.06x training speedup and a 70.84% memory reduction. The framework also accepts various conditions during diffusion or sampling, enabling applications such as trajectory- and command-driven generation, layout-conditioned generation, and dynamic scene inpainting; one standard way to inject such a condition at sampling time is sketched below.
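
As a hedged illustration of how a condition can be injected at sampling time, the sketch below uses classifier-free guidance, a standard conditioning mechanism for diffusion models, to blend conditional and unconditional noise predictions. The paper's actual conditioning interface (trajectory, command, or layout inputs to the DiT) may differ, and the `denoiser` signature here is an assumption.

```python
# Illustrative classifier-free-guidance step for condition-driven sampling.
# `denoiser` is any noise-prediction network that accepts an optional
# condition (e.g. a trajectory or command embedding); this interface is an
# assumption, not the paper's API.
from typing import Optional

import torch


def guided_noise(denoiser, x_t: torch.Tensor, t: torch.Tensor,
                 cond: Optional[torch.Tensor], guidance_scale: float = 4.0) -> torch.Tensor:
    eps_uncond = denoiser(x_t, t, None)  # unconditional prediction
    if cond is None:
        return eps_uncond
    eps_cond = denoiser(x_t, t, cond)    # condition-aware prediction
    # Push the sample toward the conditional prediction.
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```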

Implications and Future Research

From a practical perspective, DynamicCity has the potential to enhance applications in autonomous driving and robotic navigation by providing high-fidelity dynamic scenes that better reflect real-world conditions. On the theoretical side, the work advances our understanding of how dynamic environments can be efficiently modeled and represented, paving the way for future research on high-dimensional data representation.

Looking forward, the framework's adaptability suggests its use could extend to other domains requiring dynamic spatial-temporal data generation. Future developments might focus on further improving the model's efficiency and exploring its integration with real-time data processing systems.

Conclusion

DynamicCity represents a significant advancement in the field of 4D LiDAR scene generation, offering a robust solution to the challenges of modeling dynamic environments. Through its innovative use of VAE and DiT, combined with HexPlane's efficient representation, DynamicCity sets a new standard for scene generation in complex driving scenarios. The open release of the code promises to facilitate further research and development, fostering continued innovation in the field.
