X-Drive: Cross-modality consistent multi-sensor data synthesis for driving scenarios (2411.01123v1)

Published 2 Nov 2024 in cs.CV

Abstract: Recent advancements have exploited diffusion models to synthesize either LiDAR point clouds or camera image data in driving scenarios. Despite their success in modeling the marginal distribution of single-modality data, the mutual dependence between modalities needed to describe complex driving scenes remains under-explored. To fill this gap, we propose a novel framework, X-DRIVE, that models the joint distribution of point clouds and multi-view images via a dual-branch latent diffusion model architecture. Given the distinct geometric spaces of the two modalities, X-DRIVE conditions the synthesis of each modality on corresponding local regions of the other, ensuring better alignment and realism. To further handle spatial ambiguity during denoising, we design a cross-modality condition module based on epipolar lines that adaptively learns cross-modality local correspondence. In addition, X-DRIVE allows controllable generation through multi-level input conditions, including text, bounding boxes, images, and point clouds. Extensive results demonstrate that X-DRIVE synthesizes high-fidelity point clouds and multi-view images that adhere to the input conditions while maintaining reliable cross-modality consistency. Our code will be made publicly available at https://github.com/yichen928/X-Drive.
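
The abstract describes a cross-modality condition module in which each local region of one modality attends to features of the other modality sampled along the corresponding epipolar line, resolving spatial ambiguity during denoising. Below is a minimal PyTorch sketch of that idea, not the authors' released implementation: the class name EpipolarCrossAttention, the tensor shapes, and the epipolar_idx lookup are all illustrative assumptions, and the random index table stands in for the actual epipolar-geometry computation.

```python
# Hypothetical sketch of epipolar-guided cross-modality conditioning
# (illustrative only; see https://github.com/yichen928/X-Drive for the
# authors' code once released).
import torch
import torch.nn as nn

class EpipolarCrossAttention(nn.Module):
    """Attend from image-branch latent queries to LiDAR-branch tokens
    sampled along each query's (precomputed) epipolar line, so every
    image region is conditioned only on geometrically plausible
    LiDAR regions."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, img_tokens, lidar_feats, epipolar_idx):
        # img_tokens:   (B, Nq, C) image-branch latent tokens (queries)
        # lidar_feats:  (B, Nk, C) LiDAR-branch latent tokens
        # epipolar_idx: (B, Nq, K) indices of the K LiDAR tokens lying
        #               on each query's epipolar line
        B, Nq, K = epipolar_idx.shape
        C = lidar_feats.size(-1)
        # Gather the K candidate LiDAR tokens per query.
        idx = epipolar_idx.reshape(B, Nq * K, 1).expand(-1, -1, C)
        kv = torch.gather(lidar_feats, 1, idx).reshape(B * Nq, K, C)
        q = img_tokens.reshape(B * Nq, 1, C)
        # Attention weights adaptively pick the correspondence along
        # the line ("adaptively learn cross-modality local correspondence").
        out, _ = self.attn(q, kv, kv)
        return out.reshape(B, Nq, C)

# Toy usage for one denoising step's worth of conditioning.
B, Nq, Nk, K, C = 2, 64, 256, 8, 128
module = EpipolarCrossAttention(C)
img = torch.randn(B, Nq, C)
lidar = torch.randn(B, Nk, C)
idx = torch.randint(0, Nk, (B, Nq, K))   # stand-in for epipolar lookup
cond = module(img, lidar, idx)           # (B, Nq, C) conditioned tokens
```

In the paper's dual-branch setup the symmetric direction (LiDAR queries attending to image tokens) would presumably use the same mechanism, with the conditioned tokens injected into each branch's denoising network.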
