
360DVD: Controllable Panorama Video Generation with 360-Degree Video Diffusion Model (2401.06578v2)

Published 12 Jan 2024 in cs.CV

Abstract: Panoramic video has recently attracted growing interest in both research and applications thanks to its immersive viewing experience. Because capturing 360-degree panoramic video is expensive, generating desirable panoramic videos from text prompts is in high demand. Recently, emerging text-to-video (T2V) diffusion methods have demonstrated notable effectiveness in standard video generation. However, due to the significant gap in content and motion patterns between panoramic and standard videos, these methods struggle to produce satisfactory 360-degree panoramic videos. In this paper, we propose a pipeline named 360-Degree Video Diffusion model (360DVD) for generating 360-degree panoramic videos from given prompts and motion conditions. Specifically, we introduce a lightweight 360-Adapter accompanied by 360 Enhancement Techniques to adapt pre-trained T2V models for panoramic video generation. We further propose a new panoramic dataset named WEB360, consisting of panoramic video-text pairs, for training 360DVD, addressing the absence of captioned panoramic video datasets. Extensive experiments demonstrate the superiority and effectiveness of 360DVD for panoramic video generation. Our project page is at https://akaneqwq.github.io/360DVD/.
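The abstract describes attaching a lightweight, trainable 360-Adapter to a frozen pre-trained T2V model, conditioned on motion signals. Below is a minimal PyTorch sketch of that general pattern: a small condition encoder produces multi-scale residual features (in the style of T2I-Adapter, reference 28 in the paper's bibliography) that would be added to the frozen UNet's encoder features. All class names, channel widths, and the optical-flow-style condition format are illustrative assumptions, not the paper's actual design.

```python
import torch
import torch.nn as nn

class AdapterBlock(nn.Module):
    """Downsamples the condition map and produces one residual feature."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1),
            nn.SiLU(),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
        )

    def forward(self, x):
        return self.conv(x)

class PanoramaAdapter(nn.Module):
    """Hypothetical 360-Adapter: maps per-frame motion conditions (e.g.,
    flow maps in equirectangular projection) to multi-scale residuals
    intended for a frozen T2V UNet's encoder. Channel widths are assumed."""
    def __init__(self, cond_ch: int = 2, widths=(64, 128, 256)):
        super().__init__()
        chans = [cond_ch, *widths]
        self.blocks = nn.ModuleList(
            AdapterBlock(chans[i], chans[i + 1]) for i in range(len(widths))
        )

    def forward(self, cond):
        # cond: (batch * frames, cond_ch, H, W), with W = 2H for panoramas
        feats = []
        x = cond
        for block in self.blocks:
            x = block(x)
            feats.append(x)  # one residual per UNet encoder scale
        return feats

# Usage sketch: only the adapter trains; the T2V backbone stays frozen.
adapter = PanoramaAdapter()
cond = torch.randn(4, 2, 256, 512)  # 4 frames, 2-channel flow, 2:1 pano
for r in adapter(cond):
    print(r.shape)  # (4, 64, 128, 256), (4, 128, 64, 128), (4, 256, 32, 64)
```

The key design point this mirrors is that only the small adapter receives gradients, so the pre-trained T2V model's generative prior is preserved while panorama-specific content and motion conditioning are learned cheaply.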
