
The Surprising Effectiveness of Diffusion Models for Optical Flow and Monocular Depth Estimation (2306.01923v2)

Published 2 Jun 2023 in cs.CV

Abstract: Denoising diffusion probabilistic models have transformed image generation with their impressive fidelity and diversity. We show that they also excel in estimating optical flow and monocular depth, surprisingly, without task-specific architectures and loss functions that are predominant for these tasks. Compared to the point estimates of conventional regression-based methods, diffusion models also enable Monte Carlo inference, e.g., capturing uncertainty and ambiguity in flow and depth. With self-supervised pre-training, the combined use of synthetic and real data for supervised training, technical innovations (infilling and step-unrolled denoising diffusion training) to handle noisy-incomplete training data, and a simple form of coarse-to-fine refinement, one can train state-of-the-art diffusion models for depth and optical flow estimation. Extensive experiments focus on quantitative performance against benchmarks, ablations, and the model's ability to capture uncertainty and multimodality, and impute missing values. Our model, DDVM (Denoising Diffusion Vision Model), obtains a state-of-the-art relative depth error of 0.074 on the indoor NYU benchmark and an Fl-all outlier rate of 3.26% on the KITTI optical flow benchmark, about 25% better than the best published method. For an overview see https://diffusion-vision.github.io.
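The abstract's point about Monte Carlo inference is that a generative model can be sampled repeatedly for the same input, and the spread across samples serves as a per-pixel uncertainty estimate, something a single regression forward pass cannot provide. The sketch below illustrates only that summary step; `sample_depth` is a hypothetical stand-in (not DDVM's actual sampler, which would iteratively denoise from Gaussian noise conditioned on the RGB image):

```python
import numpy as np

def sample_depth(rgb, rng):
    # Hypothetical stand-in for one reverse-diffusion pass of a
    # depth model: returns a plausible depth map plus sample noise.
    h, w, _ = rgb.shape
    base = np.linspace(1.0, 5.0, h)[:, None] * np.ones((1, w))
    return base + 0.1 * rng.standard_normal((h, w))

def monte_carlo_depth(rgb, n_samples=8, seed=0):
    """Draw several samples for one image and summarize them per pixel."""
    rng = np.random.default_rng(seed)
    samples = np.stack([sample_depth(rgb, rng) for _ in range(n_samples)])
    mean = samples.mean(axis=0)  # point estimate (comparable to regression)
    std = samples.std(axis=0)    # per-pixel uncertainty / ambiguity
    return mean, std

rgb = np.zeros((4, 6, 3))        # dummy 4x6 RGB image
mean, std = monte_carlo_depth(rgb)
print(mean.shape, std.shape)     # (4, 6) (4, 6)
```

High `std` regions would flag ambiguous pixels (e.g., reflective surfaces or occlusion boundaries), which is the kind of multimodality the paper's experiments probe.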
