What Matters When Repurposing Diffusion Models for General Dense Perception Tasks? (2403.06090v4)
Abstract: Extensive pre-training on large data is indispensable for downstream geometric and semantic visual perception tasks. Thanks to large-scale text-to-image (T2I) pre-training, recent works show promising results by simply fine-tuning T2I diffusion models for dense perception tasks. However, several crucial design decisions in this process still lack comprehensive justification, including the necessity of the multi-step stochastic diffusion mechanism, the training strategy, the inference ensemble strategy, and the fine-tuning data quality. In this work, we conduct a thorough investigation into the critical factors that affect transfer efficiency and performance when using diffusion priors. Our key findings are: 1) High-quality fine-tuning data is paramount for both semantic and geometric perception tasks. 2) The stochastic nature of diffusion models has a slightly negative impact on deterministic visual perception tasks. 3) Beyond fine-tuning the diffusion model with only latent-space supervision, task-specific image-level supervision is beneficial for enhancing fine-grained details. These observations culminate in GenPercept, an effective deterministic one-step fine-tuning paradigm tailored for dense visual perception tasks. Unlike previous multi-step methods, our paradigm achieves much faster inference and can be seamlessly integrated with customized perception decoders and loss functions for image-level supervision, which is critical to improving the fine-grained details of predictions. Comprehensive experiments on diverse dense visual perception tasks, including monocular depth estimation, surface normal estimation, image segmentation, and matting, demonstrate the remarkable adaptability and effectiveness of our proposed method.
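To make the inference-cost contrast concrete, here is a minimal, library-free sketch (not the paper's implementation; `denoiser`, `multi_step_inference`, and `one_step_inference` are hypothetical stand-ins) of why a one-step deterministic paradigm is much faster than multi-step stochastic diffusion: the conventional scheme starts from random noise and calls the denoising network once per step, while the one-step scheme maps the image latent to the perception output in a single deterministic forward pass.

```python
import random

def denoiser(x, t):
    """Stand-in stub for a fine-tuned diffusion U-Net forward pass."""
    denoiser.calls += 1
    # Pretend to remove a fraction of the remaining noise.
    return [v * 0.5 for v in x]

denoiser.calls = 0

def multi_step_inference(image_latent, steps=50, seed=0):
    """Conventional diffusion inference: start from random Gaussian noise
    and iteratively denoise, conditioned on the input image latent."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in image_latent]
    for t in reversed(range(steps)):
        x = denoiser([xi + ci for xi, ci in zip(x, image_latent)], t)
    return x

def one_step_inference(image_latent):
    """One-step deterministic inference in the spirit of GenPercept:
    a single forward pass, no sampled noise, so the output is repeatable
    and amenable to image-level supervision during training."""
    return denoiser(image_latent, t=0)

latent = [0.2, -0.4, 0.8]

denoiser.calls = 0
multi_step_inference(latent, steps=50)
multi_calls = denoiser.calls  # 50 network evaluations

denoiser.calls = 0
out_a = one_step_inference(latent)
one_calls = denoiser.calls    # 1 network evaluation

denoiser.calls = 0
out_b = one_step_inference(latent)  # deterministic: identical output
```

The sketch only counts forward passes; in practice the ~50x reduction in network evaluations, together with the removal of sampling noise, is what the abstract refers to as faster, deterministic inference.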