LMD: Faster Image Reconstruction with Latent Masking Diffusion (2312.07971v1)
Abstract: As a class of fruitful approaches, diffusion probabilistic models (DPMs) have shown excellent advantages in high-resolution image reconstruction. Masked autoencoders (MAEs), as popular self-supervised vision learners, have meanwhile demonstrated simpler and more effective image reconstruction and transfer capabilities on downstream tasks. However, both incur extremely high training costs, either from inherently high temporal dependence (i.e., excessively long diffusion steps) or from artificially low spatial dependence (i.e., a manually fixed high mask ratio, such as 0.75). To this end, this paper presents LMD, a faster image reconstruction framework with latent masking diffusion. First, we propose to project and reconstruct images in latent space through a pre-trained variational autoencoder, which is theoretically more efficient than operating in pixel space. Then, we combine the advantages of MAEs and DPMs to design a progressive masking diffusion model, which gradually increases the masking proportion using three different schedulers and reconstructs the latent features from simple to difficult, without sequentially performing denoising diffusion as in DPMs or using a fixed high masking ratio as in MAEs, thereby alleviating the heavy training time consumption. Our approach enables learning high-capacity models, accelerates their training (by 3x or more), and barely reduces the original accuracy. Inference speed on downstream tasks also significantly outperforms previous approaches.
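To make the progressive-masking idea concrete, below is a minimal PyTorch sketch of a masking-ratio scheduler and a masked latent-reconstruction training step. The specific schedules (linear, cosine, exponential), the start/end ratios, and the helper names `mask_ratio` and `masked_reconstruction_step` are illustrative assumptions, not the paper's actual implementation; the latents are assumed to come from a frozen pre-trained VAE encoder as described in the abstract.

```python
# Illustrative sketch only: a progressive masking-ratio scheduler and one
# masked latent-reconstruction step. Scheduler names and hyperparameters
# are assumptions; the paper's actual schedules and model may differ.
import math
import torch
import torch.nn as nn


def mask_ratio(step: int, total_steps: int, kind: str = "linear",
               start: float = 0.15, end: float = 0.75) -> float:
    """Progressively increase the masking ratio from `start` to `end`."""
    t = min(step / max(total_steps - 1, 1), 1.0)
    if kind == "linear":
        return start + (end - start) * t
    if kind == "cosine":
        return start + (end - start) * (1 - math.cos(math.pi * t)) / 2
    if kind == "exponential":
        return start + (end - start) * (math.exp(3 * t) - 1) / (math.e ** 3 - 1)
    raise ValueError(f"unknown scheduler: {kind}")


def masked_reconstruction_step(model: nn.Module,
                               latents: torch.Tensor,
                               ratio: float) -> torch.Tensor:
    """Mask a fraction of latent tokens and reconstruct them.

    `latents`: (batch, num_tokens, dim) tensor, assumed to be produced by a
    frozen pre-trained VAE encoder applied to the input images.
    """
    b, n, _ = latents.shape
    num_masked = int(n * ratio)
    # Randomly select which latent tokens to mask for each sample.
    noise = torch.rand(b, n, device=latents.device)
    mask_idx = noise.argsort(dim=1)[:, :num_masked]
    mask = torch.zeros(b, n, dtype=torch.bool, device=latents.device)
    mask.scatter_(1, mask_idx, True)

    corrupted = latents.masked_fill(mask.unsqueeze(-1), 0.0)
    pred = model(corrupted)
    # Compute the reconstruction loss only on masked positions,
    # as in masked autoencoding.
    loss = ((pred - latents) ** 2)[mask].mean()
    return loss
```

In a full training loop, `ratio` would be queried from `mask_ratio` at each step (or epoch), so the reconstruction task grows from easy (few masked tokens) to hard (many masked tokens), mirroring the simple-to-difficult progression described in the abstract.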
Authors: Zhiyuan Ma, Zhihuan Yu, Jianjun Li, Bowen Zhou