Progressive Text-to-Image Generation (2210.02291v5)

Published 5 Oct 2022 in cs.CV

Abstract: Recently, Vector Quantized AutoRegressive (VQ-AR) models have shown remarkable results in text-to-image synthesis by predicting discrete image tokens one by one, with equal importance, from the top left to the bottom right of the latent space. Although this simple generative process works surprisingly well, is it the best way to generate an image? Humans, for instance, tend to create an image from its outline to its fine details, whereas VQ-AR models assign no relative importance to image patches. In this paper, we present a progressive model for high-fidelity text-to-image generation. The proposed method creates new image tokens from coarse to fine, conditioned on the existing context and in a parallel manner, and this procedure is applied recursively with the proposed error-revision mechanism until the image token sequence is complete. The resulting coarse-to-fine hierarchy makes the image generation process intuitive and interpretable. Extensive experiments on the MS COCO benchmark demonstrate that the progressive model achieves significantly better FID scores than the previous VQ-AR method across a wide variety of categories and aspects. Moreover, the parallel generation at each step yields more than $13\times$ inference acceleration with only a slight loss in quality.
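The abstract sketches the decoding procedure only at a high level: image tokens are committed coarse to fine, many positions in parallel per step, with an error-revision pass that re-opens doubtful tokens so later steps can correct them. The Python sketch below is a minimal illustration of that control flow under our own assumptions; the confidence-based schedule, codebook size, grid size, and revision rule (and names such as `predict_token_logits` and `revise_fraction`) are hypothetical placeholders, not the authors' implementation.

```python
# Illustrative sketch of coarse-to-fine parallel decoding with error revision.
# All constants and the model stub are assumptions, not taken from the paper.
import numpy as np

rng = np.random.default_rng(0)

VOCAB_SIZE = 1024   # size of the VQ codebook (assumed)
SEQ_LEN = 256       # number of latent image tokens, e.g. a 16x16 grid (assumed)
NUM_STEPS = 8       # number of progressive refinement steps (assumed)
MASK = -1           # placeholder id for positions not yet generated


def predict_token_logits(tokens, text_prompt):
    """Stand-in for the text-conditioned transformer.

    A real model would attend over the prompt and the tokens generated so far;
    random logits are returned here only so the control flow is runnable.
    """
    return rng.standard_normal((len(tokens), VOCAB_SIZE))


def progressive_generate(text_prompt, revise_fraction=0.1):
    tokens = np.full(SEQ_LEN, MASK, dtype=np.int64)
    for step in range(NUM_STEPS):
        logits = predict_token_logits(tokens, text_prompt)
        probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
        probs /= probs.sum(axis=-1, keepdims=True)
        candidates = probs.argmax(axis=-1)   # per-position greedy token ids
        confidence = probs.max(axis=-1)      # per-position confidence

        # Coarse-to-fine schedule: commit the most confident still-masked
        # positions in parallel, so a growing fraction of the grid is filled.
        masked = tokens == MASK
        quota = int(SEQ_LEN * (step + 1) / NUM_STEPS)
        n_new = max(quota - int((~masked).sum()), 0)
        order = np.argsort(-np.where(masked, confidence, -np.inf))
        chosen = order[:n_new]
        tokens[chosen] = candidates[chosen]

        # Error revision (assumed form): re-open a small fraction of the
        # least-confident committed tokens so later steps can correct them.
        if step < NUM_STEPS - 1:
            committed = np.flatnonzero(tokens != MASK)
            k = int(revise_fraction * committed.size)
            if k > 0:
                worst = committed[np.argsort(confidence[committed])[:k]]
                tokens[worst] = MASK

    # Safety: fill any position that is somehow still masked.
    if (tokens == MASK).any():
        fallback = predict_token_logits(tokens, text_prompt).argmax(axis=-1)
        tokens = np.where(tokens == MASK, fallback, tokens)
    return tokens


if __name__ == "__main__":
    image_tokens = progressive_generate("a dog running on the beach")
    print(image_tokens.shape, image_tokens.min(), image_tokens.max())
```

Because each refinement step commits many tokens at once, the number of model calls scales with the number of steps rather than with the sequence length, which is consistent with the inference acceleration the abstract reports for parallel generation.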

Authors (4)
  1. Zhengcong Fei
  2. Mingyuan Fan
  3. Li Zhu
  4. Junshi Huang
Citations (3)
