
Image Content Generation with Causal Reasoning (2312.07132v1)

Published 12 Dec 2023 in cs.CV and cs.MM

Abstract: The emergence of ChatGPT has once again sparked research in generative artificial intelligence (GAI). While people have been amazed by the generated results, they have also noticed the reasoning ability reflected in the generated text. However, this capacity for causal reasoning is currently limited to language generation, as in models such as GPT-3; there is no equivalent research in the visual modality. Causal reasoning in visual content generation is significant because visual information contains infinite granularity; in particular, images can provide more intuitive and specific demonstrations for certain reasoning tasks than coarse-grained text. Hence, we propose a new image generation task called visual question answering with image (VQAI) and build a dataset of the same name based on the classic Tom and Jerry animated series. We also develop a new image generation paradigm to tackle the challenges of this task. Finally, we perform extensive experiments and analyses, including visualizations of the generated content and discussions of its potentials and limitations. The code and dataset are publicly available under the CC BY-NC-SA 4.0 license for academic and non-commercial use at: https://github.com/IEIT-AGI/MIX-Shannon/blob/main/projects/VQAI/lgd_vqai.md.
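The abstract frames VQAI as generating an image that depicts the causal consequence of a question asked about an input image. As a purely illustrative sketch (not taken from the released code or dataset), the Python below shows one way such an example record might be represented and loaded; the field names and the tab-separated manifest format are assumptions, so consult the linked repository for the actual schema.

```python
from dataclasses import dataclass

# Hypothetical layout of one VQAI example; field names are illustrative only.
@dataclass
class VQAIExample:
    cause_image_path: str   # initial "cause" frame from the source cartoon
    question: str           # causal question about what happens next
    effect_image_path: str  # ground-truth frame depicting the consequence

def load_examples(manifest_lines):
    """Parse an assumed tab-separated manifest into VQAIExample records."""
    examples = []
    for line in manifest_lines:
        cause, question, effect = line.rstrip("\n").split("\t")
        examples.append(VQAIExample(cause, question, effect))
    return examples

if __name__ == "__main__":
    demo = ["cause_001.png\tWhat happens if Jerry pulls the lever?\teffect_001.png"]
    for ex in load_examples(demo):
        print(ex.question, "->", ex.effect_image_path)
```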
