Generating Coherent Sequences of Visual Illustrations for Real-World Manual Tasks (2405.10122v1)
Abstract: Multistep instructions, such as recipes and how-to guides, benefit greatly from visual aids, such as a series of images accompanying the instruction steps. While LLMs have become adept at generating coherent textual steps, Large Vision-Language Models (LVLMs) remain less capable of generating the accompanying image sequences. The most challenging aspect is that each generated image must adhere to the corresponding textual step instruction while remaining visually consistent with earlier images in the sequence. To address this problem, we propose an approach for generating consistent image sequences that integrates a Latent Diffusion Model (LDM) with an LLM: the LLM transforms the step sequence into a caption that preserves the semantic coherence of the sequence. In addition, to maintain the visual coherence of the image sequence, we introduce a copy mechanism that initialises the reverse diffusion process with a latent vector taken from a previously generated image of a relevant step. Both strategies condition the reverse diffusion process on the sequence of instruction steps and tie the contents of the current image to previous instruction steps and their corresponding images. Experiments show that the proposed approach is preferred by humans in 46.6% of cases, against 26.6% for the second-best method. In addition, automatic metrics show that the proposed method maintains semantic coherence and visual consistency across steps in both domains.
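The copy mechanism described above can be illustrated with a minimal sketch. This is not the authors' code: it only assumes a standard DDPM-style forward process, where the latent of the previously generated image is noised to an intermediate timestep and used as the starting point of the reverse diffusion, instead of pure Gaussian noise. The function name and array shapes are hypothetical.

```python
import numpy as np

def init_latent_from_previous(prev_latent: np.ndarray,
                              alpha_bar_t: float,
                              rng: np.random.Generator) -> np.ndarray:
    """Illustrative sketch of a copy-style initialisation (assumption, not
    the paper's implementation): noise the previous step's latent to
    timestep t with the DDPM forward process,
        q(x_t | x_0) = N(sqrt(alpha_bar_t) * x_0, (1 - alpha_bar_t) * I),
    and start denoising from it. Structure shared with the earlier image
    survives in the initial latent, which encourages visual consistency."""
    noise = rng.standard_normal(prev_latent.shape)
    return np.sqrt(alpha_bar_t) * prev_latent + np.sqrt(1.0 - alpha_bar_t) * noise
```

The scalar `alpha_bar_t` controls how much of the previous image is copied: values near 1 keep the previous latent almost intact, while values near 0 reduce to the usual pure-noise initialisation.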