Make-A-Storyboard: A General Framework for Storyboard with Disentangled and Merged Control (2312.07549v1)

Published 6 Dec 2023 in cs.CV

Abstract: Story Visualization aims to generate images aligned with story prompts, reflecting the coherence of storybooks through visual consistency among characters and scenes. However, current approaches concentrate exclusively on characters and neglect the visual consistency among contextually correlated scenes, resulting in independent character images without inter-image coherence. To tackle this issue, we propose a new presentation form for Story Visualization called Storyboard, inspired by film-making, as illustrated in Fig. 1. Specifically, a Storyboard unfolds a story into visual representations scene by scene. Within each scene of a Storyboard, characters engage in activities at the same location, necessitating both visually consistent scenes and characters. For Storyboard, we design a general framework, coined Make-A-Storyboard, that applies disentangled control over the consistency of contextually correlated characters and scenes and then merges them to form harmonized images. Extensive experiments demonstrate 1) Effectiveness: the method achieves strong story alignment, character consistency, and scene correlation; 2) Generalization: our method can be seamlessly integrated into mainstream Image Customization methods, empowering them with the capability of story visualization.
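The control flow the abstract describes — keeping character consistency and scene consistency in separate, disentangled streams and then merging them per frame — can be sketched structurally as follows. This is a minimal illustrative sketch, not the paper's diffusion-based implementation: the class name, the string-token "rendering", and the caching scheme are all assumptions introduced here to show the idea that each character and each scene concept is resolved once and reused across every frame that references it.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of the disentangled-and-merged control flow:
# character consistency and scene consistency are maintained in
# separate concept banks, then merged into one image per story prompt.
# Placeholder strings stand in for the paper's learned visual concepts.

@dataclass
class StoryboardSketch:
    # One cached concept per character and per location, so every frame
    # reusing the same character/scene stays visually consistent.
    character_bank: dict = field(default_factory=dict)
    scene_bank: dict = field(default_factory=dict)

    def _get_character(self, name: str) -> str:
        # Disentangled character control: resolve once, then reuse.
        return self.character_bank.setdefault(name, f"<char:{name}>")

    def _get_scene(self, location: str) -> str:
        # Disentangled scene control: one shared background per location.
        return self.scene_bank.setdefault(location, f"<scene:{location}>")

    def render(self, prompts: list[tuple[str, str, str]]) -> list[str]:
        # Each prompt is (character, location, action); the cached scene
        # and character concepts are merged into one harmonized frame.
        return [
            f"{self._get_scene(loc)} + {self._get_character(char)}: {action}"
            for char, loc, action in prompts
        ]
```

Two prompts set in the same location share a single scene entry, which is the inter-image scene coherence the abstract argues prior character-only approaches lack.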
