PosterLlama: Bridging Design Ability of Language Model to Contents-Aware Layout Generation (2404.00995v3)
Abstract: Visual layout plays a critical role in graphic design fields such as advertising, posters, and web UI design. The recent trend towards content-aware layout generation through generative models has shown promise, yet it often overlooks the semantic intricacies of layout design by treating it as a simple numerical optimization problem. To bridge this gap, we introduce PosterLlama, a network designed for generating visually and textually coherent layouts by reformatting layout elements into HTML code and leveraging the rich design knowledge embedded within LLMs. Furthermore, we enhance the robustness of our model with a unique depth-based poster augmentation strategy. This ensures that our generated layouts are not only semantically rich but also visually appealing, even with limited data. Our extensive evaluations across several benchmarks demonstrate that PosterLlama outperforms existing methods in producing authentic and content-aware layouts. It supports a wide range of conditions, including unconditional layout generation, element-conditional layout generation, and layout completion, making it a highly versatile tool for user manipulation.
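To make the idea of "reformatting layout elements into HTML code" concrete, the following is a minimal sketch of how a layout might be serialized into an HTML-like string that a language model can read or complete. The abstract does not specify the actual template, so the `Element` class, the `layout_to_html` function, the `<rect>` tags, the `data-category` attribute, the category names, and the canvas size are all illustrative assumptions, not the authors' format.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Element:
    """One layout element: a category label plus a pixel bounding box.

    Hypothetical structure; the paper's real schema may differ.
    """
    category: str  # assumed labels such as "text" or "logo"
    left: int
    top: int
    width: int
    height: int


def layout_to_html(elements: List[Element], canvas_w: int, canvas_h: int) -> str:
    """Serialize a layout as an HTML-like string (illustrative template only)."""
    lines = ["<html>", f'<body><svg width="{canvas_w}" height="{canvas_h}">']
    for e in elements:
        # One tag per element; coordinates are written as plain integers.
        lines.append(
            f'<rect data-category="{e.category}" '
            f'x="{e.left}" y="{e.top}" width="{e.width}" height="{e.height}"/>'
        )
    lines.append("</svg></body>")
    lines.append("</html>")
    return "\n".join(lines)


# Usage: two made-up elements on an arbitrarily sized canvas.
poster = [Element("logo", 20, 10, 120, 40), Element("text", 20, 70, 300, 60)]
print(layout_to_html(poster, canvas_w=512, canvas_h=768))
```

Expressing coordinates as markup attributes rather than bare number sequences is one plausible way to let a model draw on what it has learned from web code, which is the motivation the abstract gives for the HTML representation.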
Authors: Jaejung Seol, Seojun Kim, Jaejun Yoo