PosterCopilot: Framework for Poster Design
- PosterCopilot is a framework for professional poster design that integrates geometric layout reasoning, aesthetic feedback, and granular, iterative layer editing.
- Its architecture couples a Design Master with a generative agent, employing a progressive three-stage training approach to boost controllability and visual fidelity.
- Experimental results demonstrate over 74% human win-rate in layout rationality and style consistency, validating its structured, iterative refinement process.
PosterCopilot is a framework for professional poster graphic design that unifies geometric layout reasoning, aesthetic feedback, and granular, iterative layer editing. It addresses longstanding fundamental limitations in LMM-driven graphic design systems, particularly regarding geometric accuracy, structured editing, and the capacity for professional multi-stage workflows. PosterCopilot leverages a progressive three-stage training approach combined with an LMM-powered generative agent and structured layout representation, delivering industry-leading controllability, fidelity, and visual appeal (Wei et al., 3 Dec 2025).
1. Motivation and Design Challenges
Conventional LMM-based or diffusion-based poster generators commonly suffer from two critical drawbacks. First, representing element coordinates as discrete tokens leads to geometric signal collapse (the geometric signal is near-zero in token space) and severe misalignments, particularly in posters with multiple, overlapping assets. Second, most pipelines generate only initial drafts; they lack mechanisms for round-trip, lossless editing at the layer level, which undermines their utility for professional, iterative workflows where precise element localization, resizing, and style tuning are essential. Diffusion approaches generate all regions jointly, which precludes single-layer refinement without introducing global visual artifacts (Wei et al., 3 Dec 2025).
2. Architecture and Workflow
PosterCopilot comprises two principal subsystems: the design model (termed “Design Master”) and a generative agent. The workflow encompasses multi-stage asset acquisition, layout reasoning, and user-driven iterative editing steps:
- Asset Initialization: If required, missing visual assets are synthesized by a “reception model” (Qwen-2.5-7B) that generates textual plus stylistic descriptions. These are then rendered using a T2I engine (Qwen-Image-Edit-2509).
- Structured Layout: The Design Master accepts multimodal prompts (rasterized text layers, images, shapes, canvas dimensions) and emits a graphical layout: a set of elements, each with a bounding box and a z-order value (layer stacking).
- Iterative Editing: Users can command explicit layer manipulations (position, scale, replacement). The system parses commands, then either:
  - Triggers targeted asset refinement via controlled T2I synthesis, or
  - Propagates updates to the Design Master, with all non-target layers frozen to preserve global context and consistency.
- Global Layout Update: Following edits, the layout is re-optimized, ensuring spatial and stylistic harmony across layers.
- Cycle: The Iterative Editing and Global Layout Update steps repeat until the desired refinement is achieved.
This pipeline supports both end-to-end draft generation and highly controllable, “fused context” round-trip editing (Wei et al., 3 Dec 2025).
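The structured layout described above can be pictured as a small data structure. The following is a minimal sketch, not the system's actual schema: the `Layer` class, its field names, the corner-based box convention, and the helpers `paint_order` and `in_canvas` are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical box convention: (x_min, y_min, x_max, y_max) in canvas pixels.
Box = Tuple[float, float, float, float]


@dataclass(frozen=True)
class Layer:
    name: str
    kind: str  # "image", "text", or "shape"
    box: Box
    z: int     # stacking order; larger values are drawn later (on top)


def paint_order(layers: List[Layer]) -> List[str]:
    """Return layer names in compositing order, back to front."""
    return [layer.name for layer in sorted(layers, key=lambda l: l.z)]


def in_canvas(layer: Layer, width: float, height: float) -> bool:
    """Check that a box is non-degenerate and lies inside the canvas."""
    x0, y0, x1, y1 = layer.box
    return 0 <= x0 < x1 <= width and 0 <= y0 < y1 <= height


layout = [
    Layer("title", "text", (60, 40, 540, 140), z=2),
    Layer("background", "image", (0, 0, 600, 800), z=0),
    Layer("badge", "shape", (420, 600, 560, 740), z=1),
]
print(paint_order(layout))  # ['background', 'badge', 'title']
```

Keeping the z-order as an explicit field, rather than relying on list position, means a single layer can be restacked without touching the others.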
3. Progressive Three-Stage Training
The Design Master model is trained in three distinct phases, each targeting a different facet of layout intelligence:
a. Perturbed Supervised Fine-Tuning (PSFT)
PSFT mitigates the geometric coarseness induced by tokenization. Training is performed over a set of Gaussian-perturbed variants centered at the ground-truth layout.
This yields a smoother, more accurate mapping from prompt to spatial structure by rewarding neighborhood geometry, not just discrete ground truth (Wei et al., 3 Dec 2025).
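To make the perturbation idea concrete, the sketch below draws Gaussian-perturbed copies of a ground-truth layout. It is an illustration under stated assumptions: normalized corner coordinates, independent noise per coordinate, and the names `perturb_layout`, `perturbed_targets`, `sigma`, and `k` are hypothetical. How the perturbed variants enter the training loss is not specified here and is left out.

```python
import random
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # assumed (x_min, y_min, x_max, y_max)


def perturb_layout(boxes: List[Box], sigma: float, rng: random.Random) -> List[Box]:
    """Add independent zero-mean Gaussian noise to every box coordinate."""
    out = []
    for box in boxes:
        nx0, ny0, nx1, ny1 = (c + rng.gauss(0.0, sigma) for c in box)
        # Re-order corners so each perturbed box keeps x_min <= x_max, y_min <= y_max.
        out.append((min(nx0, nx1), min(ny0, ny1), max(nx0, nx1), max(ny0, ny1)))
    return out


def perturbed_targets(boxes: List[Box], k: int, sigma: float, seed: int = 0) -> List[List[Box]]:
    """Draw k perturbed layouts centered at the ground-truth layout."""
    rng = random.Random(seed)
    return [perturb_layout(boxes, sigma, rng) for _ in range(k)]


ground_truth = [(0.10, 0.05, 0.90, 0.20), (0.25, 0.30, 0.75, 0.80)]
variants = perturbed_targets(ground_truth, k=4, sigma=0.01)
```

A small `sigma` keeps every variant in the neighborhood of the ground truth, which matches the stated goal of rewarding neighborhood geometry.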
b. Reinforcement Learning for Visual-Reality Alignment (RL-VRA)
RL-VRA introduces verifiable geometric feedback. The reward combines spatial coherence (DIoU), aspect-ratio penalties, size penalties, and binary syntactic validity checks.
Optimization employs GRPO for high-dimensional action spaces, driving the policy towards layouts with superior geometric and stacking fidelity (Wei et al., 3 Dec 2025).
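The four reward ingredients named above can be combined in many ways. The sketch below shows one possible combination for a single predicted box, using the standard DIoU definition (IoU minus the squared center distance over the squared diagonal of the smallest enclosing box). The penalty forms, the weights `w_aspect` and `w_size`, the `invalid_reward` value, and the use of validity as a gate are assumptions for illustration, not the published formula.

```python
import math
from typing import Tuple

Box = Tuple[float, float, float, float]  # assumed (x_min, y_min, x_max, y_max)


def area(b: Box) -> float:
    return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])


def iou(a: Box, b: Box) -> float:
    inter = area((max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3])))
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0


def diou(a: Box, b: Box) -> float:
    """Distance-IoU: IoU minus normalized squared distance between box centers."""
    dx = (a[0] + a[2]) / 2 - (b[0] + b[2]) / 2
    dy = (a[1] + a[3]) / 2 - (b[1] + b[3]) / 2
    cw = max(a[2], b[2]) - min(a[0], b[0])  # width of smallest enclosing box
    ch = max(a[3], b[3]) - min(a[1], b[1])  # height of smallest enclosing box
    diag2 = cw * cw + ch * ch
    return iou(a, b) - ((dx * dx + dy * dy) / diag2 if diag2 > 0 else 0.0)


def is_valid(b: Box) -> bool:
    return b[0] < b[2] and b[1] < b[3]


def layout_reward(pred: Box, gt: Box, w_aspect: float = 0.5,
                  w_size: float = 0.5, invalid_reward: float = -1.0) -> float:
    """Hypothetical reward: DIoU minus log-ratio penalties, gated by validity."""
    if not is_valid(pred):
        return invalid_reward
    pw, ph = pred[2] - pred[0], pred[3] - pred[1]
    gw, gh = gt[2] - gt[0], gt[3] - gt[1]
    aspect_pen = abs(math.log((pw / ph) / (gw / gh)))
    size_pen = abs(math.log(area(pred) / area(gt)))
    return diou(pred, gt) - w_aspect * aspect_pen - w_size * size_pen


gt = (100.0, 100.0, 300.0, 200.0)
print(layout_reward(gt, gt))  # 1.0
```

Because every term is computed directly from coordinates, the reward is verifiable without a learned judge. The sketch assumes the ground-truth box is itself valid.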
c. Reinforcement Learning from Aesthetic Feedback (RL-AF)
RL-AF adds an aesthetic reward derived from a pre-trained LMM evaluator (VisualQuality-R1).
This objective encourages composition diversity without sacrificing geometric sensibility. The interplay between RL-VRA and RL-AF achieves state-of-the-art performance in both geometric precision and human-judged aesthetics (Wei et al., 3 Dec 2025).
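As a rough illustration of how an aesthetic score might be traded off against geometric reward, the sketch below blends the two with a weight and then normalizes rewards within a group of sampled layouts, in the style of GRPO's group-relative advantages. The weight `aesthetic_weight`, the assumption that aesthetic scores lie in [0, 1], and the assumption that RL-AF reuses group-relative normalization are all hypothetical.

```python
from statistics import mean, pstdev
from typing import List


def combined_reward(geometric: float, aesthetic: float,
                    aesthetic_weight: float = 0.3) -> float:
    """Hypothetical blend of a geometric reward and an aesthetic score."""
    return geometric + aesthetic_weight * aesthetic


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each reward against the mean and std of its sampling group."""
    mu = mean(rewards)
    sd = pstdev(rewards)
    return [(r - mu) / (sd + eps) for r in rewards]


# Four layouts sampled for the same prompt (illustrative numbers only).
geometric_scores = [0.90, 0.80, 0.85, 0.40]
aesthetic_scores = [0.20, 0.90, 0.60, 0.95]

rewards = [combined_reward(g, a) for g, a in zip(geometric_scores, aesthetic_scores)]
advantages = group_relative_advantages(rewards)
best = advantages.index(max(advantages))
```

With these illustrative numbers, the second sample wins: it gives up a little geometric reward for a much higher aesthetic score, while the fourth sample's high aesthetic score cannot compensate for its poor geometry.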
4. Structured Layout Representation and Geometric Metrics
PosterCopilot formalizes the layout as a set of bounding boxes with z-order, with each layer classified by type (image, text, shape). The internal geometry is evaluated using:
- IoU (Intersection-over-Union) for direct spatial overlap
- DIoU (Distance IoU) incorporating center distance for alignment
- IOPR (Inverse Order Pair Ratio) to assess stacking sequence accuracy
- ARD (Aspect-Ratio Distortion) to penalize skewed or compressed elements
These metrics underpin both model optimization and experimental benchmarking, yielding precise and interpretable diagnostics of layout quality (Wei et al., 3 Dec 2025).
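IoU and DIoU follow their standard definitions. For the two less common metrics, the sketch below gives one plausible reading: IOPR as the fraction of layer pairs whose relative stacking order is inverted with respect to the ground truth, and ARD as the absolute log-ratio of predicted to ground-truth aspect ratio. Both formulas and the function names are assumptions; the exact definitions used in the benchmark may differ.

```python
import math
from itertools import combinations
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # assumed (x_min, y_min, x_max, y_max)


def inverse_order_pair_ratio(pred_z: Sequence[int], gt_z: Sequence[int]) -> float:
    """Fraction of layer pairs stacked in the opposite order to the ground truth."""
    pairs = list(combinations(range(len(gt_z)), 2))
    if not pairs:
        return 0.0
    inverted = sum(
        1 for i, j in pairs
        if (pred_z[i] - pred_z[j]) * (gt_z[i] - gt_z[j]) < 0
    )
    return inverted / len(pairs)


def aspect_ratio_distortion(pred: Box, gt: Box) -> float:
    """Absolute log-ratio of aspect ratios; 0 means the shape is preserved."""
    pred_ratio = (pred[2] - pred[0]) / (pred[3] - pred[1])
    gt_ratio = (gt[2] - gt[0]) / (gt[3] - gt[1])
    return abs(math.log(pred_ratio / gt_ratio))


# Swapping the top two of three layers inverts one of the three pairs.
print(inverse_order_pair_ratio([0, 2, 1], [0, 1, 2]))
```

Under this reading, lower is better for both metrics, and a log-ratio treats stretching and squashing by the same factor symmetrically.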
5. Controllable, Layer-Specific Editing
The agent-based interaction paradigm enables granular modification:
- When a layer is edited, all other layers are contextually frozen in the prompt, stabilizing the global design during localization of the update.
- If user feedback or instruction signals new asset content, the system routes the request to T2I with style references for asset synthesis.
- Layer insertions, deletions, or parameter tweaks are immediately reflected in both the structured layout and the rendered composite.
This fusion of structure-aware modeling and generative editing allows for repeated, lossless modifications, an interaction model that professional workflows require but that prior work rarely achieves (Wei et al., 3 Dec 2025, Zhang et al., 12 Jun 2025).
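The frozen-layer behavior can be sketched as an edit that replaces exactly one entry of the layout and tags every other layer as frozen when building the model's context. This is a minimal illustration, assuming a dictionary keyed by layer name; `Layer`, `edit_layer`, and `build_prompt_context` are hypothetical names, not the system's interface.

```python
from dataclasses import dataclass, replace
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # assumed (x_min, y_min, x_max, y_max)


@dataclass(frozen=True)
class Layer:
    box: Box
    z: int


def edit_layer(layout: Dict[str, Layer], target: str, new_box: Box) -> Dict[str, Layer]:
    """Return a new layout in which only the target layer's box has changed."""
    if target not in layout:
        raise KeyError(target)
    updated = dict(layout)  # shallow copy; untouched layers are reused as-is
    updated[target] = replace(layout[target], box=new_box)
    return updated


def build_prompt_context(layout: Dict[str, Layer], target: str) -> List[dict]:
    """Describe every layer, marking all non-target layers as frozen."""
    return [
        {"name": name, "box": layer.box, "z": layer.z, "frozen": name != target}
        for name, layer in layout.items()
    ]


original = {
    "background": Layer((0, 0, 600, 800), z=0),
    "title": Layer((60, 40, 540, 140), z=1),
}
moved = edit_layer(original, "title", (60, 80, 540, 180))
restored = edit_layer(moved, "title", original["title"].box)
context = build_prompt_context(moved, "title")
```

Because layers are immutable and edits return a new layout, an edit followed by its inverse reproduces the original layout exactly, which is the sense of "lossless" illustrated here.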
6. Experimental Results and Benchmarks
PosterCopilot is evaluated on the PosterCopilot Corpus, consisting of 160K professional-level PSD posters with fine-grained layer decompositions. Baselines include commercial platforms (Microsoft Designer, Nano-Banana) and advanced academic systems (LaDeCo, CreatiPoster). Key results include:
- Human pairwise win-rate against each baseline exceeds 74% on all major criteria: layout rationality, text legibility, element preservation, style consistency, instruction following, and visual appeal.
- Automated GPT-5 in-context scoring confirms PosterCopilot’s lead on all metrics except text legibility, where it shows a minor deficit (particularly on small text) against Nano-Banana, which tends to omit minor text and thereby maximizes legibility.
- Ablation studies demonstrate cumulative improvements in geometric and aesthetic metrics across the PSFT, RL-VRA, and RL-AF training stages, with spatial coherence reward yielding the largest IoU gain (Wei et al., 3 Dec 2025).
7. Limitations and Future Directions
PosterCopilot’s current aesthetic feedback is derived from a domain-general LMM (VisualQuality-R1), lacking a poster-specific learned reward. Compositing options are limited to standard blend modes. RL-AF incurs a minor decrease in IoU when aesthetic exploration is favored. Future research trajectories include:
- Training a dedicated “poster aesthetic” reward model on human-provided annotations.
- Expanding the compositing engine to support blend/mask variations and fine-grained effects.
- Meta-learning-based hyperparameter calibration for automated policy reward balancing.
- Integration with real-time, interactive multimodal assistance to facilitate conversational and instruction-driven professional design workflows (Wei et al., 3 Dec 2025).
Key related works: AutoPoster introduces an automated content-aware generation pipeline emphasizing image cleaning, layout, tagline, and style attribute modules (Lin et al., 2023). PosterCraft moves toward unified diffusion-based architectures with preference reward feedback (Chen et al., 12 Jun 2025). CreatiPoster provides a JSON protocol-driven approach for editable multi-layer designs (Zhang et al., 12 Jun 2025). PosterCopilot builds on these paradigms, specializing in geometric accuracy, iterative controllability, and advanced LMM alignment.