AgenticDRS: Multi-Agent Design Evaluation
- AgenticDRS is a multi-agent design review framework that combines graph-based exemplar retrieval with structured design descriptions for comprehensive graphic design evaluation.
- It employs advanced graph matching techniques using Wasserstein and Gromov-Wasserstein distances to select context-aware exemplars and ensure precise assessments.
- Benchmark results show that AgenticDRS delivers superior accuracy and more actionable feedback than adapted baseline evaluation systems on the DRS-BENCH benchmark.
The Agentic Design Review System (AgenticDRS) is a multi-agent, orchestrated framework for the holistic evaluation of graphic designs. It adopts principles from agentic AI, combining graph-based exemplar retrieval, structured prompt expansion, and coordinated multi-agent evaluation to deliver both quantitative scores and actionable, context-aware feedback for design improvement (Nag et al., 14 Aug 2025). This approach emphasizes the aggregation of diverse, expert assessments under the coordination of a meta-agent and demonstrates superior alignment with human expert judgments compared to adapted monolithic baselines.
1. System Architecture and Overview
AgenticDRS embodies a peer-review-inspired, multi-agent architecture in which several specialized “expert” agents collaboratively analyze a given graphic design, guided by a meta-agent. The core system components include:
- Graph-based Retrieval Module (GRAD): Constructs graphs from the input design and an exemplar library to capture semantic, spatial, and structural properties. Enables retrieval of diverse, structurally analogous in-context examples using a composite graph-matching approach based on Wasserstein and Gromov-Wasserstein distances.
- Structured Design Description (SDD): Generates a rich textual description of the layout, anchored by bounding box metadata and rendered via a multimodal LLM. This contextual description guides the agents’ analysis and conditions their output.
- Agentic Evaluation Framework:
- Static Agents: Evaluate universal attributes (e.g., alignment, spacing, typography).
- Dynamic Agents: Instantiated at runtime for contextual or design-dependent attributes (e.g., grouping, stylistic coherence).
- Meta-Agent: Orchestrates the planning, aggregation, and summarization phases, selecting which agents participate and consolidating their feedback into a unified report.
The overall process consists of planning (agent selection and routing), reviewing (multi-agent analysis informed by in-context exemplars and SDD), and summarization (meta-agent aggregation and refinement of results).
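This flow can be summarized as a simple orchestration loop. The sketch below is illustrative only: the interfaces for the GRAD retriever, SDD generator, agents, and meta-agent are assumptions, not the paper's actual API.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class AgenticDRSPipeline:
    """Hypothetical orchestration of the planning -> reviewing -> summarization flow."""
    retrieve_exemplars: Callable    # GRAD: design -> in-context exemplar designs
    describe_layout: Callable       # SDD: design (+ optional bboxes) -> structured text
    static_agents: List[Callable]   # fixed-attribute reviewers
    spawn_dynamic_agents: Callable  # meta-agent planning: SDD -> extra reviewers
    summarize: Callable             # meta-agent aggregation: reviews -> final report

    def run(self, design, bboxes=None):
        # Planning: build context and decide which agents participate.
        exemplars = self.retrieve_exemplars(design)
        sdd = self.describe_layout(design, bboxes)
        agents = list(self.static_agents) + self.spawn_dynamic_agents(sdd)

        # Reviewing: every agent sees the design, its SDD, and the retrieved exemplars.
        reviews = [agent(design, sdd, exemplars) for agent in agents]

        # Summarization: the meta-agent consolidates scores and feedback.
        return self.summarize(reviews)
```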
2. In-Context Exemplar Selection via Graph Matching
The GRAD module is central to AgenticDRS, ensuring that agents operate on design-aware, contextually relevant comparisons rather than abstract global features. This is accomplished by the following:
- Graph Representation: Each design is represented as a graph whose nodes correspond to layout elements (cropped and CLIP-embedded) and whose edges encode spatial relationships (normalized L₂ distances between element centroids) and semantic similarity (cosine distance between embeddings).
- Matching Criteria:
- Wasserstein Distance (WD): Measures node-level alignment via an optimal transport plan $\Phi$ between the node distributions $\mu$ (query) and $\nu$ (exemplar):

$$\mathrm{WD}(\mu, \nu) = \min_{\Phi \in \Pi(\mu, \nu)} \sum_{i,j} \Phi_{ij}\, c(x_i, y_j),$$

with $c(\cdot,\cdot)$ the cosine distance between node embeddings.
- Gromov-Wasserstein Distance (GWD): Assesses alignment of graph edge structures via:

$$\mathrm{GWD}(\mu, \nu) = \min_{\Phi \in \Pi(\mu, \nu)} \sum_{i,j,k,l} \Phi_{ij}\, \Phi_{kl}\, \big| C_{ik} - C'_{jl} \big|,$$

where $C$ and $C'$ denote the intra-graph edge-weight matrices of the query and candidate designs.
- Global Similarity: Additional score from CLIP-based global embedding similarity.
- Composite Score: The candidate in-context exemplar set is obtained by minimizing a composite score that combines the WD and GWD terms with the global CLIP-similarity term, the relative weighting $\lambda$ being chosen via empirical ablation. The top-K exemplars so selected ensure that each agent analyzes the query in light of functionally and structurally relevant designs (see the sketch below).
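A minimal sketch of this retrieval scoring is given below, using the open-source POT library for the optimal-transport terms; the edge-weight mixing `alpha`, the combination weight `lam`, and the overall combination rule are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np
from ot import emd2                          # POT: Python Optimal Transport
from ot.gromov import gromov_wasserstein2

def cosine_dist(a, b):
    """Pairwise cosine distance between two sets of row-vector embeddings."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return 1.0 - a @ b.T

def edge_matrix(node_embs, centroids, alpha=0.5):
    """Intra-graph edge weights mixing spatial and semantic terms (alpha assumed)."""
    spatial = np.linalg.norm(centroids[:, None, :] - centroids[None, :, :], axis=-1)
    semantic = cosine_dist(node_embs, node_embs)
    return alpha * spatial + (1.0 - alpha) * semantic

def grad_score(query, cand, lam=0.5):
    """Composite dissimilarity between query and candidate design graphs.

    Each argument is a tuple (node_embs, edge_mat, global_emb).
    """
    (qn, qe, qg), (cn, ce, cg) = query, cand
    p = np.full(len(qn), 1.0 / len(qn))      # uniform node distributions
    q = np.full(len(cn), 1.0 / len(cn))

    wd = emd2(p, q, cosine_dist(qn, cn))                              # node alignment
    gwd = gromov_wasserstein2(qe, ce, p, q, loss_fun='square_loss')   # edge structure
    gd = cosine_dist(qg[None, :], cg[None, :])[0, 0]                  # global CLIP term

    return lam * (wd + gwd) + (1.0 - lam) * gd

def select_exemplars(query, candidates, k=3):
    """Return the K candidate exemplars with the lowest composite score."""
    return sorted(candidates, key=lambda c: grad_score(query, c))[:k]
```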
3. Structured Prompt Expansion
Structured prompt expansion is realized by the SDD module, which transforms the layout (and its bounding boxes, when available) into an explicit, structured textual description using a multimodal LLM. This supports two core functions:
- Human-Readable Narrative: The SDD details the elements present, their relative spatial arrangement, and the hierarchical organization (e.g., “Title ‘ABC’ top center, image middle, footer text below”), enabling agents to ground their judgments contextually.
- Agent Input Conditioning: The SDD is provided alongside the design image to each evaluating agent, anchoring outputs and reducing mode collapse (i.e., ungrounded or hallucinated attribute assessments).
- Fallback: If bounding boxes are absent, the text description is generated from the image alone.
This mechanism significantly increases the “design awareness” of the agents and is validated via ablation against baselines using only global features or random exemplar selection.
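To make the conditioning concrete, the sketch below assembles an SDD-style prompt from bounding-box metadata before it is sent to a multimodal LLM; the element schema and the prompt wording are illustrative assumptions, not the paper's templates.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple

@dataclass
class Element:
    label: str                                # e.g. "title", "image", "footer text"
    text: Optional[str]                       # textual content, if any
    bbox: Tuple[float, float, float, float]   # (x0, y0, x1, y1), normalized to [0, 1]

def build_sdd_prompt(elements: Sequence[Element]) -> str:
    """Compose a prompt asking a multimodal LLM for a structured design description."""
    if not elements:
        # Fallback: no bounding boxes, so the LLM describes the rendered image alone.
        return ("Describe this graphic design: list every visible element, "
                "its relative position, and the visual hierarchy.")
    lines = ["Describe this graphic design. Known elements and their normalized boxes:"]
    for el in elements:
        content = f" containing '{el.text}'" if el.text else ""
        lines.append(f"- {el.label}{content} at bbox {el.bbox}")
    lines.append("Explain the spatial arrangement and hierarchical organization.")
    return "\n".join(lines)

# Example: a poster with a title, a central image, and a footer line.
prompt = build_sdd_prompt([
    Element("title", "ABC", (0.30, 0.05, 0.70, 0.15)),
    Element("image", None, (0.20, 0.25, 0.80, 0.65)),
    Element("footer text", "Visit us online", (0.25, 0.85, 0.75, 0.95)),
])
```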
4. Multi-Agent Evaluation and Aggregation
AgenticDRS features a hybrid suite of static and dynamic agents, each focused on distinct design attributes designated in the DRS-BENCH benchmark (discussed below):
- Static Agents: Evaluate canonical visual aspects using both the SDD and exemplar set; their scoring policies and feedback formats are fixed.
- Dynamic Agents: Spawned as determined by the meta-agent to address complex, context-sensitive attributes.
- Meta-Agent Role: The meta-agent aggregates agent outputs through a policy

$$R = f_{\text{meta}}\big(\{(s_a, t_a)\}_{a \in \mathcal{A}}\big),$$

where $f_{\text{meta}}$ consolidates the incoming attribute scores $s_a$ and qualitative feedback $t_a$ from the participating agents $\mathcal{A}$ into the final, actionable review $R$.
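In the paper the aggregation policy is realized by an LLM-driven meta-agent; the sketch below substitutes a deliberately simple stand-in (mean scores plus deduplicated suggestions) purely to illustrate the shape of the consolidation step.

```python
from collections import defaultdict
from statistics import mean

def aggregate_reviews(reviews):
    """Consolidate per-agent reviews into a single report (illustrative policy only).

    reviews: iterable of dicts such as
        {"attribute": "alignment", "score": 6.0,
         "feedback": "Header text is off the column grid."}
    """
    scores, suggestions, seen = defaultdict(list), [], set()
    for r in reviews:
        scores[r["attribute"]].append(r["score"])
        key = r["feedback"].strip().lower()
        if key and key not in seen:              # deduplicate repeated suggestions
            seen.add(key)
            suggestions.append(r["feedback"])
    return {
        "attribute_scores": {a: mean(v) for a, v in scores.items()},
        "suggestions": suggestions,              # actionable feedback list
    }
```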
The system architecture and evaluation phases are depicted (in the original paper’s Figure 1) as a sequential flow: Design input → GRAD → SDD → Agent suite → Meta-agent aggregation.
5. DRS-BENCH Benchmark and Experimental Results
AgenticDRS introduces DRS-BENCH, a comprehensive benchmark for evaluating design review systems:
- Attributes: 15 core attributes covering alignment, color harmony, typography, hierarchy, spacing, etc.
- Datasets: Four datasets: GDE (1–10 ratings), Afixa/Infographic (binary labels), and IDD (rich layout metadata).
- Metrics:
- Discrete (multi-label accuracy, sensitivity, specificity).
- Continuous (Pearson correlation with human expert scores).
- Feedback Assessment (Actionable Insights Metrics, AIM).
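The continuous and discrete metrics can be computed with standard libraries; the sketch below is one plausible implementation under that assumption, and the exact aggregation used in DRS-BENCH may differ.

```python
import numpy as np
from scipy.stats import pearsonr

def continuous_metric(system_scores, human_scores):
    """Pearson correlation between system ratings and human expert ratings."""
    r, _ = pearsonr(system_scores, human_scores)
    return r

def discrete_metrics(pred, gold):
    """Accuracy, sensitivity, and specificity for binary pass/fail attribute labels."""
    pred, gold = np.asarray(pred, dtype=bool), np.asarray(gold, dtype=bool)
    tp = np.sum(pred & gold); tn = np.sum(~pred & ~gold)
    fp = np.sum(pred & ~gold); fn = np.sum(~pred & gold)
    return {
        "accuracy": (tp + tn) / gold.size,
        "sensitivity": tp / (tp + fn) if (tp + fn) else float("nan"),
        "specificity": tn / (tn + fp) if (tn + fp) else float("nan"),
    }
```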
Experimental findings include:
- Superior Accuracy: AgenticDRS (using GRAD and SDD on advanced MLLMs) improves discrete and continuous metrics by 10–13 percentage points vs. prior baselines.
- Ablation: Each major module (GRAD, SDD, bounding-box info) is validated to measurably enhance performance; omitting any reduces both accuracy and actionable insight fidelity.
Table 1: Comparison of System Configurations
| Configuration | Discrete Accuracy | Continuous Correlation | AIM Score |
|---|---|---|---|
| Full (GRAD+SDD+BBox) | Highest | Highest | Highest |
| −BBox | Lower | Lower | Lower |
| −SDD | Lower | Lower | Lower |
| Random In-Context | Lowest | Lowest | Lowest |
The table summarizes the relative performance ordering of the system variants as reported in (Nag et al., 14 Aug 2025).
6. Actionable Feedback Generation and Summarization
Each agent produces both a quantitative score and structured qualitative feedback, citing specific elements (e.g., “inadequate contrast in header text”) guided by the SDD. The meta-agent further:
- Aggregates and Deduplicates: Combines individual agent output, filters redundancy, and organizes a prioritized list of actionable suggestions.
- Produces Unified Evaluation: Offers actionable feedback to designers, grounded in both structural context and reference exemplars.
The AIM metric evaluates feedback usefulness; human raters found that AgenticDRS produces actionable, context-aware suggestions more reliably than previous methods.
7. Significance, Generality, and Limitations
AgenticDRS demonstrates that a multi-agent, exemplar-driven, and contextually expanded evaluation system:
- Consistently Outperforms Baselines based on state-of-the-art LLMs without design awareness mechanisms.
- Validates the Necessity of combining graph-structural and semantic retrieval for in-context selection.
- Generalizes the Peer-Review Motif for AI-mediated design assessment.
The framework’s evaluation is rigorous, integrating critical ablation and benchmark-driven analysis. There remain unexplored questions regarding transferability to non-graphic design tasks and the scalability of dynamic agent orchestration as domain complexity increases. The open-source release of the DRS-BENCH benchmark supports further research.
In summary, AgenticDRS formalizes a scalable, agentic peer-review paradigm for graphic design assessment, leveraging graph-matched exemplars and structured prompts to inform agentic multi-perspective analysis. Coordinated by a meta-agent, the multi-agent ensemble achieves state-of-the-art qualitative and quantitative performance on standardized benchmarks and sets a precedent for principled, automated generation of holistic, actionable design feedback (Nag et al., 14 Aug 2025).