- The paper introduces CAMEO, a multi-agent framework that enhances multi-constraint consistency through closed-loop, quality-aware regulation.
- It decomposes image editing into strategic orchestration, adaptive reference grounding, and iterative quality control to tackle semantic, geometric, and contextual challenges.
- Empirical results show a 20% win rate improvement over baselines in tasks like road anomaly insertion and human pose switching, validating its practical effectiveness.
CAMEO: A Conditional and Quality-Aware Multi-Agent Image Editing Orchestrator
Introduction and Motivation
Conditional image editing modifies an input image according to a textual prompt and optional references. It is increasingly used in workflows that require high-fidelity control, such as anomaly insertion for autonomous driving scenarios and human pose transformation for data curation. Prevailing diffusion-based editing paradigms achieve good semantic alignment and image realism but are fundamentally limited by their single-step, open-loop architectures. These systems offer little explicit quality control and frequently introduce semantic, geometric, or contextual inconsistencies, particularly under complex multi-constraint instructions. Prompt engineering and repeated sampling are typically needed to mitigate such errors, which points to intrinsic deficiencies in the generation strategy.
In response, the CAMEO framework is introduced as a structured, multi-agent system that views image editing as an explicitly regulated, closed-loop optimization problem, embedding evaluation and iterative refinement directly into the editing pipeline. Through hierarchical orchestration, dynamic constraint activation, adaptive use of external references, and quality-aware feedback mechanisms, CAMEO demonstrates statistically significant gains in robustness, controllability, and multi-constraint consistency compared to leading one-shot methods.
Figure 1: Comparison of CAMEO with contemporary state-of-the-art image editing models, highlighting improved structural consistency and semantic adherence.
The paper identifies critical challenges in current conditional image editing:
- Multi-Constraint Enforcement: Structurally difficult tasks (e.g., complex anomaly insertion, human pose switching) are poorly handled by open-loop models.
- Absence of Intrinsic Quality Regulation: Editing quality is treated as a post hoc metric rather than a controllable parameter during synthesis, resulting in persistent artifacts.
- Static Reference Conditioning: Uniform application of reference images or maps results in suboptimal performance and possible overconstraining.
Figure 2: Failure cases from state-of-the-art conditional image editing on BDD100K. Common issues include lack of semantic alignment, physical implausibility, and structural artifacts.
While recent work on LMM-based editing models, tree-based planners, and modular agent frameworks introduces forms of iterative reasoning and structured prompting, these approaches either lack unified architectural integration or do not combine control, reference grounding, and quality-aware refinement within a single system. Existing benchmarks also inadequately probe scenarios requiring simultaneous satisfaction of semantic, physical, and contextual constraints.
CAMEO Multi-Agent Architecture
Hierarchy and Functional Decomposition
CAMEO divides the image editing process into three agent tiers:
- Orchestration (Strategic Director): Interprets instruction intent, determines active constraints, and dynamically triggers reference agents as task complexity demands.
- Utility (Instruction Architect, Visual Research Specialist, Generative Creator): Translates high-level instructions into constraint-enriched prompts, sources or generates reference images (textual, visual, hybrid), then synthesizes candidate edited images.
- Regulation (Quality Critic, Refinement Editor): Performs dimension-specific evaluation of intermediate outputs and triggers targeted corrective edits until active constraints are satisfied or a cutoff is reached.
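The orchestration step described above can be sketched as a small planning function. This is a minimal illustration, not the paper's implementation: the names `Constraint`, `EditPlan`, and `plan_edit`, and the keyword cues, are all hypothetical, and the keyword matching merely stands in for whatever model-based reasoning the Strategic Director actually performs.

```python
from dataclasses import dataclass, field
from enum import Enum


class Constraint(Enum):
    SEMANTIC = "semantic correctness"
    PHYSICAL = "physical plausibility"
    CONTEXTUAL = "contextual coherence"


@dataclass
class EditPlan:
    instruction: str
    active_constraints: set = field(default_factory=set)
    use_reference: bool = False  # whether to trigger a reference agent


# Hypothetical cue lists (assumptions for illustration only).
PHYSICAL_CUES = ("insert", "place", "pose", "weather")
CONTEXTUAL_CUES = ("insert", "add")
LARGE_CHANGE_CUES = ("pose", "replace", "switch")


def plan_edit(instruction: str) -> EditPlan:
    """Decide which constraints are active and whether references are needed."""
    text = instruction.lower()
    plan = EditPlan(instruction=instruction)
    # Semantic correctness is assumed to apply to every edit.
    plan.active_constraints.add(Constraint.SEMANTIC)
    if any(cue in text for cue in PHYSICAL_CUES):
        plan.active_constraints.add(Constraint.PHYSICAL)
    if any(cue in text for cue in CONTEXTUAL_CUES):
        plan.active_constraints.add(Constraint.CONTEXTUAL)
    # Reference grounding is triggered only for large transformations.
    plan.use_reference = any(cue in text for cue in LARGE_CHANGE_CUES)
    return plan


pose_plan = plan_edit("Switch the person to a sitting pose")
tint_plan = plan_edit("Make the sky slightly bluer")
```

In this sketch a small color change activates only the semantic constraint and skips reference retrieval, while a pose switch activates the physical constraint and requests a reference, mirroring the selective activation the tier list describes.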
Figure 3: Schematic of the CAMEO multi-agent workflow, including strategic planning, modular agent deployment, iterative evaluation, and refinement.
Central to the framework is the closed-loop quality-aware editing process, where each candidate output is explicitly evaluated against task-adaptive criteria (e.g., semantic correctness, physical plausibility, contextual coherence), and refinements are directed by structured diagnostic feedback, not mere resampling. The reference input strategy is adaptive, invoked selectively based on assessed task difficulty and transformation magnitude.
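The closed-loop process can be illustrated with a short control loop. The sketch below is an assumption-laden toy rather than CAMEO's actual code: `closed_loop_edit`, `Critique`, the `threshold` and `max_rounds` parameters, and the toy agents are hypothetical, and an "image" is represented only by its per-criterion quality scores so that the example runs without any model.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Critique:
    scores: Dict[str, float]    # per-criterion score, assumed to lie in [0, 1]
    feedback: Dict[str, str]    # per-criterion diagnostic text


def closed_loop_edit(source, prompt, criteria, generate, critique, refine,
                     threshold=0.8, max_rounds=3):
    """Generate once, then evaluate and refine until all criteria pass or the cutoff is hit."""
    candidate = generate(source, prompt)
    for rounds_used in range(max_rounds):
        report = critique(candidate, criteria)
        failing = [c for c in criteria if report.scores[c] < threshold]
        if not failing:
            return candidate, rounds_used
        # Pass only the diagnostics for failing criteria: a targeted fix, not a resample.
        candidate = refine(candidate, {c: report.feedback[c] for c in failing})
    return candidate, max_rounds


# Toy stand-ins for the Generative Creator, Quality Critic, and Refinement Editor.
def toy_generate(source, prompt):
    return dict(source)


def toy_critique(candidate, criteria):
    return Critique(scores={c: candidate[c] for c in criteria},
                    feedback={c: f"improve {c}" for c in criteria})


def toy_refine(candidate, notes):
    # Raise only the criteria named in the diagnostic notes.
    return {c: min(1.0, s + 0.2) if c in notes else s for c, s in candidate.items()}


result, rounds = closed_loop_edit(
    {"semantic": 0.9, "physical": 0.5}, "insert a traffic cone",
    ["semantic", "physical"], toy_generate, toy_critique, toy_refine)
```

The design point is that the refinement step receives structured, per-criterion feedback and leaves passing dimensions untouched, which is what distinguishes this loop from repeated sampling.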
Experimental Evaluation
Tasks and Protocol
CAMEO is evaluated on two representative domains:
- Road Anomaly Insertion (BDD100K, 10k samples): Robustness and fidelity under complex scene modifications involving structured anomaly placement and weather changes.
- Human Pose Switching (custom benchmark, 10k samples): Structural transformation and anatomical coherence under large pose manipulations.
Backbone models include Qwen Image Edit Plus, FLUX 2 Pro, Seedream 4.5, and Nano Banana Pro. Performance is assessed using both automated VLM-based judges (Qwen3-VL-Plus, GPT-4o, Gemini-2.5, Claude-Opus-4.5) and comprehensive human preference studies.
Main Results
CAMEO outperforms direct editing baselines on both tasks:
- Road Anomaly Insertion: Average win rate improvement of 20% across all backbone/judge pairs, with substantive gains in physical plausibility and contextual coherence.
- Human Pose Switching: Average 20% higher win rate, especially notable for severe pose transformations requiring adaptive reference use and multistep correction.
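As a rough illustration of how a win rate averaged over backbone/judge pairs might be computed from pairwise preferences, consider the sketch below. The summary does not state how ties are handled or whether the 20% figure is an absolute or relative difference, so the tie-counts-as-half rule, the function names, and the sample votes are all assumptions made for illustration and are not data from the paper.

```python
from collections import defaultdict
from statistics import mean


def win_rates(votes):
    """Per-(backbone, judge) win rate of CAMEO; a tie is assumed to count as half a win."""
    tallies = defaultdict(lambda: [0.0, 0])  # pair -> [score, number of comparisons]
    for backbone, judge, winner in votes:
        tally = tallies[(backbone, judge)]
        if winner == "cameo":
            tally[0] += 1.0
        elif winner == "tie":
            tally[0] += 0.5
        tally[1] += 1
    return {pair: score / n for pair, (score, n) in tallies.items()}


def average_win_rate(votes):
    """Unweighted mean of the per-pair win rates."""
    return mean(win_rates(votes).values())


# Made-up votes with placeholder backbone and judge names.
sample_votes = [
    ("backbone_a", "judge_1", "cameo"),
    ("backbone_a", "judge_1", "baseline"),
    ("backbone_a", "judge_1", "cameo"),
    ("backbone_a", "judge_1", "tie"),
    ("backbone_b", "judge_1", "cameo"),
    ("backbone_b", "judge_1", "cameo"),
]
rates = win_rates(sample_votes)
overall = average_win_rate(sample_votes)
```

Averaging per pair before taking the mean keeps a backbone/judge pair with many comparisons from dominating the aggregate; pooling all votes instead would be an equally defensible choice.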
Figure 4: CAMEO achieves superior semantic correctness and physical plausibility over direct editing, substantially mitigating common artifacts such as context-inconsistent insertions.
Figure 5: CAMEO demonstrates improved boundary blending and more coherent integration with the original scene structure.
Figure 6: Qualitative comparison on diverse human pose switching tasks. CAMEO produces anatomically well-aligned figures and more realistic articulation than single-pass baselines.
Ablation Analysis
Critical components of the CAMEO architecture were individually removed to assess their contribution.
Implications and Future Directions
CAMEO provides empirical support for closed-loop, quality-aware regulation and dynamic reference management in conditional image editing, and it serves as a reference point for multi-agent, hierarchical decomposition in high-constraint visual tasks. From a theoretical perspective, the framework aligns editing with classical closed-loop control, in which explicit monitoring and correction drive the output toward a solution that satisfies the active constraints. Practically, these advances could benefit high-stakes domains such as autonomous system simulation, digital content creation, and data-centric AI.
The remaining limitations mainly concern the computational cost of multi-stage orchestration and the ultimate ceiling imposed by the underlying backbone editing and evaluation models. Future research avenues include improved edit trajectory modeling, novel agent collaboration strategies, and the development of richer, more granular evaluation metrics capable of capturing subtle structural errors.
Figure 8: Human evaluation interface used for pairwise image assessment in the study.
Conclusion
CAMEO reframes conditional image editing as an iterative, explicit-constraint optimization process, leveraging hierarchical agent orchestration, adaptive reference grounding, and closed-loop quality regulation. The reported results establish robust and controllable improvements over state-of-the-art single-pass pipelines, especially under challenging multi-constraint scenarios. The proposed methodology and evaluation strategies indicate promising directions for the next generation of AI-driven image editing systems, particularly in specialized domains demanding strict semantic, physical, and contextual alignment.
Reference: "CAMEO: A Conditional and Quality-Aware Multi-Agent Image Editing Orchestrator" (2604.03156)