CAMEO: A Conditional and Quality-Aware Multi-Agent Image Editing Orchestrator

Published 3 Apr 2026 in cs.CV | (2604.03156v1)

Abstract: Conditional image editing aims to modify a source image according to textual prompts and optional reference guidance. Such editing is crucial in scenarios requiring strict structural control (e.g., anomaly insertion in driving scenes and complex human pose transformation). Despite recent advances in large-scale editing models (e.g., Seedream, Nano Banana), most approaches rely on single-step generation. This paradigm often lacks explicit quality control, may introduce excessive deviation from the original image, and frequently produces structural artifacts or environment-inconsistent modifications, typically requiring manual prompt tuning to achieve acceptable results. We propose CAMEO, a structured multi-agent framework that reformulates conditional editing as a quality-aware, feedback-driven process rather than a one-shot generation task. CAMEO decomposes editing into coordinated stages of planning, structured prompting, hypothesis generation, and adaptive reference grounding, where external guidance is invoked only when task complexity requires it. To overcome the lack of intrinsic quality control in existing methods, evaluation is embedded directly within the editing loop. Intermediate results are iteratively refined through structured feedback, forming a closed-loop process that progressively corrects structural and contextual inconsistencies. We evaluate CAMEO on anomaly insertion and human pose switching tasks. Across multiple strong editing backbones and independent evaluation models, CAMEO consistently achieves a 20% higher win rate on average than multiple state-of-the-art models, demonstrating improved robustness, controllability, and structural reliability in conditional image editing.

Authors (7)

Summary

The paper introduces CAMEO, a multi-agent framework that enhances multi-constraint consistency through closed-loop, quality-aware regulation.
It decomposes image editing into strategic orchestration, adaptive reference grounding, and iterative quality control to tackle semantic, geometric, and contextual challenges.
Empirical results show a 20% win rate improvement over baselines in tasks like road anomaly insertion and human pose switching, validating its practical effectiveness.

Introduction and Motivation

Conditional image editing involves modifying a given input image according to textual prompts and optional references, increasingly applied in workflows requiring high-fidelity control, such as anomaly insertion in autonomous driving scenarios and human pose transformation for data curation. Prevailing diffusion-based image editing paradigms, while capable in semantic alignment and image realism, are fundamentally limited by their single-step open-loop architectures. These systems offer limited explicit quality control and frequently introduce semantic, geometric, or contextual inconsistencies, particularly under complex multi-constraint instructions. Prompt engineering and repeated sampling are typically required to mitigate such errors, indicating intrinsic deficiencies in their generation strategies.

In response, CAMEO is introduced as a structured multi-agent system that treats image editing as an explicitly regulated, closed-loop optimization problem, embedding evaluation and iterative refinement directly into the editing pipeline. Through hierarchical orchestration, dynamic constraint activation, adaptive use of external references, and quality-aware feedback mechanisms, CAMEO reports consistent gains in robustness, controllability, and multi-constraint consistency over leading one-shot methods.

Figure 1: Comparison of CAMEO with contemporary state-of-the-art image editing models, highlighting improved structural consistency and semantic adherence.

The paper identifies critical challenges in current conditional image editing:

Multi-Constraint Enforcement: Structurally difficult tasks (e.g., complex anomaly insertion, human pose switching) are poorly handled by open-loop models.
Absence of Intrinsic Quality Regulation: Editing quality is treated as a post hoc metric rather than a controllable parameter during synthesis, resulting in persistent artifacts.
Static Reference Conditioning: Uniform application of reference images or maps results in suboptimal performance and possible overconstraining.

Figure 2: Failure cases from state-of-the-art conditional image editing on BDD100K—common issues include lack of semantic alignment, physical implausibility, and structural artifacts.

While recent work on LMM-based editing models, tree-based planners, and modular agent frameworks introduces forms of iterative reasoning and structured prompting, these approaches either lack unified architectural integration or do not combine control, reference grounding, and quality-aware refinement within a single system. Existing benchmarks also inadequately probe scenarios requiring simultaneous satisfaction of semantic, physical, and contextual constraints.

CAMEO Multi-Agent Architecture

Hierarchy and Functional Decomposition

CAMEO divides the image editing process into three agent tiers:

Orchestration (Strategic Director): Interprets instruction intent, determines active constraints, and dynamically triggers reference agents as task complexity demands.
Utility (Instruction Architect, Visual Research Specialist, Generative Creator): Translates high-level instructions into constraint-enriched prompts, sources or generates reference images (textual, visual, hybrid), then synthesizes candidate edited images.
Regulation (Quality Critic, Refinement Editor): Performs dimension-specific evaluation of intermediate outputs and triggers targeted corrective edits until active constraints are satisfied or a cutoff is reached.

Figure 3: Schematic of the CAMEO multi-agent workflow, including strategic planning, modular agent deployment, iterative evaluation, and refinement.
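
The Orchestration tier's decision can be illustrated with a minimal sketch. Everything in it is hypothetical: the `TaskEstimate` fields, the constraint names, and the 0.5 threshold are illustrative assumptions rather than details from the paper. It shows only the general idea that constraints are activated per task and that the reference agent is gated on estimated difficulty instead of being invoked every time.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TaskEstimate:
    """Hypothetical summary of an edit request (not taken from the paper)."""
    inserts_object: bool     # does the edit add a new object to the scene?
    changes_structure: bool  # does it alter pose or geometry?
    difficulty: float        # assumed difficulty estimate in [0, 1]


@dataclass
class EditPlan:
    active_constraints: List[str]
    use_reference: bool


def plan_edit(task: TaskEstimate, reference_threshold: float = 0.5) -> EditPlan:
    """Pick the constraints to enforce and decide whether to fetch a reference."""
    constraints = ["semantic_correctness"]  # always checked in this sketch
    if task.inserts_object:
        constraints += ["physical_plausibility", "contextual_coherence"]
    if task.changes_structure:
        constraints.append("structural_consistency")
    # Reference grounding is requested only for sufficiently hard edits.
    use_reference = task.difficulty >= reference_threshold
    return EditPlan(constraints, use_reference)
```

Under these assumptions, a simple recoloring request would activate a single constraint and skip the reference agent, while a large pose change would activate the structural constraint and request a reference.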

Central to the framework is the closed-loop quality-aware editing process, where each candidate output is explicitly evaluated against task-adaptive criteria (e.g., semantic correctness, physical plausibility, contextual coherence), and refinements are directed by structured diagnostic feedback, not mere resampling. The reference input strategy is adaptive, invoked selectively based on assessed task difficulty and transformation magnitude.
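
The closed-loop process described above can be sketched as a generate, critique, refine loop. This is a minimal illustration under stated assumptions, not the authors' implementation: the `generate`, `critique`, and `refine` callables stand in for the Generative Creator, Quality Critic, and Refinement Editor, and the `Critique` format, the 0.8 threshold, and the refinement budget are all hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple


@dataclass
class Critique:
    """Per-dimension scores in [0, 1] plus diagnostic text (hypothetical format)."""
    scores: Dict[str, float]
    feedback: Dict[str, str]


def closed_loop_edit(
    image: Any,
    prompt: str,
    active: List[str],
    generate: Callable[[Any, str], Any],
    critique: Callable[[Any, List[str]], Critique],
    refine: Callable[[Any, Dict[str, str]], Any],
    threshold: float = 0.8,
    max_refinements: int = 3,
) -> Tuple[Any, List[Critique]]:
    """Refine a candidate until all active constraints pass or the budget runs out."""
    candidate = generate(image, prompt)
    history: List[Critique] = []
    for round_idx in range(max_refinements + 1):
        report = critique(candidate, active)
        history.append(report)
        # Only dimensions that are active for this task can fail.
        failing = [d for d in active if report.scores.get(d, 0.0) < threshold]
        if not failing or round_idx == max_refinements:
            break
        # Pass targeted diagnostics to the editor instead of resampling blindly.
        notes = {d: report.feedback.get(d, "") for d in failing}
        candidate = refine(candidate, notes)
    return candidate, history
```

The loop always evaluates the candidate it returns, so the last entry in `history` describes the final output even when the refinement budget is exhausted.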

Experimental Evaluation

Tasks and Protocol

CAMEO is evaluated on two representative domains:

Road Anomaly Insertion (BDD100K, 10k samples): Robustness and fidelity under complex scene modifications involving structured anomaly placement and weather changes.
Human Pose Switching (custom benchmark, 10k samples): Structural transformation and anatomical coherence under large pose manipulations.

Backbone models include Qwen Image Edit Plus, FLUX 2 Pro, Seedream 4.5, and Nano Banana Pro. Performance is assessed using both automated VLM-based judges (Qwen3-VL-Plus, GPT-4o, Gemini-2.5, Claude-Opus-4.5) and comprehensive human preference studies.

Main Results

CAMEO decisively outperforms direct editing baselines:

Road Anomaly Insertion: Average win rate improvement of 20% across all backbone/judge pairs, with substantive gains in physical plausibility and contextual coherence.
Human Pose Switching: Average 20% higher win rate, especially notable for severe pose transformations requiring adaptive reference use and multistep correction.
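
This summary does not spell out how the 20% figure is aggregated. One plausible reading, shown here purely as an assumption, is a per-pair win-rate margin (share of comparisons won by CAMEO minus share won by the baseline) averaged over all backbone/judge pairs. The vote format and function names are hypothetical.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, Tuple

Vote = Tuple[str, str, str]  # (backbone, judge, winner)


def win_rate_margins(votes: Iterable[Vote]) -> Dict[Tuple[str, str], float]:
    """Per (backbone, judge) pair: CAMEO win share minus baseline win share."""
    counts = defaultdict(lambda: {"cameo": 0, "baseline": 0, "tie": 0})
    for backbone, judge, winner in votes:
        counts[(backbone, judge)][winner] += 1  # winner must be a known label
    margins = {}
    for pair, c in counts.items():
        total = sum(c.values())
        margins[pair] = (c["cameo"] - c["baseline"]) / total
    return margins


def average_margin(votes: Iterable[Vote]) -> float:
    """Unweighted mean of the per-pair margins (requires at least one vote)."""
    return mean(win_rate_margins(votes).values())
```

Averaging per pair rather than pooling all votes keeps a judge or backbone with many samples from dominating the result; whether the paper weights pairs this way is not stated here.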

Figure 4: CAMEO achieves superior semantic correctness and physical plausibility over direct editing, substantially mitigating common artifacts such as context-inconsistent insertions.

Figure 5: CAMEO demonstrates improved boundary blending and more coherent integration with the original scene structure.

Figure 6: Qualitative comparison on diverse human pose switching tasks. CAMEO produces anatomically well-aligned figures and more realistic articulation than single-pass baselines.

Ablation Analysis

Critical components of the CAMEO architecture were individually removed to assess their contribution:

Adaptive Reference Grounding: Its removal reliably degraded geometric fidelity and context coherence.
Quality Control: Omitting the Quality Critic and Refinement Editor led to increased failure rates due to uncorrected artifacts.
Iterative Refinement: Disabling this pathway produced stagnant errors, especially on tasks with heavy structural modification.

Figure 7: Ablation study visualizations of full CAMEO (left) versus variants missing key modules, illustrating increased artifacts and reduced semantic fidelity when regulation and adaptive reference mechanisms are omitted.

Implications and Future Directions

CAMEO provides empirical support for closed-loop, quality-aware regulation and dynamic reference management in conditional image editing, and for multi-agent, hierarchical decomposition in high-constraint visual tasks. From a theoretical perspective, the framework aligns editing with classical closed-loop control, with explicit monitoring and correction driving convergence to robust solutions. Practically, these advances could support adoption in high-stakes domains such as autonomous system simulation, digital content creation, and data-centric AI.

The remaining limitations mainly concern the computational cost of multi-stage orchestration and the ultimate ceiling imposed by the underlying backbone editing and evaluation models. Future research avenues include improved edit trajectory modeling, novel agent collaboration strategies, and the development of richer, more granular evaluation metrics capable of capturing subtle structural errors.

Figure 8: Human evaluation interface used for pairwise image assessment in the study.

Conclusion

CAMEO reframes conditional image editing as an iterative, explicit-constraint optimization process, leveraging hierarchical agent orchestration, adaptive reference grounding, and closed-loop quality regulation. The reported results establish robust and controllable improvements over state-of-the-art single-pass pipelines, especially under challenging multi-constraint scenarios. The proposed methodology and evaluation strategies indicate promising directions for the next generation of AI-driven image editing systems, particularly in specialized domains demanding strict semantic, physical, and contextual alignment.

Reference: "CAMEO: A Conditional and Quality-Aware Multi-Agent Image Editing Orchestrator" (2604.03156)
