ImAgent: Unified Multimodal Image Generation
- ImAgent is a unified, training-free multimodal agent that integrates reasoning, prompt engineering, and self-evaluation within a single vision-language model for image generation and editing.
- It employs a dynamic operational loop with a policy controller that selects from diverse actions like prompt enhancement and image detail refinement to optimize output.
- Empirical results show significant improvements over backbone-only baselines across benchmarks, highlighting its efficiency and adaptive test-time performance.
ImAgent is a unified, training-free multimodal agent framework designed for test-time scalable image generation and editing tasks, integrating reasoning, generation, and self-evaluation within a single vision-language foundation model. Unlike prior schemes, which require separate LLMs, vision-LLMs (VLMs), or custom pipelines for each subroutine, ImAgent dynamically coordinates multiple generation actions—prompt engineering, sample curation, iterative refinement—entirely within the backbone model at inference time. This training-free, self-organized loop yields significant improvements over backbone-only baselines across image generation and editing benchmarks, highlighting the practical potential of unified multimodal agents for efficient, adaptive test-time scaling (Wang et al., 14 Nov 2025).
1. Motivation and Problem Statement
Modern text-to-image (T2I) models demonstrate high visual fidelity and semantic competence but remain susceptible to sampling randomness and prompt underspecificity. Such instability manifests as visually inconsistent generations from identical or vague prompts, and a tendency to omit crucial details implied by the input text. Standard mitigation strategies include prompt rewriting by LMs, best-of-N sampling with automatic or manual selection, and self-refinement via iterative critique and revision. However, these approaches function independently, rely on substantial external resources, and incur elevated computational and memory overhead, hindering scalable deployment. Addressing these shortcomings requires a unified, training-free agent architecture capable of dynamic reasoning, generation, and self-critique, operating entirely inside a single multimodal model without external dependencies or fine-tuning.
2. Unified Architecture and Operational Loop
ImAgent is instantiated atop open-source multimodal models such as Bagel or Janus-Pro-7B. Core components are:
- Policy Controller (π): Acts as the decision-making module, formulating a state s_t = {P_0, I_0, P, I, O} at each step t, where P_0 (I_0) denotes the original prompt (image), P (I) is the current prompt (image), and O is the history of evaluation outputs.
- Action Space (A = {a_1, …, a_6}): Comprises five distinct generation/editing subroutines alongside a STOP action.
- Generation Actions (a_1, …, a_5; a_6 for STOP): Subroutines that generate new prompts, images, and/or self-evaluation signals.
- Self-Evaluation Mechanism: Each action produces a numerical or textual critique (e.g., alignment scores, perceptual quality), appended to O to inform subsequent steps.
The operational loop proceeds as follows, with a maximum of T_max (default 5) steps:
- Observe the current state s_t = {P_0, I_0, P, I, O}.
- Sample the next action a_t ~ π(a | s_t) from the policy controller.
- If a_t is STOP, terminate and return the current image.
- Otherwise, execute a_t, producing an updated prompt P, image I, and observation o_t.
- Append o_t to the history O and increment t.
This process is summarized in the following pseudocode:
```python
P, I, O = P0, None, []
for t in range(1, Tmax + 1):
    s_t = {P0, I0, P, I, O}
    a_t = policy_controller(s_t)
    if a_t == 'STOP':
        break
    P, I, o = action_fn[a_t](P, I, O)
    O.append(o)
return I
```
No modifications or fine-tuning are made to the backbone; all behavior is induced via prompting and stateful interaction (Wang et al., 14 Nov 2025).
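The pseudocode above can be fleshed out into a minimal runnable sketch. The stubs below (`policy_controller`, `naive_generation`, the `ACTIONS` registry) are hypothetical stand-ins for prompted backbone calls, not the paper's implementation:

```python
# Minimal sketch of the ImAgent loop. All subroutine names here are
# illustrative placeholders for prompted calls to the backbone model.

T_MAX = 5  # default step budget reported in the paper

def policy_controller(state):
    # Stub: a real controller prompts the backbone with the state
    # and parses the chosen action name from its reply.
    return 'naive_generation' if state['I'] is None else 'STOP'

def naive_generation(P, I, O):
    # Stub: one-shot generation from the current prompt.
    return P, f"image({P})", "generated initial image"

ACTIONS = {'naive_generation': naive_generation}

def run_agent(P0, I0=None):
    P, I, O = P0, I0, []
    for t in range(1, T_MAX + 1):
        state = {'P0': P0, 'I0': I0, 'P': P, 'I': I, 'O': O}
        a_t = policy_controller(state)
        if a_t == 'STOP':
            break
        P, I, o = ACTIONS[a_t](P, I, O)
        O.append(o)
    return I
```

With these stubs, one naive-generation step runs and the controller then elects to STOP, mirroring the early-termination behavior of the loop.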
3. Policy Controller and Action Design
The policy controller uses the current state representation s_t (original and current prompt/image, action-observation history) to select the next action from A via the backbone’s LLM head. Formally, a_t ~ π(a | s_t).
Action selection and execution leverage natural language prompts structured for each subroutine, providing a flexible yet template-driven mechanism for agentic decision-making. ImAgent’s action space is hand-designed and includes:
| Action | Description | Output |
|---|---|---|
| Naive Generation/Editing | One-shot generation with P_0 | Image I |
| Prompt Enhancement w/ CoT | Elaborates P via chain-of-thought reasoning | Enhanced prompt P′ |
| Prompt Revision | Compares I to P, rewrites prompt to address gaps | Revised prompt P′ |
| Image Detail Refinement | Fixes image artifacts without changing prompt | Refined image I′ |
| Best-of-N Sampling | Generates N candidates, picks one with max alignment | Best image I* |
| STOP | Terminates generation if current result is satisfactory | — |
All self-evaluation information is retained in the observation history to inform subsequent policy decisions.
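One way to realize this template-driven selection is to render the state into a text prompt and match the backbone's reply against the declared action names. The template wording and helpers below are assumptions for illustration; only the action names and state fields follow the paper:

```python
# Sketch of prompt-based action selection. The template text and
# parsing logic are hypothetical; a real controller would send the
# rendered prompt to the backbone's LLM head.

ACTION_NAMES = [
    "Naive Generation/Editing", "Prompt Enhancement w/ CoT",
    "Prompt Revision", "Image Detail Refinement",
    "Best-of-N Sampling", "STOP",
]

def build_controller_prompt(state):
    history = "\n".join(f"- {o}" for o in state["O"]) or "- (none)"
    return (
        "You are coordinating image generation.\n"
        f"Original prompt: {state['P0']}\n"
        f"Current prompt: {state['P']}\n"
        f"Evaluation history:\n{history}\n"
        "Choose exactly one next action from: "
        + ", ".join(ACTION_NAMES)
    )

def parse_action(reply):
    # Pick the first declared action name found in the reply;
    # fall back to STOP if nothing matches.
    for name in ACTION_NAMES:
        if name.lower() in reply.lower():
            return name
    return "STOP"
```

Falling back to STOP on an unparsable reply keeps the loop bounded even when the backbone's answer is off-template.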
4. Generation Actions and Self-Evaluation Strategies
ImAgent coordinates its actions as follows:
- Naive Generation: Direct call to the backbone’s generative capability using P_0; ideal when the original prompt suffices.
- Prompt Enhancement with CoT: Employs chain-of-thought reasoning to elaborate the prompt (e.g., decomposing “a cat by a window” into relevant details), then rewrites for greater specificity and coherence.
- Prompt Revision Based on Image Feedback: Compares current image and prompt, generates critiques of mismatches, and proposes prompt revisions.
- Image Detail Refinement: Invokes denser generative routines (such as localized diffusion) to ameliorate artifacts, improving visual quality without altering the prompt.
- Best-of-N Sampling: Produces N variants for the current prompt P, computes alignment scores internally, and selects the top-aligned image.
- STOP: Concludes iteration when the current image I is judged satisfactory.
Self-evaluation information per action can be numeric (alignment or perceptual quality score) or textual (e.g., description of discrepancies).
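The best-of-N action reduces to scoring each candidate with the model's own alignment judgment and keeping the argmax, while emitting a textual observation for the history O. The `generate` and `alignment_score` callables here are hypothetical stand-ins for prompted backbone calls:

```python
# Sketch of best-of-N sampling with internal self-evaluation.
# generate() samples an image for the prompt; alignment_score()
# returns the backbone's prompt-image alignment judgment. Both are
# placeholders, not the paper's actual interfaces.

def best_of_n(prompt, n, generate, alignment_score):
    candidates = [generate(prompt) for _ in range(n)]
    scores = [alignment_score(prompt, img) for img in candidates]
    best = max(range(n), key=lambda i: scores[i])
    # Textual observation appended to the history O.
    observation = f"best-of-{n}: kept candidate {best} (score {scores[best]:.2f})"
    return candidates[best], observation
```

Returning both the winning image and a critique string matches the paper's convention that every action contributes self-evaluation information to the observation history.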
5. Test-Time Scaling and Training-Free Paradigm
ImAgent is expressly training-free: no additional model weights or gradient-based updates are introduced. Instead, its agentic behavior—action selection, generation, critique—emerges from structured prompting of a single pretrained multimodal backbone. All activities (policy decision, subroutine invocation, and self-evaluation) are natively invoked as compositional tasks within the same model scope. This avoids reliance on multiple external LLMs, VLMs, or customized test-time modules, streamlining memory and compute requirements.
Efficiency is achieved by confining execution to a few test-time steps (T_max), with the agent permitted to terminate early (via STOP) if sufficient output quality is achieved. The framework demonstrates test-time latency of a few seconds per example (Wang et al., 14 Nov 2025).
6. Empirical Evaluation
ImAgent is empirically assessed on image generation and editing benchmarks using Bagel and Janus-Pro-7B as backbones. Comparative studies against vanilla backbone use (no agent loop) demonstrate consistent improvements:
| Benchmark/Metric | Backbone Vanilla | ImAgent | Relative Improvement |
|---|---|---|---|
| R2I-Bench (Overall, Bagel) | 0.54 | 0.62 | +14.8% |
| WISE (Overall, Bagel) | 0.52 | 0.63 | +21.2% |
| T2I-ReasonBench (Acc, Bagel) | 41.6 | 51.4 | +23.6% |
| RISEBench (Image Edits, Bagel) | 6.1 | 13.1 | +114.8% |
| KRISBench (Bagel) | 63.16 | 67.13 | +6.3% |
| ImgEdit-Bench (Bagel) | 2.89 | 3.15 | +9.0% |
| GEdit-Bench (English/Chinese, Bagel) | — | — | +5.5% / +5.2% |
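The relative-improvement column follows directly from (ImAgent − baseline) / baseline; a quick check on the reported score pairs, assuming that definition:

```python
# Recompute the relative improvements from the raw benchmark scores,
# assuming relative gain = (agent - baseline) / baseline.

pairs = {
    "R2I-Bench":       (0.54, 0.62),
    "WISE":            (0.52, 0.63),
    "T2I-ReasonBench": (41.6, 51.4),
    "RISEBench":       (6.1, 13.1),
    "KRISBench":       (63.16, 67.13),
    "ImgEdit-Bench":   (2.89, 3.15),
}

gains = {name: 100 * (agent - base) / base
         for name, (base, agent) in pairs.items()}
```

The computed gains reproduce the table (e.g., 14.8% for R2I-Bench and 114.8% for RISEBench) to the reported precision.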
ImAgent matches commercial models (e.g. GPT-4o) on RISEBench and outperforms open-source baselines across editing challenges. It also surpasses commercial Gemini-2.0 on T2I-ReasonBench in overall quality. Improvements are statistically significant across metrics and categories (Wang et al., 14 Nov 2025).
7. Ablation, Limitations, and Prospective Extensions
Ablation studies demonstrate:
- Self-CoT (vanilla with only chain-of-thought) underperforms ImAgent, especially on GEdit-Bench and RISEBench.
- A single-action or random policy (uniform sampling over the action space) yields lower scores than the full state-aware policy controller, validating the necessity of dynamic action sequencing.
- On R2I-Bench, single-action ablations achieve 0.54–0.58, random policy 0.59, while full ImAgent achieves 0.62.
Current limitations include:
- Restricted, hand-designed action space—future frameworks may autonomously discover or expand subroutines.
- Overall performance is bottlenecked by the backbone’s generation and understanding limits; uncorrectable or undetectable errors by the backbone cannot be rectified by any agentic iteration.
- Higher latency compared to one-shot generation, though bounded by efficient controller strategies and early termination.
Future extensions may target richer action spaces, video/3D tasks, and interactive or multi-turn deployments.
ImAgent demonstrates that unified, prompt-driven agentic workflows within a single multimodal foundation model can efficiently and adaptively enhance image generation and editing, closing the gap with bespoke, multi-module pipelines and providing a test-time scalable solution distinct in its operational efficiency and generality (Wang et al., 14 Nov 2025).