
ImAgent: Unified Multimodal Image Generation

Updated 21 November 2025
  • ImAgent is a unified, training-free multimodal agent that integrates reasoning, prompt engineering, and self-evaluation within a single vision-language model for image generation and editing.
  • It employs a dynamic operational loop with a policy controller that selects from diverse actions like prompt enhancement and image detail refinement to optimize output.
  • Empirical results show significant improvements over backbone-only baselines across benchmarks, highlighting its efficiency and adaptive test-time performance.

ImAgent is a unified, training-free multimodal agent framework designed for test-time scalable image generation and editing tasks, integrating reasoning, generation, and self-evaluation within a single vision-language foundation model. Unlike prior schemes, which require separate LLMs, vision-language models (VLMs), or custom pipelines for each subroutine, ImAgent dynamically coordinates multiple generation actions (prompt engineering, sample curation, iterative refinement) entirely within the backbone model at inference time. This training-free, self-organized loop yields significant improvements over backbone-only baselines across image generation and editing benchmarks, highlighting the practical potential of unified multimodal agents for efficient, adaptive test-time scaling (Wang et al., 14 Nov 2025).

1. Motivation and Problem Statement

Modern text-to-image (T2I) models demonstrate high visual fidelity and semantic competence but remain susceptible to sampling randomness and prompt underspecificity. Such instability manifests as visually inconsistent generations from identical or vague prompts, and a tendency to omit crucial details implied by the input text. Standard mitigation strategies include prompt rewriting by LMs, best-of-N sampling with automatic or manual selection, and self-refinement via iterative critique and revision. However, these approaches function independently, rely on substantial external resources, and incur elevated computational and memory overhead, hindering scalable deployment. Addressing these shortcomings requires a unified, training-free agent architecture capable of dynamic reasoning, generation, and self-critique, operating entirely inside a single multimodal model without external dependencies or fine-tuning.

2. Unified Architecture and Operational Loop

ImAgent is instantiated atop open-source multimodal models such as Bagel or Janus-Pro-7B. Core components are:

  • Policy Controller ($\pi_\theta$): Acts as the decision-making module, formulating a state $s_t = \{P_0, I_0, P_t, I_t, \mathcal{O}_{t-1}\}$ at each step, where $P_0$ ($I_0$) denotes the original prompt (image), $P_t$ ($I_t$) is the current prompt (image), and $\mathcal{O}_{t-1}$ is the history of evaluation outputs.
  • Action Space ($\mathcal{A}$): Comprises five distinct generation/editing subroutines alongside a STOP action.
  • Generation Actions ($f_a$ for $a \in \mathcal{A} \setminus \{\text{STOP}\}$): Subroutines that generate new prompts, images, and/or self-evaluation signals.
  • Self-Evaluation Mechanism: Each action produces a numerical or textual critique $o_t$ (e.g., alignment scores, perceptual quality), appended to $\mathcal{O}_t$ to inform subsequent steps.
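
As a hedged illustration of these components, the action space and controller state can be mirrored with plain Python structures; the identifiers below (ACTIONS, make_state) are assumptions for exposition, not names from the paper's implementation.

from typing import Any, Dict, List, Optional

# Hypothetical identifiers for the action space A: five generation/editing
# subroutines plus STOP (names are illustrative, not from the paper).
ACTIONS = (
    "naive_generation",
    "prompt_enhancement_cot",
    "prompt_revision",
    "image_detail_refinement",
    "best_of_n_sampling",
    "STOP",
)

def make_state(P0: str, I0: Optional[Any], P: str, I: Optional[Any],
               O: List[Any]) -> Dict[str, Any]:
    """Assemble the controller's observation s_t = {P_0, I_0, P_t, I_t, O_{t-1}}."""
    return {"P0": P0, "I0": I0, "P": P, "I": I, "O": O}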

The operational loop proceeds as follows, with a maximum of $T_{\mathrm{max}}$ (default 5) steps:

  1. Observe current state $s_t$.
  2. Sample $a_t \sim \pi_\theta(a \mid s_t)$.
  3. If $a_t$ is STOP, terminate.
  4. Otherwise, execute $(P_{t+1}, I_{t+1}, o_t) \leftarrow f_{a_t}(P_t, I_t, \mathcal{O}_{t-1})$.
  5. Update $\mathcal{O}_t \leftarrow \mathcal{O}_{t-1} \cup \{o_t\}$ and increment $t$.

This process is summarized in the following pseudocode:

def imagent_loop(P0, I0=None, T_max=5):
    # Initialize current prompt P, current image I, and observation history O.
    P, I, O = P0, I0, []
    for t in range(1, T_max + 1):
        s_t = {"P0": P0, "I0": I0, "P": P, "I": I, "O": O}   # state s_t
        a_t = policy_controller(s_t)          # sample a_t ~ pi_theta(a | s_t)
        if a_t == 'STOP':
            break
        P, I, o = action_fn[a_t](P, I, O)     # execute generation action f_{a_t}
        O.append(o)                           # record self-evaluation output o_t
    return I

No modifications or fine-tuning are made to the backbone; all behavior is induced via prompting and stateful interaction (Wang et al., 14 Nov 2025).

3. Policy Controller and Action Design

The policy controller uses the current state representation $s_t$ (original and current prompt/image, action-observation history) to select the next action from $\mathcal{A}$ via the backbone’s LLM head. Formally:

$a_t \sim \pi_\theta(a \mid s_t)$

Action selection and execution leverage natural language prompts structured for each subroutine, providing a flexible yet template-driven mechanism for agentic decision-making. ImAgent’s action space is hand-designed and includes:

| Action | Description | Output |
|---|---|---|
| Naive Generation/Editing | One-shot generation with $P_t$ | $(P_t, I_{\text{new}}, o = \emptyset)$ |
| Prompt Enhancement w/ CoT | Elaborates $P_t$ via chain-of-thought reasoning | $(P_{t+1}, I_t, o_t)$ |
| Prompt Revision | Compares $I_t$ to $P_t$, rewrites prompt to address gaps | $(P_{t+1}, I_t, o_t)$ |
| Image Detail Refinement | Fixes image artifacts without changing prompt | $(P_t, I_{t+1}, o_t)$ |
| Best-of-N Sampling | Generates $N$ candidates, picks $I_{t+1}$ with max alignment | $(P_t, I_{t+1}, \max_i s_i)$ |
| STOP | Terminates generation if current result is satisfactory | |

All self-evaluation information oto_t is retained in the observation history to inform subsequent policy decisions.
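
Concretely, the controller step above admits a minimal sketch in which the state is rendered into a natural-language instruction and the backbone's text head picks an action. The prompt wording, the backbone.generate_text interface, and the ACTIONS tuple (from the sketch in Section 2) are assumptions for illustration, not the paper's actual templates.

POLICY_PROMPT = (
    "You are controlling an image generation agent.\n"
    "Original prompt: {P0}\n"
    "Current prompt: {P}\n"
    "Evaluation history: {O}\n"
    "Choose exactly one next action from: {actions}.\n"
    "Answer with the action name only."
)

def policy_controller(state, backbone):
    # Serialize the state into a natural-language prompt for the backbone's LLM head;
    # a full implementation would also attach the original and current images.
    prompt = POLICY_PROMPT.format(
        P0=state["P0"], P=state["P"], O=state["O"], actions=", ".join(ACTIONS)
    )
    reply = backbone.generate_text(prompt).strip()
    # Fall back to a safe default if the reply is not a recognized action name.
    return reply if reply in ACTIONS else "naive_generation"

In the loop of Section 2, such a controller would be bound to the backbone handle (e.g., via functools.partial) before being called on each state.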

4. Generation Actions and Self-Evaluation Strategies

ImAgent coordinates its actions as follows:

  1. Naive Generation: Direct call to the backbone’s generative capability using $P_t$; ideal when the original prompt suffices.
  2. Prompt Enhancement with CoT: Employs chain-of-thought reasoning to elaborate the prompt (e.g., decomposing “a cat by a window” into relevant details), then rewrites it for greater specificity and coherence.
  3. Prompt Revision Based on Image Feedback: Compares the current image and prompt, generates critiques of mismatches, and proposes prompt revisions.
  4. Image Detail Refinement: Invokes additional generative refinement (such as localized diffusion) to ameliorate artifacts, improving visual quality without altering the prompt.
  5. Best-of-N Sampling: Produces $N$ variants for $P_t$, computes alignment scores $s_i = \mathrm{AlignScore}(I^{(i)}, P_t)$ internally, and selects the top-aligned image.
  6. STOP: Concludes iteration when the current $(P_t, I_t)$ is judged satisfactory.

Self-evaluation information per action can be numeric (alignment or perceptual quality score) or textual (e.g., description of discrepancies).
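
To make the best-of-N action concrete, a minimal sketch follows; the backbone.generate_image and backbone.score_alignment calls are hypothetical stand-ins for the backbone's generation and self-evaluation interfaces, and the default of four candidates is an assumption.

def best_of_n_sampling(P, backbone, n=4):
    """Generate n candidates for prompt P and keep the best-aligned one."""
    candidates = [backbone.generate_image(P) for _ in range(n)]
    # Self-evaluation: the backbone scores prompt-image alignment for each candidate,
    # standing in for s_i = AlignScore(I^(i), P_t).
    scores = [backbone.score_alignment(img, P) for img in candidates]
    best = max(range(n), key=lambda i: scores[i])
    # Return the unchanged prompt, the selected image, and its score as observation o_t.
    return P, candidates[best], scores[best]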

5. Test-Time Scaling and Training-Free Paradigm

ImAgent is expressly training-free: no additional model weights or gradient-based updates are introduced. Instead, its agentic behavior—action selection, generation, critique—emerges from structured prompting of a single pretrained multimodal backbone. All activities (policy decision, subroutine invocation, and self-evaluation) are natively invoked as compositional tasks within the same model scope. This avoids reliance on multiple external LLMs, VLMs, or customized test-time modules, streamlining memory and compute requirements.

Efficiency is achieved by confining execution to a few test-time steps ($T_{\mathrm{max}}$), with the agent permitted to terminate early (via STOP) if sufficient output quality is achieved. The framework demonstrates test-time latency of a few seconds per example (Wang et al., 14 Nov 2025).

6. Empirical Evaluation

ImAgent is empirically assessed on image generation and editing benchmarks using Bagel and Janus-Pro-7B as backbones. Comparative studies against vanilla backbone use (no agent loop) demonstrate consistent improvements:

| Benchmark (Metric, Backbone) | Vanilla | ImAgent | Relative Improvement |
|---|---|---|---|
| R2I-Bench (Overall, Bagel) | 0.54 | 0.62 | +14.8% |
| WISE (Overall, Bagel) | 0.52 | 0.63 | +21.2% |
| T2I-ReasonBench (Acc, Bagel) | 41.6 | 51.4 | +23.6% |
| RISEBench (Image Edits, Bagel) | 6.1 | 13.1 | +114.8% |
| KRISBench (Bagel) | 63.16 | 67.13 | +6.3% |
| ImgEdit-Bench (Bagel) | 2.89 | 3.15 | +9.0% |
| GEdit-Bench (English/Chinese) | | | +5.5% / +5.2% |

ImAgent matches commercial models (e.g. GPT-4o) on RISEBench and outperforms open-source baselines across editing challenges. It also surpasses commercial Gemini-2.0 on T2I-ReasonBench in overall quality. Improvements are statistically significant across metrics and categories (Wang et al., 14 Nov 2025).

7. Ablation, Limitations, and Prospective Extensions

Ablation studies demonstrate:

  • Self-CoT (vanilla with only chain-of-thought) underperforms ImAgent, especially on GEdit-Bench and RISEBench.
  • Single-action or random policy (with $T_{\max} = 5$) yields lower scores than the full dynamic policy controller, validating the necessity of dynamic, state-aware sequencing.
  • On R2I-Bench, single-action ablations achieve 0.54–0.58, random policy 0.59, while full ImAgent achieves 0.62.

Current limitations include:

  • Restricted, hand-designed action space—future frameworks may autonomously discover or expand subroutines.
  • Overall performance is bottlenecked by the backbone’s generation and understanding limits; uncorrectable or undetectable errors by the backbone cannot be rectified by any agentic iteration.
  • Higher latency compared to one-shot generation, though bounded by efficient controller strategies and early termination.

Future extensions may target richer action spaces, video/3D tasks, and interactive or multi-turn deployments.

ImAgent demonstrates that unified, prompt-driven agentic workflows within a single multimodal foundation model can efficiently and adaptively enhance image generation and editing, closing the gap with bespoke, multi-module pipelines and providing a test-time scalable solution distinct in its operational efficiency and generality (Wang et al., 14 Nov 2025).
