AgentDrive-Gen: Agentic Text-to-Image System
- AgentDrive-Gen is an agentic multimodal reasoning system that decouples visual understanding and image generation through iterative chains-of-thought, tool invocation, judgment, and reflection.
- It integrates a fine-tuned multimodal language model with external diffusion-based image generators, optimized via supervised fine-tuning and reinforcement learning to reduce errors and improve output quality.
- Empirical evaluations demonstrate significant gains in metrics such as GenEval++ and WISE, showcasing the system’s modularity, interpretability, and scalability in multi-turn reasoning tasks.
AgentDrive-Gen is an agentic multimodal reasoning system for text-to-image generation, introduced as "GenAgent" by Kaixun Jiang et al. Its core innovation is to decouple visual understanding from generation via an agentic framework: a multimodal large language model (MLLM) conducts iterative reasoning, tool invocation, judgment, and reflection, while specialized image-generation models serve as black-box "tools." This interaction enables autonomous, multi-turn chains-of-thought that improve output fidelity, interpretability, and scalability, in contrast to pipeline-constrained modular systems and monolithic end-to-end generative models (Jiang et al., 26 Jan 2026).
1. System Architecture and Inference Dynamics
AgentDrive-Gen comprises two principal classes of components:
- Policy Model: A multimodal LLM (e.g., Qwen2.5-VL-7B) fine-tuned for agentic orchestration.
- External Image Generators: Black-box diffusion models (notably FLUX.1-dev and Qwen-Image) acting as parameterizable tools.
A typical inference session proceeds as follows:
- User Query: The system receives a text prompt.
- Thought & Tool Invocation: The policy model generates an initial chain-of-thought and a corresponding tool call that carries the prompt for the generator.
- Generation: The chosen image generator produces an image in response to the tool call.
- Judgment: The agent inspects the generated image against the user query and emits a judgment trace. If the image is satisfactory, the model terminates; otherwise it continues with a reflection step and a refined prompt.
- Reflection & Refinement: If required, the refined prompt is sent to the generator to obtain a new image. The process iterates ("think → call → judge → reflect") for up to nₘₐₓ rounds (typically 3).
A trajectory comprises the full sequence of thoughts, tool calls, generated images, judgments, and reflections, and it terminates when the agent issues a judgment that the image is satisfactory. This design yields a highly modular and interpretable workflow, with the agent capable of dynamic, multi-turn reasoning and adaptive resource allocation (Jiang et al., 26 Jan 2026).
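The inference loop described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the names `Round`, `run_agent`, `think`, `generate`, and `judge` are hypothetical, images are represented by plain strings, and the policy model's thinking and judging roles are split into two callables purely for readability (in the described system a single policy model plays both roles).

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Round:
    thought: str      # chain-of-thought (round 1) or reflection (later rounds)
    prompt: str       # prompt carried by the tool call
    image: str        # generator output; a plain string stands in for an image
    judgment: str     # judgment trace for this image
    satisfied: bool   # True if the judgment accepts the image


def run_agent(
    query: str,
    think: Callable[[str, str], Tuple[str, str]],
    generate: Callable[[str], str],
    judge: Callable[[str, str], Tuple[str, bool]],
    n_max: int = 3,
) -> List[Round]:
    """Run think -> call -> judge -> reflect for at most n_max rounds."""
    trajectory: List[Round] = []
    feedback = ""  # empty in round 1, then the previous judgment
    for _ in range(n_max):
        thought, prompt = think(query, feedback)
        image = generate(prompt)
        judgment, satisfied = judge(query, image)
        trajectory.append(Round(thought, prompt, image, judgment, satisfied))
        if satisfied:
            break  # a satisfied judgment ends the trajectory early
        feedback = judgment
    return trajectory


# Toy stand-ins (hypothetical) so the loop can be exercised without any model.
def toy_think(query: str, feedback: str) -> Tuple[str, str]:
    prompt = query if not feedback else f"{query}, {feedback}"
    return ("plan the prompt", prompt)


def toy_generate(prompt: str) -> str:
    return f"image<{prompt}>"


def toy_judge(query: str, image: str) -> Tuple[str, bool]:
    ok = "red" in image
    return ("looks right" if ok else "make the cube red", ok)


trajectory = run_agent("a cube on a table", toy_think, toy_generate, toy_judge)
```

With these toy stand-ins the first image is rejected and the second, generated from the refined prompt, is accepted, so the trajectory holds two rounds.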
2. Training Methodologies: Supervised Fine-Tuning and Agentic RL
AgentDrive-Gen employs a two-stage agentic training pipeline:
(a) Supervised Fine-Tuning (SFT)
- Purpose: SFT addresses the problems observed with an off-the-shelf policy model: a high initial tool-call formatting error rate (13.4%), suboptimal reflection, and overly brief revisions.
- Trajectory Construction: Data is synthesized by leveraging a strong teacher MLLM (Qwen3-VL-235B-A22B-Thinking) to distill one-round chains-of-thought, generate images via FLUX.1-dev, and instruct explicit judgment. When failures occur, Gemini 2.5 Pro provides high-quality second-round refinements, filtered to retain trajectories with verified improvement, resulting in approximately 32,000 two-round instances.
- Optimization: Only tokens produced by the policy model are included in the cross-entropy loss. SFT reduces tool-call errors to 0.35% and increases prompt richness and reasoning-trace length.
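The token-level masking in the optimization step can be illustrated with a small, framework-free sketch. It assumes the per-token log-probabilities have already been computed and that a boolean mask marks which tokens the policy model produced (as opposed to the user query or tool outputs); the function name and inputs are hypothetical, not taken from the paper.

```python
import math
from typing import Sequence


def masked_cross_entropy(
    token_log_probs: Sequence[float],
    is_policy_token: Sequence[bool],
) -> float:
    """Mean negative log-likelihood over policy-produced tokens only."""
    if len(token_log_probs) != len(is_policy_token):
        raise ValueError("log-probs and mask must have the same length")
    # Tokens from the user query or from tool outputs are skipped entirely.
    kept = [-lp for lp, keep in zip(token_log_probs, is_policy_token) if keep]
    if not kept:
        return 0.0
    return sum(kept) / len(kept)


# Three tokens; the middle one is a tool-output token and is masked out.
log_probs = [math.log(0.5), math.log(0.25), math.log(0.1)]
mask = [True, False, True]
loss = masked_cross_entropy(log_probs, mask)
```

Here the loss averages only the first and third tokens, so the masked token has no influence on the gradient signal in a real training setup.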
(b) Agentic Reinforcement Learning (RL)
- Objective: The RL stage optimizes a group-normalized, clipped PPO-style objective (GRPO) over sampled trajectories. Rewards are hybrid, comprising:
- Pointwise Reward: 0.7 if the final image fully satisfies the prompt conditions, as judged by an MLLM evaluator, and 0 otherwise.
- Format Penalty: -0.2 if tool calls are misformatted, 0 otherwise.
- Pairwise Reward: 0.3 if consecutive generator outputs within a trajectory show consistent improvement (MLLM-judged).
- Combined Reward: The three terms are combined into a single trajectory-level reward, with the combination adapted according to whether the final image is judged satisfactory.
- Trajectory Resampling: Per prompt, twelve rollouts are generated and eight are uniformly subsampled across possible round counts, promoting balanced exploration of 1- through 3-turn interactions.
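A rough sketch of the reward and sampling logic described in this stage is given below. Several points are assumptions rather than details from the source: the combined reward is taken to be a plain sum of the three terms (the satisfaction-dependent adaptation is not modelled), the pairwise term is taken to be 0 when there is no consistent improvement, the advantage is the usual within-group standardization used by GRPO, and "uniformly subsampled across round counts" is interpreted as cycling over round counts. All function names are hypothetical.

```python
import random
from collections import defaultdict
from statistics import mean, pstdev
from typing import Dict, List, Tuple


def trajectory_reward(satisfied: bool, improved: bool, well_formatted: bool) -> float:
    """ASSUMPTION: plain sum of pointwise, pairwise, and format terms."""
    pointwise = 0.7 if satisfied else 0.0
    pairwise = 0.3 if improved else 0.0
    format_penalty = 0.0 if well_formatted else -0.2
    return pointwise + pairwise + format_penalty


def group_normalized_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Standardize rewards within one prompt's rollout group (GRPO-style)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def resample_balanced(
    rollouts: List[Tuple[int, int]], k: int, rng: random.Random
) -> List[Tuple[int, int]]:
    """Pick up to k rollouts, cycling over round counts (one interpretation).

    Each rollout is a (rollout_id, num_rounds) pair.
    """
    buckets: Dict[int, List[Tuple[int, int]]] = defaultdict(list)
    for rollout in rollouts:
        buckets[rollout[1]].append(rollout)
    for bucket in buckets.values():
        rng.shuffle(bucket)
    picked: List[Tuple[int, int]] = []
    while len(picked) < k and any(buckets.values()):
        for rounds in sorted(buckets):
            if buckets[rounds] and len(picked) < k:
                picked.append(buckets[rounds].pop())
    return picked
```

For example, with twelve rollouts split 6/4/2 across one, two, and three rounds, `resample_balanced(rollouts, 8, random.Random(0))` keeps three one-round, three two-round, and two three-round rollouts, which is more balanced than the raw sample.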
3. Generalization Across Image Generators
The agent-tool interface is deliberately agnostic:
- Training Tool: FLUX.1-dev (8-step distilled diffusion model).
- Stronger Test-Time Tool: Qwen-Image, providing higher image fidelity with no agent retraining needed.
- Ablation Studies: When substituting FLUX.1-dev with the smaller Sana1.5, substantial improvements remained, while upgrading to Qwen-Image yielded additional performance gains. These results confirm that the policy model learns generator-invariant reasoning and control policies, scaling effectively with tool quality (Jiang et al., 26 Jan 2026).
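One way to picture the generator-agnostic interface is a structural type that any image generator can satisfy. The sketch below is illustrative only: `ImageGenerator`, `StubGenerator`, and `call_tool` are hypothetical names, and the stub returns bytes derived from the prompt instead of calling a real diffusion model.

```python
from typing import Protocol


class ImageGenerator(Protocol):
    """Anything with a name and a generate(prompt) method counts as a tool."""

    name: str

    def generate(self, prompt: str) -> bytes:
        ...


class StubGenerator:
    """Hypothetical stand-in for a real generator such as a diffusion model."""

    def __init__(self, name: str) -> None:
        self.name = name

    def generate(self, prompt: str) -> bytes:
        # A real tool would return encoded image data here.
        return f"{self.name}:{prompt}".encode("utf-8")


def call_tool(generator: ImageGenerator, prompt: str) -> bytes:
    """The agent side only depends on the interface, not on the backend."""
    return generator.generate(prompt)


training_tool = StubGenerator("train-time-generator")
test_tool = StubGenerator("stronger-test-time-generator")
outputs = [call_tool(tool, "a red cube") for tool in (training_tool, test_tool)]
```

Because `call_tool` never inspects the backend, swapping the generator at test time requires no change on the agent side, which mirrors the retraining-free upgrade described above.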
4. Empirical Evaluation and Metrics
Extensive benchmarks demonstrate the quantitative impact of the agentic paradigm.
| Metric | FLUX.1-dev (base) | +GenAgent w/RL | +GenAgent w/Qwen-Image |
|---|---|---|---|
| GenEval++ | 0.325 | 0.561 (+0.236) | 0.725 |
| WISE | 0.55 | 0.69 (+0.14) | 0.72 |
- GenEval++: Instruction-following, compositional and spatial reasoning.
- WISE: Knowledge-grounded categories (cultural, science, spatial, temporal).
- Imagine: Creative, surrealistic attributes.
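The gains from the base generator to the RL-trained agent can be read either as absolute score differences or as relative improvements, and the two differ considerably. The short calculation below uses only the table values; the helper name `gains` is illustrative.

```python
from typing import Dict, Tuple

# (base FLUX.1-dev score, score with the RL-trained agent), from the table.
scores: Dict[str, Tuple[float, float]] = {
    "GenEval++": (0.325, 0.561),
    "WISE": (0.55, 0.69),
}


def gains(base: float, improved: float) -> Tuple[float, float]:
    """Return (absolute gain, relative gain as a fraction of the base)."""
    absolute = improved - base
    relative = absolute / base
    return absolute, relative


summary = {metric: gains(base, improved) for metric, (base, improved) in scores.items()}
```

The absolute gains are 0.236 on GenEval++ and 0.14 on WISE, which correspond to relative improvements of roughly 72.6% and 25.5% over the base generator.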
Ablation across training stages (e.g., base policy, +SFT, +RL with/without pairwise) confirms that both SFT and RL, particularly the pairwise reward, are essential for the best multi-turn improvement. With nₘₐₓ = 2, performance approaches GPT-4o levels (GenEval++ 0.739) (Jiang et al., 26 Jan 2026).
5. Test-Time Scaling and Task Adaptivity
- Scaling Over Rounds: Each round of iterative reasoning and refinement contributes a measurable improvement: round 1 yields the largest boost, round 2 adds further gains, and round 3 delivers diminishing returns. Two rounds give the best cost-benefit tradeoff in most cases.
- Task-Adaptive Reasoning: The agent dynamically modulates its chain-of-thought traces: employing fact verification and color correction in fine-grained edits, localized mask-based reasoning for partial errors, and creativity-balancing for highly ambiguous prompts. In out-of-distribution settings (e.g., GSM8K visual math), AgentDrive-Gen adapts its strategies accordingly.
6. Key Paradigms and Implications
AgentDrive-Gen exemplifies a design where reasoning (agent) and domain-specific generation (tool) are decoupled. Major advantages and methodological insights include:
- Modularity: Upgrading the image generator directly translates to improved performance.
- Interpretability: Multi-modal chain-of-thought and judgment traces render the agent's decision process transparent.
- Scalability: The framework consistently leverages multi-turn reasoning for incremental output enhancement across diverse visual, compositional, and creative tasks.
This paradigm underscores the advantage of treating generator models as callable, upgradable modules and harnessing LLM-driven agentic reasoning for actionable, interpretable, and adaptive multimodal intelligence (Jiang et al., 26 Jan 2026).