LLM-Interleaved: Modular Multimodal Generation
- LLM-Interleaved is a modular framework that leverages an LLM-based planner to select and invoke various specialized visual tools for accurate image-text generation.
- It decouples high-level language planning from low-level visual synthesis, overcoming the one-tool bottleneck to ensure both factual grounding and creative output.
- The framework employs a hybrid reinforcement learning strategy and parallel test-time scaling to enhance output quality and maintain robust tool accuracy.
LLM-Interleaved (LLM-I) is a dynamic framework for multimodal generation that positions LLMs or multimodal LLMs (MLLMs) as central agentic planners capable of orchestrating a heterogeneous toolkit of specialized visual tools. Rather than requiring a single unified model to synthesize all image-text interleaved outputs, LLM-I reframes the challenge as a tool-use and planning problem where the agent must read textual prompts, select appropriate external resources (e.g., image search, code execution, diffusion-based generation, image editing), and compose structured sequences of text and images. This modular strategy overcomes the “one-tool bottleneck” and supports tasks that demand both factual grounding and creative or programmatic visual generation.
1. Framework Architecture and Core Principles
LLM-I is architected as an agentic, planner-driven pipeline in which the central LLM or MLLM reads the user's multimodal request and outputs an interleaved sequence of text segments and explicit tool calls, formatted as structured tags (e.g., `<imgen>{...}</imgen>`). These tags encode the required operation (search, edit, code execution, diffusion), an image description, the source type, and additional parameters. Upon receiving this output, a downstream dispatcher routes each tag to the corresponding specialized visual module.
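A minimal sketch of this tag-parse-and-dispatch loop, assuming a JSON payload with illustrative `type` and `description` fields (the framework's exact tag schema and tool interfaces are not reproduced here):

```python
import json
import re

# Regex for the structured tool-call tags emitted by the planner.
IMGEN_TAG = re.compile(r"<imgen>(\{.*?\})</imgen>", re.DOTALL)

def dispatch(tag_payload: dict) -> str:
    """Route one parsed tool call to a visual module (stubs for illustration)."""
    tool = tag_payload.get("type")
    prompt = tag_payload.get("description", "")
    if tool == "search":
        return f"[image from web search: {prompt}]"
    if tool == "diffusion":
        return f"[image synthesized by a diffusion model: {prompt}]"
    if tool == "code":
        return f"[figure rendered by sandboxed code execution: {prompt}]"
    if tool == "edit":
        return f"[edited image: {prompt}]"
    raise ValueError(f"unknown tool type: {tool}")

def render_interleaved(planner_output: str) -> str:
    """Replace each <imgen> tag with the artifact returned by its tool."""
    def _sub(match: re.Match) -> str:
        return dispatch(json.loads(match.group(1)))
    return IMGEN_TAG.sub(_sub, planner_output)

# Example planner output containing one factual-image call.
text = ('The Eiffel Tower at night: '
        '<imgen>{"type": "search", "description": "Eiffel Tower illuminated at night"}</imgen>')
print(render_interleaved(text))
```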
The system's toolkit comprises:
- Online image search for factual images (e.g., via commercial image APIs),
- Diffusion-based generators (e.g., Seedream 3.0) for synthetic/creative imagery,
- Code execution modules for procedural visualizations (e.g., Python charting in sandboxed environments),
- Image editing tools (e.g., Seededit 3.0) for post-processing, annotation, or enhancement.
The agent is trained to reason about when and how to invoke each tool, relying on the content and constraints of the prompt and the ongoing multimodal context.
2. Tool Utilization and System Modularity
A chief innovation of LLM-I is the principled separation between high-level language planning and low-level visual synthesis. Unified models, which generate all visuals from a shared latent space, are fundamentally limited to synthetic imagery and struggle with requests that require real-world factuality or data-driven visualization.
LLM-I’s tag-based invocation system enables flexible tool selection, permitting the central agent to:
- Call an image search module for up-to-date images (addressing factuality),
- Delegate to a diffusion generator for imaginative content,
- Execute code for precise, programmatic figures (e.g., data plots),
- Apply an editor for image adjustments.
This decoupling simplifies extensibility: adding a new visual tool does not require retraining the monolithic core. The agent’s outputs remain composable and updateable as new tool capabilities emerge.
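Because tools are addressed by name inside the tag payload, extensibility reduces to registering a new handler with the dispatcher. A minimal sketch, using a hypothetical registry and a made-up `render3d` backend as the added tool:

```python
from typing import Callable, Dict

# Registry mapping tool names to callables; the planner's vocabulary of tool
# types can grow without retraining the language model itself.
TOOL_REGISTRY: Dict[str, Callable[[str], str]] = {
    "search": lambda q: f"[searched image: {q}]",
    "diffusion": lambda q: f"[generated image: {q}]",
    "code": lambda q: f"[programmatic figure: {q}]",
    "edit": lambda q: f"[edited image: {q}]",
}

def register_tool(name: str, handler: Callable[[str], str]) -> None:
    """Add a new visual tool; existing planner outputs remain valid."""
    TOOL_REGISTRY[name] = handler

# Hypothetical extension: a 3D-rendering backend added after deployment.
register_tool("render3d", lambda q: f"[3D render: {q}]")
print(TOOL_REGISTRY["render3d"]("low-poly model of a teapot"))
```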
3. Reinforcement Learning and Hybrid Reward Design
The agentic planner is trained using a composite reinforcement learning (RL) paradigm:
- Rule-based reward ($R_{\text{rule}}$): Enforces explicit output constraints, such as the required number of images $N_{\text{req}}$, penalizing over- or under-generation in proportion to the deviation $|N_{\text{gen}} - N_{\text{req}}|$, scaled by a penalty factor $\lambda$.
- LLM judge reward ($R_{\text{LLM}}$): Uses an external LLM as a critic to score fluency, coherence, and the integration quality of tool-call tags, mapped to a normalized scale.
- MLLM judge reward ($R_{\text{MLLM}}$): After candidate images are generated, a multimodal LLM scores technical/aesthetic quality, semantic fit to surrounding text, and overall contextual relevance.
The total reward is a weighted sum of the three components, $R = w_{\text{rule}} R_{\text{rule}} + w_{\text{LLM}} R_{\text{LLM}} + w_{\text{MLLM}} R_{\text{MLLM}}$.
This hybrid design encourages both structural fidelity and multimodal quality, reducing reward hacking and promoting robust tool invocation (Guo et al., 17 Sep 2025).
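A minimal sketch of the composite reward, assuming normalized judge scores in [0, 1], an illustrative count-deviation penalty for the rule-based term, and made-up weights (the paper's exact formulas and coefficients are not reproduced here):

```python
def rule_reward(n_generated: int, n_required: int, penalty: float = 0.5) -> float:
    """Penalize deviation from the required image count (illustrative form)."""
    return max(0.0, 1.0 - penalty * abs(n_generated - n_required))

def hybrid_reward(n_generated: int, n_required: int,
                  llm_judge_score: float, mllm_judge_score: float,
                  w_rule: float = 0.3, w_llm: float = 0.3, w_mllm: float = 0.4) -> float:
    """Weighted sum of rule-based, LLM-judge, and MLLM-judge rewards."""
    r_rule = rule_reward(n_generated, n_required)
    return w_rule * r_rule + w_llm * llm_judge_score + w_mllm * mllm_judge_score

# Example: three images requested, three produced, strong judge scores.
print(hybrid_reward(n_generated=3, n_required=3,
                    llm_judge_score=0.9, mllm_judge_score=0.85))
```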
4. Benchmarking, Evaluation, and Test-Time Scaling
LLM-I has been evaluated across four multimodal benchmarks: OpenING, ISG, LLMI-Bench, and an in-domain test set.
- Metrics include text fluency/coherence, image-text (IT) alignment, multi-step narrative consistency (MS consistency), image generation quality, and tool accuracy (correct selection and invocation).
- Models such as MLLM-I-32B and LLM-I-30B achieve top scores, approaching 100% tool-call success under text-only constraints and surpassing prior unified and two-stage models.
- A novel test-time scaling (TTS) strategy executes parallel candidate generation:
- Multiple candidate outputs are sampled and structurally checked.
- Tool calls are executed concurrently for each candidate.
- A selector model (LLM or MLLM) then ranks candidates; failed code executions are iteratively refined via feedback.
- A final polishing step ensures the selected output maintains optimal image-text integration.
This process adds a small inference-latency overhead but delivers substantive quality improvements, and is shown to outperform even larger model baselines in reliability and output quality (Guo et al., 17 Sep 2025).
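A simplified sketch of the parallel candidate-and-select loop, with stand-in `generate_candidate`, `is_structurally_valid`, `execute_tools`, and `selector_score` functions in place of the actual planner, dispatcher, and judge models:

```python
import random
from concurrent.futures import ThreadPoolExecutor

# Stand-in components; in the real system these wrap the planner,
# the tool dispatcher, and an LLM/MLLM selector model.
def generate_candidate(prompt: str, seed: int) -> str:
    random.seed(seed)
    return f"{prompt} [draft {seed}, {random.randint(1, 3)} images]"

def is_structurally_valid(candidate: str) -> bool:
    return "[draft" in candidate  # e.g., check that tool tags are well formed

def execute_tools(candidate: str) -> str:
    return candidate.replace("images", "rendered images")

def selector_score(rendered: str) -> float:
    return float(len(rendered))  # a judge model would score quality here

def test_time_scaling(prompt: str, num_candidates: int = 4) -> str:
    """Sample drafts, check structure, execute tool calls concurrently,
    and return the candidate ranked best by the selector."""
    drafts = [generate_candidate(prompt, seed=i) for i in range(num_candidates)]
    valid = [d for d in drafts if is_structurally_valid(d)]
    with ThreadPoolExecutor() as pool:
        rendered = list(pool.map(execute_tools, valid))
    return max(rendered, key=selector_score)

print(test_time_scaling("A short illustrated weather report"))
```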
5. Dataset Construction and Model Backbones
LLM-I training utilizes a uniquely “tool-oriented” dataset:
- Prompts are scaffolded to specify required tool usage, task themes, and strict image-count constraints (with labels: disallowed, unconstrained, exact-$N$, or at-least-one).
- The final dataset comprises approximately 4,000 validated examples, including both text-only and synthetic text-image interleaved samples.
- Four backbone models are employed: Qwen3-4B-Instruct, Qwen3-30B-Instruct (MoE), Qwen2.5-VL-7B, Qwen2.5-VL-32B. Training applies RL algorithms (GRPO, GSPO), scheduled learning rates (with cosine decay), and designated reward trade-off coefficients.
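A sketch of how a constraint-labeled training example might be represented, with hypothetical field names rather than the dataset's actual schema:

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical record layout for a tool-oriented training prompt;
# field names are illustrative, not the paper's schema.
@dataclass
class ToolOrientedPrompt:
    instruction: str                   # task theme and required tool usage
    image_constraint: str              # "disallowed" | "unconstrained" | "exact" | "at_least_one"
    exact_count: Optional[int] = None  # populated only for the "exact" label

    def satisfied_by(self, num_images: int) -> bool:
        """Check whether a generated output meets the image-count constraint."""
        if self.image_constraint == "disallowed":
            return num_images == 0
        if self.image_constraint == "exact":
            return num_images == self.exact_count
        if self.image_constraint == "at_least_one":
            return num_images >= 1
        return True  # "unconstrained"

sample = ToolOrientedPrompt(
    instruction="Write a report on 2024 solar capacity with a bar chart produced via code execution.",
    image_constraint="exact",
    exact_count=1,
)
print(sample.satisfied_by(1))  # True
```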
6. Implications and Future Directions
LLM-I enables agentic planning over heterogeneous multimodal toolkits. The explicit orchestration of external tools allows the system to:
- Address both factual and creative image-text tasks (overcoming the “one-tool” bottleneck).
- Provide rigorous performance via hybrid RL rewards and multi-step scaling.
- Maintain extensibility for future tool modules.
- Support tasks in domains requiring factual images (e.g., real-time information), precise data visualization, or flexible image customization.
- Empirically demonstrate benchmark-leading performance.
A plausible implication is that LLM-I’s modular, planner-driven approach will serve as a template for future multimodal systems tasked with integrating broader toolsets, not only for image-text generation but for structured reasoning, code synthesis, and online grounding.
Table: Core Structural Features of LLM-I Framework
| Component | Role | Notes |
|---|---|---|
| Central Agent | Interleaved planner and tool selector | LLM or MLLM |
| Tool Invocation | Tag-based, structured calls to external modules | `<imgen>{...}</imgen>` |
| Visual Toolkit | Search, diffusion, code, editing | Modular |
| Rewards | Rule-based, LLM judge, MLLM judge | Hybrid RL |
| Test-Time Scaling | Parallel candidate generation + selection | Inference-time |
| Model Backbones | Qwen / Qwen-VL (MoE, MLLM) | 4 variants |
| Dataset | Tool-oriented, constraint-labeled | ~4,000 samples |
LLM-Interleaved (LLM-I) represents a comprehensive agentic solution to the multimodal generation problem, providing flexibility, factual grounding, scalable architecture, and empirically validated performance benefits (Guo et al., 17 Sep 2025).