LLM-Interleaved: Modular Multimodal Generation

Updated 18 September 2025
  • LLM-Interleaved is a modular framework that leverages an LLM-based planner to select and invoke various specialized visual tools for accurate image-text generation.
  • It decouples high-level language planning from low-level visual synthesis, overcoming the one-tool bottleneck to ensure both factual grounding and creative output.
  • The framework employs a hybrid reinforcement learning strategy and parallel test-time scaling to enhance output quality and maintain robust tool accuracy.

LLM-Interleaved (LLM-I) is a dynamic framework for multimodal generation that positions LLMs or multimodal LLMs (MLLMs) as central agentic planners capable of orchestrating a heterogeneous toolkit of specialized visual tools. Rather than requiring a single unified model to synthesize all image-text interleaved outputs, LLM-I reframes the challenge as a tool-use and planning problem where the agent must read textual prompts, select appropriate external resources (e.g., image search, code execution, diffusion-based generation, image editing), and compose structured sequences of text and images. This modular strategy overcomes the “one-tool bottleneck” and supports tasks that demand both factual grounding and creative or programmatic visual generation.

1. Framework Architecture and Core Principles

LLM-I is architected as an agentic, planner-driven pipeline where the central LLM or MLLM reads the user's multimodal request and outputs an interleaved sequence consisting of text segments and explicit tool-calls, formatted as structured tags (e.g., <imgen>{...}</imgen>). These tags encode the required operation (search, edit, code execution, diffusion), image description, source type, and additional parameters. Upon receiving this output, a downstream dispatcher routes each tag to the corresponding specialized visual module.
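
For illustration, a planner response might interleave prose with tool-call tags along the following lines; the JSON field names (source, description, params) are illustrative assumptions, not a schema fixed by the framework description above:

```
The city's skyline has changed dramatically over the past decade.
<imgen>{"source": "search", "description": "aerial photo of the downtown skyline, 2024", "params": {"num_results": 1}}</imgen>
The chart below summarizes how many high-rises were completed each year.
<imgen>{"source": "code", "description": "bar chart of high-rise completions per year, 2015-2024"}</imgen>
```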

The system's toolkit comprises:

  • Online image search for factual images (e.g., via commercial image APIs),
  • Diffusion-based generators (e.g., Seedream 3.0) for synthetic/creative imagery,
  • Code execution modules for procedural visualizations (e.g., Python charting in sandboxed environments),
  • Image editing tools (e.g., Seededit 3.0) for post-processing, annotation, or enhancement.

The agent is trained to reason about when and how to invoke each tool, relying on the content and constraints of the prompt and the ongoing multimodal context.

2. Tool Utilization and System Modularity

A chief innovation of LLM-I is the principled separation between high-level language planning and low-level visual synthesis. Unified models, which generate all visuals from a shared latent space, are fundamentally limited to synthetic imagery and struggle with requests that require real-world factuality or data-driven visualization.

LLM-I’s tag-based invocation system enables flexible tool selection, permitting the central agent to:

  • Call an image search module for up-to-date images (addressing factuality),
  • Delegate to a diffusion generator for imaginative content,
  • Execute code for precise, programmatic figures (e.g., data plots),
  • Apply an editor for image adjustments.

This decoupling simplifies extensibility: adding a new visual tool does not require retraining the monolithic core. The agent’s outputs remain composable and updateable as new tool capabilities emerge.
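
This decoupling can be made concrete with a minimal dispatcher sketch; the tag schema, tool names, and function signatures below are assumptions for illustration, not the paper's actual API:

```python
import json
import re

# Registry of visual tools; adding a capability means registering a function,
# not retraining the planner.
TOOLS = {}

def register(name):
    def wrap(fn):
        TOOLS[name] = fn
        return fn
    return wrap

@register("search")
def image_search(spec):
    # Placeholder: query a commercial image-search API for a factual image.
    return f"<search result: {spec['description']}>"

@register("diffusion")
def diffusion_generate(spec):
    # Placeholder: call a text-to-image diffusion generator for creative imagery.
    return f"<generated image: {spec['description']}>"

@register("code")
def code_execute(spec):
    # Placeholder: run charting code in a sandbox and return the rendered figure.
    return f"<rendered figure: {spec['description']}>"

@register("edit")
def image_edit(spec):
    # Placeholder: apply an editing model for annotation or enhancement.
    return f"<edited image: {spec['description']}>"

TAG = re.compile(r"<imgen>(.*?)</imgen>", re.DOTALL)

def dispatch(planner_output: str) -> str:
    """Replace each <imgen> tool-call tag with the artifact its tool returns."""
    def run(match):
        spec = json.loads(match.group(1))
        return TOOLS[spec["source"]](spec)
    return TAG.sub(run, planner_output)
```

Because each tag is routed through a registry keyed by its source field, new tools slot in alongside existing ones without touching the planner.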

3. Reinforcement Learning and Hybrid Reward Design

The agentic planner is trained using a composite reinforcement learning (RL) paradigm:

  • Rule-based reward ($R_\text{rule}$): Enforces explicit output constraints, such as the required number of images $N_\text{req}$, penalizing over- or under-generation:

$$R_\text{rule} = \begin{cases} \dfrac{N_\text{gen}}{N_\text{req}} & 0 \leq N_\text{gen} \leq N_\text{req} \\[4pt] \max\!\big(0,\; 1 - \alpha (N_\text{gen} - N_\text{req})\big) & N_\text{gen} > N_\text{req} \end{cases}$$

where $\alpha$ is a penalty factor.

  • LLM judge reward ($R_\text{LLM}$): Uses an external LLM as a critic to score fluency, coherence, and the integration quality of tool-call tags, mapped to a normalized scale.
  • MLLM judge reward ($R_\text{mLLM}$): After candidate images are generated, a multimodal LLM scores technical/aesthetic quality, semantic fit to surrounding text, and overall contextual relevance.

The total reward is a weighted sum:

$$R = w_\text{rule}\, R_\text{rule} + w_\text{LLM}\, R_\text{LLM} + w_\text{mLLM}\, \big(R_\text{mLLM} \cdot R_\text{rule}\big)$$

This hybrid design encourages both structural fidelity and multimodal quality, reducing reward hacking and promoting robust tool invocation (Guo et al., 17 Sep 2025).
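
A minimal sketch of this composite reward, assuming the judge scores $R_\text{LLM}$ and $R_\text{mLLM}$ are already normalized to $[0, 1]$ and using illustrative values for $\alpha$ and the weights:

```python
def rule_reward(n_gen: int, n_req: int, alpha: float = 0.5) -> float:
    """Piecewise rule-based reward R_rule on image count (assumes n_req >= 1);
    alpha is an illustrative penalty factor."""
    if n_gen <= n_req:
        return n_gen / n_req
    return max(0.0, 1.0 - alpha * (n_gen - n_req))

def total_reward(n_gen: int, n_req: int, r_llm: float, r_mllm: float,
                 w_rule: float = 0.2, w_llm: float = 0.4, w_mllm: float = 0.4,
                 alpha: float = 0.5) -> float:
    """Weighted hybrid reward; the MLLM term is gated by R_rule so image quality
    only counts when the structural constraints are met. Weight values are
    illustrative, not the paper's coefficients."""
    r_rule = rule_reward(n_gen, n_req, alpha)
    return w_rule * r_rule + w_llm * r_llm + w_mllm * (r_mllm * r_rule)
```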

4. Benchmarking, Evaluation, and Test-Time Scaling

LLM-I has been evaluated across four multimodal benchmarks: OpenING, ISG, LLMI-Bench, and an in-domain test set.

  • Metrics include text fluency/coherence, image-text (IT) alignment, multi-step narrative consistency (MS consistency), image generation quality, and tool accuracy (correct selection and invocation).
  • Models such as MLLM-I-32B and LLM-I-30B achieve top scores, approaching 100% tool-call success under text-only constraints and surpassing prior unified and two-stage models.
  • A novel test-time scaling (TTS) strategy executes parallel candidate generation:
    • Multiple candidate outputs are sampled and structurally checked.
    • Tool calls are executed concurrently for each candidate.
    • A selector model (LLM or MLLM) then ranks candidates; failed code executions are iteratively refined via feedback.
    • A final polishing step ensures the selected output maintains optimal image-text integration.

This process adds a modest inference-latency overhead but delivers substantive quality improvements, outperforming even larger model baselines in reliability and output quality (Guo et al., 17 Sep 2025).
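
A simplified sketch of this parallel candidate loop is shown below; the sampling, structural-checking, tool-execution, and judging steps are passed in as caller-supplied callables (hypothetical interfaces, not the released implementation):

```python
from concurrent.futures import ThreadPoolExecutor

def test_time_scale(prompt, sample, check_structure, execute_tools, judge,
                    n_candidates: int = 4):
    """Parallel test-time scaling: sample -> filter -> render -> select.

    Hypothetical caller-supplied interfaces:
      sample(prompt)        -> interleaved plan with <imgen> tags
      check_structure(plan) -> bool (tags well-formed, counts respected)
      execute_tools(plan)   -> fully rendered image-text output
      judge(prompt, output) -> scalar quality score
    """
    # 1. Sample several candidate interleaved plans.
    candidates = [sample(prompt) for _ in range(n_candidates)]

    # 2. Keep candidates whose tool-call tags pass structural checks
    #    (fall back to the first sample if none pass, to avoid an empty pool).
    valid = [c for c in candidates if check_structure(c)] or candidates[:1]

    # 3. Execute each surviving candidate's tool calls concurrently.
    with ThreadPoolExecutor() as pool:
        rendered = list(pool.map(execute_tools, valid))

    # 4. The selector ranks fully rendered candidates; in the full pipeline,
    #    failed code executions are refined via feedback before ranking, and a
    #    final polishing pass tightens image-text integration.
    return max(rendered, key=lambda out: judge(prompt, out))
```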

5. Dataset Construction and Model Backbones

LLM-I training utilizes a uniquely “tool-oriented” dataset:

  • Prompts are scaffolded to specify required tool usage, task themes, and strict image-count constraints (with labels: disallowed, unconstrained, exact-$n$, or at-least-one); a schematic example record is sketched after this list.
  • The final dataset comprises approximately 4,000 validated examples, including both text-only and synthetic text-image interleaved samples.
  • Four backbone models are employed: Qwen3-4B-Instruct, Qwen3-30B-Instruct (MoE), Qwen2.5-VL-7B, Qwen2.5-VL-32B. Training applies RL algorithms (GRPO, GSPO), scheduled learning rates (with cosine decay), and designated reward trade-off coefficients.
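
For concreteness, a single record in such a tool-oriented dataset might look roughly like the following; the field names and values are illustrative assumptions, not the released schema:

```python
example = {
    "prompt": "Write a short report on 2024 solar-capacity growth with exactly two figures.",
    "theme": "energy / data reporting",
    "required_tools": ["search", "code"],              # tools the prompt is scaffolded to exercise
    "image_constraint": {"label": "exact-n", "n": 2},  # disallowed | unconstrained | exact-n | at-least-one
    "reference": "...",                                # text-only or synthetic interleaved reference output
}
```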

6. Implications and Future Directions

LLM-I enables agentic planning over heterogeneous multimodal toolkits. The explicit orchestration of external tools allows the system to:

  • Address both factual and creative image-text tasks (overcoming the “one-tool” bottleneck).
  • Provide rigorous performance via hybrid RL rewards and multi-step scaling.
  • Maintain extensibility for future tool modules.
  • Support tasks in domains requiring factual images (e.g., real-time information), precise data visualization, or flexible image customization.
  • Empirically demonstrate benchmark-leading performance.

A plausible implication is that LLM-I’s modular, planner-driven approach will serve as a template for future multimodal systems tasked with integrating broader toolsets, not only for image-text generation but for structured reasoning, code synthesis, and online grounding.

Table: Core Structural Features of LLM-I Framework

| Component | Role | Notes |
|---|---|---|
| Central Agent | Interleaved planner and tool selector | LLM or MLLM |
| Tool Invocation | Tag-based, structured calls to external modules | `<imgen>{...}</imgen>` |
| Visual Toolkit | Search, diffusion, code execution, editing | Modular |
| Rewards | Rule-based, LLM judge, MLLM judge | Hybrid RL |
| Test-Time Scaling | Parallel candidate generation + selection | Inference-time |
| Model Backbones | Qwen / Qwen-VL (MoE, MLLM) | 4 variants |
| Dataset | Tool-oriented, constraint-labeled | ~4,000 samples |

LLM-Interleaved (LLM-I) represents a comprehensive agentic solution to the multimodal generation problem, providing flexibility, factual grounding, scalable architecture, and empirically validated performance benefits (Guo et al., 17 Sep 2025).

