LLM-I: LLMs are Naturally Interleaved Multimodal Creators (2509.13642v1)
Abstract: We propose LLM-Interleaved (LLM-I), a flexible and dynamic framework that reframes interleaved image-text generation as a tool-use problem. LLM-I is designed to overcome the "one-tool" bottleneck of current unified models, which are limited to synthetic imagery and struggle with tasks requiring factual grounding or programmatic precision. Our framework empowers a central LLM or MLLM agent to intelligently orchestrate a diverse toolkit of specialized visual tools, including online image search, diffusion-based generation, code execution, and image editing. The agent is trained to select and apply these tools proficiently via a Reinforcement Learning (RL) framework that features a hybrid reward system combining rule-based logic with judgments from LLM and MLLM evaluators. Trained on a diverse new dataset using four different model backbones, LLM-I demonstrates state-of-the-art performance, outperforming existing methods by a large margin across four benchmarks. We also introduce a novel test-time scaling strategy that provides further performance gains. Project Page: https://github.com/ByteDance-BandAI/LLM-I.
Top Community Prompts
Explain it Like I'm 14
What is this paper about?
This paper introduces a new way for AI to create combined stories made of text and images, where the text and pictures appear in a back‑and‑forth sequence (interleaved). Instead of making the AI do everything inside one giant model, the authors treat the AI like a smart “project manager” that knows when to call the right tool for the job—like searching for a real photo, generating a creative picture, making a chart with code, or editing an image. They call this framework LLM‑Interleaved (LLM‑I).
What questions were the researchers asking?
In simple terms, they asked:
- Can an AI do a better job creating mixed text-and-image stories if it learns to use different visual tools, instead of relying on just one?
- Can we train the AI to decide which tool to use, when to use it, and how many images to include?
- Will this tool‑using approach make the results more accurate, more informative, and better aligned with the text?
How did they do it?
The agent and its toolbox
Think of the AI like a writer making a blog post or report. A good writer doesn’t draw every picture from scratch. They:
- Search online for real photos,
- Use a graphics program to create charts,
- Generate creative art if needed,
- Edit images to highlight something important.
LLM‑I does the same, using four tools:
- Online image search (to find real, factual photos, like a picture of the Eiffel Tower)
- Diffusion-based image generation (to create new, imaginative images that don’t exist in real life)
- Code execution (to make accurate charts and graphs from data using Python)
- Image editing (to modify an existing image—crop it, add annotations, change colors, etc.)
How the agent calls tools
When the AI writes its response, it inserts a special “tag” wherever an image should appear. This tag tells the system which tool to use and what to do. It looks like this:
```
<imgen>{"source": "search | diffusion | code | edit",
        "description": "short title",
        "params": { ... }}</imgen>
```
- For "search", params includes a query for the web image search.
- For "diffusion", params includes a prompt describing the image to generate.
- For "code", params includes code that draws a chart or visualization.
- For "edit", params includes which previous image to edit (by index) and an edit prompt.
A parser reads these tags, runs the right tool, and replaces the tag with the actual image in the final document.
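To make this concrete, here is a minimal Python sketch of such a parser. The regular expression, the JSON field names, and the example response are illustrative assumptions based on the description above, not the paper's released code.

```python
import json
import re

# Matches <imgen>{...}</imgen> tags; the field names below follow the
# description in this summary and may differ from the released implementation.
TAG_PATTERN = re.compile(r"<imgen>(\{.*?\})</imgen>", re.DOTALL)

def extract_tool_calls(response_text):
    """Parse every <imgen> tag into a dict with source, description, and params."""
    calls = []
    for match in TAG_PATTERN.finditer(response_text):
        call = json.loads(match.group(1))
        assert call["source"] in {"search", "diffusion", "code", "edit"}
        calls.append(call)
    return calls

# Example: a response containing one search call and one code call.
response = (
    "The Eiffel Tower at night: "
    '<imgen>{"source": "search", "description": "Eiffel Tower at night", '
    '"params": {"query": "Eiffel Tower night photo"}}</imgen> '
    "Here is the visitor trend: "
    '<imgen>{"source": "code", "description": "Visitors per year", '
    '"params": {"code": "import matplotlib.pyplot as plt  # ..."}}</imgen>'
)

for call in extract_tool_calls(response):
    print(call["source"], "->", call["description"])
# search -> Eiffel Tower at night
# code -> Visitors per year
```

In the real system, each parsed call would be dispatched to the matching tool and the tag replaced with the returned image.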
Training with rewards (Reinforcement Learning)
They use Reinforcement Learning (RL), which is like giving points to the AI when it does things well. LLM‑I gets three kinds of rewards:
1) Rule-based reward: Did it follow the rules? For example, if the task says “exactly 2 images”, did it include exactly two? Did it format the tags correctly?
2) LLM judge: A separate LLM scores how well the text flows and whether the tool calls (tags) make sense in the story.
3) MLLM judge: A multimodal model (one that understands both text and images) checks if the images look good, match the surrounding text, and help achieve the goal of the task.
The final score is a mix of all three. Importantly, image quality only counts if the AI first followed the basic rules (like the right number of images). This keeps the AI from “cheating” by skipping images to avoid mistakes.
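As a rough illustration of how the three signals might be combined, here is a hedged Python sketch. The weights, score ranges, and exact gating are assumptions inferred from the description above, not the authors' formula.

```python
def combined_reward(r_rule, r_llm, r_mllm,
                    w_rule=0.3, w_llm=0.3, w_mllm=0.4):
    """Rough sketch of the hybrid reward described above.

    r_rule  : deterministic score for following format and image-count rules (0 to 1)
    r_llm   : LLM-judge score for text quality and tool-call logic (0 to 1)
    r_mllm  : MLLM-judge score for image quality and text-image alignment (0 to 1)

    The multimodal score is gated by the rule score, so skipping required
    images cannot earn credit for "good" images. The weight values here are
    placeholders, not the paper's settings.
    """
    gated_mllm = r_rule * r_mllm
    return w_rule * r_rule + w_llm * r_llm + w_mllm * gated_mllm
```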
A training dataset designed for tool use
They built a dataset of around 4,000 examples where the prompts strongly imply which tool and how many images should be used—but without saying it directly. This forces the AI to think and choose. The data includes both text-only inputs and cases with input images (so the edit tool gets used). All samples are checked with strong validators to ensure quality.
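For a sense of what such a training item might look like, here is an invented example; the field names and the prompt are assumptions for illustration, since this summary does not give the dataset's exact schema.

```python
# Invented example in the spirit of the dataset description: the prompt
# implies "code" plus exactly two images without naming either explicitly.
sample = {
    "prompt": (
        "Write a short report on how a city's average summer temperature "
        "changed from 2000 to 2020, including a line chart of the yearly "
        "trend and a bar chart comparing the two decades."
    ),
    "input_images": [],          # some samples instead provide images to edit
    "image_num_constraint": 2,   # checked by the rule-based reward
}
```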
Test-time scaling (trying multiple answers, picking the best)
To make results even better, they add a “test-time scaling” process:
- The AI writes several complete answers with tool calls.
- Broken tool calls are filtered out.
- A selector model picks the top candidates.
- For each candidate, they improve images (e.g., try both search and diffusion and pick the better match) and fix code if it fails (by retrying with helpful error feedback).
- A polisher model cleans up the final text-and-image flow.
- The best polished answer is returned.
This is like drafting multiple versions of a report, repairing any broken parts, and choosing the best one.
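Below is a hedged Python sketch of this best-of-n loop, with the agent, selector, image refiner, and polisher abstracted as function parameters; the names and counts are illustrative assumptions, not the paper's settings.

```python
def test_time_scaling(prompt, agent, selector, refine_images, polisher,
                      n_candidates=8, k=2):
    """Sketch of the best-of-n pipeline described above.

    agent, selector, refine_images, and polisher stand in for model and tool
    calls; n_candidates and k are illustrative values, not the paper's.
    """
    # 1. Sample several complete interleaved drafts with tool calls.
    #    Each draft is assumed to be a dict with a "tool_calls_ok" flag.
    drafts = [agent(prompt) for _ in range(n_candidates)]

    # 2. Filter out drafts whose tool calls failed to parse or execute.
    valid = [d for d in drafts if d.get("tool_calls_ok")]

    # 3. Keep the top-k candidates according to the selector model.
    top_k = sorted(valid, key=selector, reverse=True)[:k]

    # 4. Improve each candidate's images (e.g. try both search and diffusion,
    #    retry failing code with error feedback), then polish the final flow.
    polished = [polisher(refine_images(d)) for d in top_k]

    # 5. Return the polished answer the selector rates highest.
    return max(polished, key=selector)
```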
What did they find and why is it important?
- The tool‑using agent (LLM‑I) beat other methods on several benchmarks, including tough tests that require real photos, precise charts, and good text‑image alignment. In short, it set a new state of the art.
- It was very good at using tools correctly (near‑perfect tool success in some setups), and the test‑time scaling made it even better—sometimes enough to beat much larger models.
- When they removed parts of the reward system, performance dropped, especially without the rule-based reward (the AI would start to ignore image requirements). This shows each reward plays an important role.
- When they forced the AI to use only one tool (like only search or only diffusion), performance got worse. Having the full toolbox—and knowing which tool to pick—is crucial.
Why this matters:
- Many real tasks need different kinds of images: real photos (for facts), generated art (for creativity), and charts (for data). A single “all-in-one” image model can’t handle all of these equally well.
- LLM‑I’s “proficient tool user” approach creates smarter, more useful multimodal reports that are accurate, creative, and aligned with the text.
What does this mean for the future?
This work shows that LLMs can naturally act like thoughtful creators when given the right tools and training. Instead of trying to shove every skill into one giant model, we can build AI agents that:
- Decide which specialized tool to use,
- Adapt to new tools without retraining everything,
- Produce richer, more reliable text-and-image content.
This could improve many applications:
- Homework help and paper guides with correct charts and real photos,
- News explainers that mix trusted images with clear visuals,
- Professional reports that combine data analysis, diagrams, and narrative.
Overall, LLM‑I points to a future where AI is less like a “know‑it‑all” and more like a great teammate—one that reasons well and uses the right tools to get the job done.
Knowledge Gaps
Below is a concise, actionable list of the paper’s knowledge gaps, limitations, and open questions that future work could address.
- Dataset scale and diversity: Only ~4k RL training items; unclear if this is sufficient for robust tool selection and interleaved generation across diverse domains and styles.
- Synthetic prompt bias: Prompts are auto-generated (Gemini 2.5 Pro) and may encode biases, simplifying assumptions, or narrow instructional styles; limited evidence of human-authored or user-sourced instructions.
- Language coverage: All data and evaluations appear English-centric; generalization to non-English or code-mixed inputs is untested.
- Modality scope: Framework is text–image only; extension to video, audio, 3D, or interactive visualizations is not demonstrated or evaluated.
- Toolset coverage: Only four tool types (search, diffusion, code, edit); unclear how easily the agent scales to new tools (OCR, diagramming, rendering engines, GIS, C2PA provenance checks, retrieval-augmented captioners) without retraining.
- Zero-/few-shot tool addition: No experiment showing the agent can operationalize a previously unseen tool from schema-only or minimal demonstrations.
- Factual grounding of retrieved images: No metric validating that retrieved images truly depict the claimed entities/events, nor source attribution or citation evaluation.
- Copyright and licensing: No discussion of licensing, usage rights, or attribution for web images; unclear handling of watermarks and content provenance.
- Safety filters: Absent evaluation of NSFW/violent content filtering for both retrieved and generated images; no toxicity or sensitive content audits.
- Deepfake/edit risks: Image editing capability could enable harmful manipulations; no safeguards, watermarking, or provenance-preserving mechanisms are described.
- Privacy and data leakage: Web queries may expose sensitive user data; code execution may risk data exfiltration; sandbox egress policies and privacy protections are unspecified.
- Code sandbox security: Limited details on sandbox isolation, time/memory limits, filesystem/network egress, package trust, and mitigation of adversarial inputs and infinite loops.
- Parser robustness: The system relies on a custom <imgen>{...}</imgen> tag and parsing; resilience to malformed tags, prompt injection, or adversarial schema deviations is untested.
- Tool-call failure modes: Aside from code auto-repair, no systematic analysis of failure recovery for search (dead links, outages), diffusion (time-outs), or edit tools; no retry/backoff policies evaluated.
- Determinism and reproducibility: Online search and diffusion are non-deterministic; no caching/seeding strategy is reported; repeatability of benchmark scores over time is uncertain.
- Judge bias and circularity: Rewards depend on LLM/MLLM judges (Qwen-family) closely related to the fine-tuned backbones, risking bias, gaming, or family-overlap advantages; the judges' rating rubric prompts are not disclosed, and inter-judge consistency is not reported.
- Reward design sensitivity: Fixed weights (w_rule, w_LLM, w_mLLM) and penalty factor α=0.3 are not ablated; robustness to these hyperparameters is unknown.
- Reward formula clarity: The piecewise definition of R_rule appears malformed/ambiguous in the text; exact implementation details (e.g., under-generation partial credit) need clarification.
- Multiplicative gating effect: R_mLLM is multiplied by R_rule; this may suppress learning signal for image-text quality when image count is incorrect; alternatives (e.g., soft gating, curriculum) are unexplored.
- Sample efficiency and variance: Training steps, run-to-run variance, and compute costs for RL with external judges/tools are not reported; stability across seeds is unknown.
- Test-time scaling confounds: Selector/polisher uses a strong external MLLM (Qwen2.5-VL-72B); improvements may largely stem from this bigger model rather than the base agent; fairness and ablations across different selectors are missing.
- Cost–quality trade-offs: Only coarse latency is reported; no systematic Pareto analysis of k, number of candidates, image samples per tool, or judge calls versus quality and cost.
- Human evaluation rigor: Human eval details (annotator count, expertise, inter-annotator agreement, confidence intervals) are absent; the sample size for human eval is unclear.
- LLMI-Bench scale and generality: The new benchmark has only 30 samples; potential overfitting and limited coverage of real-world variability; protocols to prevent tuning to these items are not described.
- Objective criteria reliability: LLMI-Bench uses GPT-4o to score per-rule; robustness of these rule checks, false positives/negatives, and sensitivity to prompt phrasing are not audited.
- Baseline transparency: Inclusion of “GPT-5 wTool” lacks methodological transparency; details on configuration, toolset parity, and evaluation settings are insufficient for fair comparison.
- Attribution and citations in outputs: Retrieved images’ URLs, metadata, or citations are not embedded in the generated reports; transparency and traceability are lacking.
- Time-sensitive knowledge: How the system handles recency (e.g., breaking events) and contradicting sources is not evaluated; no temporal grounding metrics are provided.
- Multi-image consistency: Limited analysis of entity/style consistency across multiple images in a sequence, especially for narratives requiring persistent characters or scenes.
- Control over aesthetics and layout: No evaluation of stylistic control (e.g., consistent color palettes, design systems) or document-level layout constraints and pagination.
- Robustness to adversarial prompts: No tests for prompt injection, tool-parameter attacks (e.g., escaping JSON), or image-based attacks against MLLM judges.
- Generalization across tool APIs: The approach is coupled to specific APIs (SERP, Seedream 3.0, Seededit 3.0); portability to alternative providers and API schema drift is untested.
- Multi-turn interaction: The framework is single-pass with embedded tags; it does not support iterative, user-in-the-loop refinement or revision workflows typical for report authoring.
- Error analysis: Limited breakdown of common failure modes (wrong tool choice, incorrect image counts, misaligned edits, faulty charts); actionable diagnostics are missing.
- Broader impacts and sustainability: No discussion of environmental cost of RL with external tool/judge calls, nor cost-aware training/inference policies.
Glossary
- Agentic planner: An LLM/MLLM configured to plan and decide how to use external tools during generation. "we introduce a flexible and dynamic framework where an LLM or MLLM serves as an agentic planner."
- Autoregressive: A modeling approach that generates outputs one token at a time conditioned on previous tokens. "integrate an autoregressive model with a diffusion model"
- Clipping ratios: Bounds used in policy-gradient RL to stabilize updates by limiting the magnitude of policy changes. "For GSPO, the clipping ratios are set to 3e-4 (low) and 4e-4 (high)."
- Cosine learning rate scheduler: A schedule that decays the learning rate following a cosine curve. "we use a batch size of 32 with a cosine learning rate scheduler where the initial learning is set to 1e-6, minimum learning rate ratio is set to 0.01, and the warm-up step is 5."
- Cross-modal consistency: The degree to which outputs (e.g., text and images) across different modalities remain semantically coherent. "requiring high-fidelity text and images with strict cross-modal consistency."
- DAPO: A value-free reinforcement learning algorithm used to fine-tune LLMs without an explicit value model. "value-free alternatives like GRPO~\citep{shao2024deepseekmathpushinglimitsmathematical} and DAPO~\citep{yu2025dapoopensourceLLMreinforcement}."
- Diffusion-based generation: Image synthesis using diffusion models that iteratively denoise random noise to produce images. "Diffusion-based Generation: Selected for tasks requiring the creative synthesis of novel or abstract concepts, or complex compositions that do not exist in reality."
- End-to-end models: Architectures that learn to handle all sub-tasks jointly within a single system. "unified, end-to-end models~\citep{xie2025show, zhou2025transfusion, deng2025emerging} that handle both multimodal understanding and generation within a single, integrated framework."
- F1 score (Tool F1 score): The harmonic mean of precision and recall; here used to evaluate tool-selection accuracy. "Tool F1 score curve during RL training."
- Factual grounding: Ensuring outputs are anchored in real-world facts or data. "ill-suited for tasks that require factual grounding such as real-world images"
- GRPO: A value-free RL method for fine-tuning LLMs that removes the need for a learned value function. "value-free alternatives like GRPO~\citep{shao2024deepseekmathpushinglimitsmathematical}"
- GSPO: An RL algorithm (for MoE models) used to fine-tune policies, related to PPO-style optimization. "For MoE model, we use GSPO~\citep{zheng2025group} as the RL algorithm"
- Hybrid reward system: A composite RL reward combining rule-based checks with LLM/MLLM judgments. "a Reinforcement Learning (RL) framework that features a hybrid reward system combining rule-based logic with judgments from LLM and MLLM evaluators."
- Image editing: Automated modification of images (e.g., crop, color adjust, annotate) via an editing tool. "Image Editing: Engaged to perform modifications on existing visual content, whether inputted, retrieved or generated."
- Image num constraint: A rule specifying how many images should be generated or whether they are allowed/required. "A critical feature of this dataset is the annotation of each prompt with an image num constraint."
- Interleaved image-text generation: Producing an alternating sequence of text and images that form a coherent narrative. "A key frontier is interleaved image-text generation~\citep{ge2024seed, tian2024mminterleavedinterleavedimagetextgenerative, xie2025show, zhou2025opening, xia2025mmie,chen2025interleaved}: producing a coherent, alternating sequence of text and images from a single prompt."
- LLM judge: An external LLM used to score text quality and tool-use logic during RL. "The second component R_LLM leverages an external LLM as a judge to assess the quality of the language and the logic of the tool invocation."
- Mixture-of-Experts (MoE): A model architecture that routes inputs to a subset of specialized expert subnetworks. "For MoE model, we use GSPO~\citep{zheng2025group} as the RL algorithm"
- MLLM (Multimodal LLM): A large model that can process and reason over multiple modalities (e.g., text and images). "The third reward component R_mLLM employs an MLLM to evaluate the final interleaved output."
- Multimodal alignment: How well different modalities (text, image) correspond semantically. "we investigate how RL can be used to improve multimodal alignment, the ability to intelligently use tools, and the overall quality of generated reports."
- Out-of-domain (OOD): Data or tasks that differ from the distribution seen during training. "three out-of-domain (OOD) benchmarks."
- Polishing (stage): A post-processing step that refines and improves coherence of the generated multimodal response. "the k refined interleaved multimodal responses are passed to an MLLM for polishing."
- Proximal Policy Optimization (PPO): A popular RL algorithm that stabilizes policy updates via clipping. "While Proximal Policy Optimization (PPO)~\citep{schulman2017proximalpolicyoptimizationalgorithms} is the most common algorithm for fine-tuning LLMs"
- Reinforcement Learning (RL): A learning paradigm where an agent learns to act via rewards and penalties. "We develope a Reinforcement Learning (RL) framework that incorporates a hybrid reward design"
- Reward hacking: Exploiting imperfections in the reward signal to achieve high scores without truly solving the task. "to counteract the agent's potential aversion to more error-prone tools during RL (a form of reward hacking)"
- Rule-based reward: A deterministic reward component enforcing explicit constraints (e.g., image count, tag format). "The first component is a deterministic, rule-based reward R_rule that enforces adherence to generation constraints"
- Sandboxed environment: A restricted execution environment for safely running untrusted code. "which is then re-executed in a sandboxed environment until a valid visualization is obtained"
- Selector model: A model used to rank candidate responses and choose the best one. "a selector model (an LLM/MLLM) evaluates their overall quality and relevance to the prompt"
- Semantic alignment: The correctness of meaning correspondence between image and surrounding text. "strong semantic alignment between each image and its accompanying text."
- Semantic gap: Mismatch between an LLM’s textual intent and a diffusion model’s image interpretation. "it often suffers from a 'semantic gap'"
- Stochastic sampling: Randomized decoding (e.g., temperature/top-k) to generate diverse candidate outputs. "the model first generates multiple complete candidate responses through stochastic sampling."
- Structured <imgen> tag: A JSON-like placeholder in outputs that encodes tool calls and parameters for image generation. "The core of our framework is the structured tag, <imgen>{...}"
- Test-time scaling: Improving performance by expending extra computation at inference (e.g., multiple candidates, selection). "we introduce a test-time scaling~\citep{snell2025scaling, muennighoff2025s1simpletesttimescaling} paradigm"
- Token-level loss: An objective computed per token rather than only at the sequence level. "Following \citet{yu2025dapoopensourceLLMreinforcement}, we use the token-level loss."
- Tool invocation framework: The mechanism that lets the model decide when and how to call external tools. "we design a robust and flexible tool invocation framework."
- Tool-augmented system: A modeling approach that extends capabilities by integrating external tools. "A tool-augmented system can be easily updated with new capabilities by simply adding a new tool to its repertoire"
- Top-k selection: Choosing the k highest-scoring candidates for subsequent refinement or evaluation. "selecting the top-k most promising responses for further enhancement."
- Unified multimodal models: Single models trained to both understand and generate across modalities. "unified multimodal models that either integrate an autoregressive model with a diffusion model"
- Value-free alternatives: RL methods that avoid training an explicit value function (critic). "its reliance on a value model has spurred the popularity of value-free alternatives like GRPO"
- Value model: The critic in actor-critic RL estimating expected returns to guide learning. "its reliance on a value model has spurred the popularity of value-free alternatives"
- Visual tokens: Learnable embeddings that represent images (or parts of images) within a text-generation pipeline. "optimize a set of learnable visual tokens that serve as input for a diffusion-based image decoder"
- Warm-up step: Initial training phase where the learning rate ramps up from a low value. "the warm-up step is 5."