DuoGen: Interleaved Multimodal Generation

Updated 4 July 2026

DuoGen is a modular interleaved multimodal generation system that produces alternating sequences of text and images, enabling dynamic instructional and editing tasks.
It leverages a frozen MLLM for text continuation and a video-pretrained DiT for image synthesis through a dedicated connector and cross-attention mechanism.
The system employs a two-stage decoupled training approach with a curated instruction-tuning dataset and extensive alignment data to ensure consistent and high-performance outputs.

Searching arXiv for DuoGen and closely related interleaved multimodal generation papers for citation support. DuoGen is a modular framework for general-purpose interleaved multimodal generation in which a model produces mixed sequences of text and images in alternating order, with each new output conditioned on the full preceding multimodal context. Introduced in "DuoGen: Towards General Purpose Interleaved Multimodal Generation" (Shi et al., 31 Jan 2026), it targets tasks such as step-by-step instructional guides, visual planning, image-conditioned generation, and image editing, and is defined by a design in which a pretrained multimodal LLM (MLLM) governs textual continuation and modality switching while a pretrained diffusion transformer (DiT) synthesizes images when invoked.

1. Definition and task formulation

DuoGen addresses a setting in which the output is not restricted to a single modality. The objective is to generate a continuation of an interleaved context that may contain both text and images, with the system itself deciding when a visual output is required. In the paper’s operational description, the interleaved prefix before a new image is represented as

$\{T_1, I_1, T_2, I_2, \cdots, T_N\},$

where $T_i$ are text chunks and $I_i$ are images (Shi et al., 31 Jan 2026).

This formulation differs from conventional text-only assistants, standard text-to-image pipelines, and visual-understanding systems. A general-purpose interleaved model must jointly support instruction following, visual understanding of user-supplied images, coherent language generation across steps, image generation aligned with earlier text and images, continuation of partially given multimodal sequences, and editing or manipulation where previous images remain active context. DuoGen is explicitly designed for this broader problem class rather than for isolated text generation or isolated image synthesis alone (Shi et al., 31 Jan 2026).

A central conceptual distinction is that DuoGen is not an early-fusion model pretrained from scratch on mixed multimodal corpora. Instead, it asks whether interleaved alignment can be built directly on top of strong pretrained modules. This suggests a different systems view of multimodal generation: rather than collapsing understanding and rendering into a single pretraining regime, it separates control, reasoning, and visual synthesis while preserving a unified interaction loop.

2. Data curation and instruction-tuning corpus

DuoGen’s data strategy is organized around two regimes: a curated instruction-tuning dataset for training the MLLM to follow interleaved multimodal instructions, and a much larger context-alignment corpus for teaching the image generator to remain consistent with preceding text and images (Shi et al., 31 Jan 2026).

The instruction-tuning set contains 298k interleaved conversation samples, comprising 268k conversations rewritten from curated websites and 30k high-quality synthetic interleaved samples. The website source includes StoryBird, Instructables, eHow, and raw data reused from CoMM. The curation pipeline removes pages containing only text and discards invalid images such as QR codes, icons, and advertisements. Main webpage content is converted into Markdown, then processed through four stages: rewrite and split raw interleaved content; caption and categorize all images; remove duplicate images and reorder text-image chunks; and convert the cleaned sequence into a user–assistant conversation (Shi et al., 31 Jan 2026).

The synthetic subset is introduced because website images are often inconsistent in quality, resolution, style, and step granularity. To address this, the authors build a hierarchical prompt pool over 8 broad domains—Sports, Outdoor & Survival, DIY & Crafting, Vehicle & Transportation, Personal Care & Health, Farm, Pet, and Animals, Home & Living, and Office & Productivity—which human annotators refine into 151 subcategories with around 10 seed questions per subcategory, for a total of 1,500 seed prompts. These are expanded using OpenAI O3 into 15,270 diverse instructions. The paper also reports a cooking-focused synthetic branch using 15k dish images sampled from MM-Food-100k, with invalid images and invalid dish names filtered out (Shi et al., 31 Jan 2026).

For the second training stage, DuoGen uses a substantially larger alignment mixture. The video-derived component begins from 5 million raw videos, segmented into 5-second clips, filtered with scene detection for temporal consistency, and annotated with Qwen2.5-VL-32B to describe object motion, human actions, and camera movements. The supplement further enumerates 939k text-to-image alignment samples, 5,586k image editing samples, and 5,657k interleaved context alignment samples. The paper is explicit that the clearest core instruction-tuning set is the 298k high-quality conversation corpus, whereas the second-stage alignment mixture is broader and more heterogeneous (Shi et al., 31 Jan 2026).

A concise summary of the principal data layers is as follows:

Data component	Scale	Role
Rewritten website conversations	268k	Stage-1 instruction tuning
Synthetic interleaved conversations	30k	Stage-1 instruction tuning
Raw videos for transition mining	5 million	Stage-2 context alignment
Text-to-image alignment data	939k	Stage-2 image alignment
Image editing data	5,586k	Stage-2 image alignment
Interleaved context alignment data	5,657k	Stage-2 image alignment

This curation pipeline is one of DuoGen’s defining features. The empirical ablation in the paper indicates that data cleaning and conversation rewriting are not auxiliary conveniences but major determinants of final performance.

3. Architecture and multimodal interface

DuoGen combines two pretrained components: an MLLM for understanding images, following instructions, generating text, and deciding when an image should be produced; and a DiT, initialized from a video generation model, for generating the next image conditioned on the full prior interleaved context. In the reported implementation, the backbones are Qwen2.5-VL 7B for the MLLM and Cosmos Predict 2.5 (2B) for the DiT (Shi et al., 31 Jan 2026).

The modality-switching mechanism is the special token <BOV> (“Begin-of-Vision”). The MLLM autoregressively predicts text tokens until it emits <BOV>, at which point control passes to the image generator. The newly synthesized image is then appended back into the multimodal history, after which text generation resumes. This yields a recurrent loop: user input → text generation → <BOV> → image generation → append image → continue text (Shi et al., 31 Jan 2026).

Conditioning of the DiT is bimodal. First, all previous images in the context—both user-provided and previously generated—are stacked as condition frames, encoded by the VAE encoder, and concatenated with the noisy latent of the target image along the temporal axis, following Cosmos Predict 2.5’s strategy. Second, the model extracts the MLLM hidden states corresponding to all multimodal tokens before <BOV>, concatenates hidden states from all decoder layers along the channel dimension, and passes them through a lightweight connector that projects them into the DiT conditioning space. At each DiT decoder layer, guidance is injected by cross-attention between image latents and these projected language embeddings (Shi et al., 31 Jan 2026).

This architecture is explicitly modular. A plausible implication is that the paper treats interleaved generation less as a monolithic pretraining problem than as an interface problem between reasoning and rendering modules. The use of a video-pretrained DiT is particularly consequential, because interleaved generation often requires consistency across multiple previous images rather than conditioning on a single prompt or a single reference frame.

4. Decoupled training and inference loop

DuoGen uses a two-stage decoupled strategy. In Stage 1, only the MLLM is instruction-tuned, using the 298k curated interleaved conversations. The objective is standard next-token prediction over assistant outputs: $\mathcal{L}_{\text{text}} = - \sum_{t \in \mathcal{A}} \log p_\theta(x_t \mid x_{<t}),$ where user tokens are masked out and <BOV> is included when it appears in the assistant turn (Shi et al., 31 Jan 2026).

In Stage 2, the MLLM is frozen, while the connector and DiT are trained on the broader context-alignment mixture. For each interleaved sequence, one target image is sampled, a random diffusion step is drawn from the scheduler, and the DiT is optimized with a diffusion-style objective, specifically described as permitting flow matching: $\mathcal{L}_{\text{img}} = \mathbb{E}_{(S,I^*),t}\left[\ell_{\text{diff}}\left(\text{DiT}(z_t^{I^*}, t; c_{\text{vis}}(S), c_{\text{text}}(S)),\, \text{target}\right)\right].$ The paper does not specify a more detailed target parameterization beyond this description (Shi et al., 31 Jan 2026).

At inference time, the system repeatedly feeds the current interleaved context to the MLLM, generates text token by token, halts when <BOV> appears, synthesizes an image with the DiT, appends that image to the context, and resumes text generation. Classifier-free guidance is used for image sampling; for the negative velocity, visual conditions are kept fixed while the final text chunk is removed from the MLLM hidden-state sequence. Additional implementation details reported in the paper include an MLLM image cap of 480 px max side length to avoid OOM and qualitative outputs shown at 768 × 768 resolution (Shi et al., 31 Jan 2026).

This decoupled training regime is a defining methodological choice. The paper’s rationale is that instruction-following behavior should be protected from noisier alignment data, while image-context consistency can be learned at scale without further perturbing the MLLM.

5. Empirical performance and benchmark profile

DuoGen is evaluated on public interleaved generation benchmarks, a newly introduced benchmark, text-to-image evaluation, and image editing. On CoMM, it reports Sty. 9.22, Enti. 9.22, Tren. 9.24, Comp. 9.66, ImgQ 9.53, and IRS 7.76, exceeding the listed open-source baselines including MiniGPT-5, SEED-LLaMA, and Emu2. On InterleavedBench, it reports T-Q 4.28, I-Q 3.65, I-Co 3.70, IT-Co 3.69, Helpfulness 4.06, and Avg. 3.87 (Shi et al., 31 Jan 2026).

The paper also introduces a new benchmark with Cooking-200, Cooking-200-Text-Input, and How-to-500. On Cooking-200, DuoGen reports T-Com 3.61, I-Com 4.70, I-Co 3.92, I-Q 4.78, and IT-Co 4.75. On Cooking-200-Text-Input, it reports 3.82, 4.77, 4.17, 4.79, and 4.76 for the same metrics. On How-to-500, it reports T-Com 3.39, I-Com 4.22, I-Co 4.21, I-Q 4.08, and IT-Co 4.18. In these comparisons, the paper states that DuoGen strongly outperforms open-source baselines and approaches Nano Banana on several subsets, while exceeding it on specific metrics such as image completeness in the cooking setting (Shi et al., 31 Jan 2026).

DuoGen also performs strongly outside explicitly interleaved benchmarks. On GenEval, it reports Single Object 0.82, Two Object 0.99, Counting 0.94, Colors 0.91, Position 0.84, Attribute Binding 0.80, and Overall 0.88. On ImgEdit, it reports Add 4.53, Adjust 4.33, Extract 2.28, Replace 4.69, Remove 4.71, Background 4.61, Style 4.51, Hybrid 3.85, Action 4.67, and Overall 4.19. On GEdit_EN, the scores reported are G_SC 7.68, G_PQ 7.76, and G_O 7.35 (Shi et al., 31 Jan 2026).

The data ablation is especially informative. On CoMM, training with CoMM original data yields Sty. 6.14, Enti. 6.21, Tren. 6.52, Comp. 6.45, ImgQ. 6.30, IRS 4.42; processing the data with the authors’ engine improves these to 7.85, 7.76, 7.22, 8.15, 7.79, 5.91; and adding synthetic data further improves them to 9.15, 9.21, 9.30, 9.45, 9.48, 7.58. This is one of the clearest empirical demonstrations in the paper that data quality and data form, rather than architecture alone, are central bottlenecks for interleaved generation (Shi et al., 31 Jan 2026).

6. Position within multimodal generation research

DuoGen occupies a specific place within multimodal generation research. It is neither a pure tokenizer-level unification method nor a fully fused multimodal model pretrained end to end from scratch. Instead, it is a modular interleaved-generation system that uses an MLLM for control and a video-pretrained DiT for rendering, linked through hidden-state projection and cross-attention. In that respect, it is closely related to a broader shift toward reusing specialized pretrained modules rather than rebuilding joint models from the ground up (Shi et al., 31 Jan 2026).

A useful contrast is with DualToken, which addresses a different bottleneck: the incompatibility between the visual representations required for understanding and those required for generation within autoregressive multimodal models. DualToken resolves that conflict through dual visual vocabularies—semantic and perceptual—inside a unified tokenizer, whereas DuoGen resolves a different systems problem by coupling a pretrained MLLM with a pretrained DiT through a decoupled interface (Song et al., 18 Mar 2025). This suggests that “unified multimodal generation” now encompasses at least two distinct design lines: tokenizer-centric unification and module-centric interleaved orchestration.

The paper also identifies several limitations. It does not provide extensive controlled ablations over MLLM choice, DiT backbone, connector depth, or cross-attention versus MMDiT-style conditioning. The MLLM remains frozen during the second stage, which preserves instruction-following behavior but may leave some language–image co-adaptation unrealized. Sequence-level modality decisions are learned through <BOV> token prediction rather than via a dedicated planner. Exact robustness under highly complex or very long interleaved contexts is not fully explored. The authors explicitly list scaling up pretrained components, deeper architectural ablations, and comparison to MMDiT-style alternatives as future work (Shi et al., 31 Jan 2026).

Within the literature, DuoGen’s principal significance lies in showing that interleaved multimodal generation can be approached as a modular alignment problem rather than solely as a monolithic pretraining problem. Its central claim is not only that one model can both write and draw, but that strong interleaved performance can emerge from the combination of high-quality instruction data, a reusable pretrained MLLM, a reusable pretrained video DiT, and a carefully designed interface between them (Shi et al., 31 Jan 2026).

Markdown Report Issue Upgrade to Chat

References (2)

DuoGen: Towards General Purpose Interleaved Multimodal Generation (2026)

DualToken: Towards Unifying Visual Understanding and Generation with Dual Visual Vocabularies (2025)

Topic to Video (Beta)

No one has generated a video about this topic yet.

Whiteboard

No one has generated a whiteboard explanation for this topic yet.

Follow Topic

Get notified by email when new papers are published related to DuoGen.