
UniGen-1.5: Unified Multimodal Model

Updated 3 July 2026
  • UniGen-1.5 is a unified multimodal large language model that simultaneously performs image understanding, text-to-image generation, and image editing through a shared architectural design.
  • It employs dual visual front-ends—a continuous encoder and a discrete tokenizer—to fuse semantic image cues with autoregressive token prediction for enhanced synthesis and reconstruction.
  • The unified reinforcement learning and lightweight Edit Instruction Alignment stage boost performance, achieving a GenEval score of 0.89 and an ImgEdit score of 4.31.

UniGen-1.5 is a unified multimodal LLM (MLLM) for image understanding, text-to-image generation, and image editing. Introduced as a 2025 preprint, it extends UniGen by modifying both the model architecture and the training pipeline (Tian et al., 18 Nov 2025). Two changes receive particular emphasis: a unified reinforcement learning strategy that optimizes image generation and image editing jointly through shared reward models, and a lightweight Edit Instruction Alignment stage designed to improve instruction comprehension before reinforcement learning. The reported outcome is a single system that remains competitive on image understanding, reaches a GenEval overall score of 0.89, and attains an ImgEdit overall score of 4.31, surpassing BAGEL and approaching proprietary systems such as GPT-Image-1 on the cited benchmarks (Tian et al., 18 Nov 2025).

1. System definition and functional scope

UniGen-1.5 is framed as a single autoregressive MLLM that handles three task families: image understanding, generation, and editing (Tian et al., 18 Nov 2025). The paper’s central design choice is unification: rather than treating text-to-image synthesis and image editing as separate model families with distinct optimization objectives, UniGen-1.5 trains them within a common architecture and aligns them further through a shared reinforcement learning stage.

The scope is broader than conventional diffusion-based editing pipelines or LMM-only understanding systems. In the reported formulation, generation is conditioned on text, while editing is conditioned on a condition image, text instruction, and discrete visual tokens derived from the same condition image. This makes editing a native model behavior rather than an external module attached to a generator.

A frequent assumption in multimodal modeling is that high-quality editing requires a specialized edit reward model or a separate editor architecture. UniGen-1.5 explicitly rejects that assumption: it avoids training a bespoke edit reward model by reformulating editing as a generation problem against a textual description of the desired output. This suggests that the paper treats reward unification, rather than task-specific reward engineering, as the principal mechanism for cross-task transfer.

2. Backbone architecture and visual representations

The foundation LLM is Qwen2.5-7B, which serves as the autoregressive core (Tian et al., 18 Nov 2025). Around this core, UniGen-1.5 uses two visual front-ends that remain frozen after initialization.

The first is a continuous encoder, $\mathrm{Enc}^U$, instantiated as SigLIP2. It accepts variable-resolution inputs with arbitrary aspect ratio and produces a sequence of continuous embeddings,

$$X^U = \mathrm{Enc}^U(X).$$

The second is a discrete tokenizer, $\mathrm{Enc}^G$, instantiated as MAGViTv2. It tokenizes a $384 \times 384$ image into a one-dimensional sequence of discrete visual tokens,

$$X^G = \mathrm{Enc}^G(X).$$

Modality fusion is implemented with an MLP that projects both continuous image embeddings and text embeddings into the LLM feature space and concatenates them for next-token prediction. In operational terms, the model therefore combines a continuous visual stream for perceptual conditioning with a discrete visual stream for token-level image synthesis and reconstruction. A plausible implication is that the continuous path supports semantic grounding and preservation of visual context, while the discrete path provides a tractable autoregressive target space for generation and editing.

For image editing, the paper reports that the optimal sequence order is

$$[X_C^U] \rightarrow [T_C] \rightarrow [X_C^G].$$

This ordering places continuous condition-image embeddings first, followed by the text instruction and then the discrete tokens of the condition image. The result is a single-token-prediction framework in which editing remains structurally close to generation, but with additional conditioning channels supplying both semantic and low-level visual cues.
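The ordering above can be made concrete with a small sketch. It only illustrates how the three conditioning segments are laid out before the model predicts edited tokens; the function name `build_edit_sequence`, the segment labels, and the toy payloads are hypothetical and are not taken from the paper's code.

```python
from typing import List, Tuple

# A segment is a (label, payload) pair. Labels mirror the notation in the text.
Segment = Tuple[str, list]


def build_edit_sequence(cond_embeddings: list,
                        instruction_tokens: list,
                        cond_visual_tokens: list) -> List[Segment]:
    """Lay out the editing context as [X_C^U] -> [T_C] -> [X_C^G]."""
    return [
        ("X_C^U", cond_embeddings),     # continuous embeddings of the condition image
        ("T_C", instruction_tokens),    # text instruction
        ("X_C^G", cond_visual_tokens),  # discrete tokens of the condition image
    ]


def segment_order(sequence: List[Segment]) -> List[str]:
    """Return only the labels, in the order they appear in the context."""
    return [label for label, _ in sequence]


def context_length(sequence: List[Segment]) -> int:
    """Total number of positions contributed by all segments."""
    return sum(len(payload) for _, payload in sequence)


# Toy inputs: two embedding vectors, three instruction tokens, three visual tokens.
sequence = build_edit_sequence(
    cond_embeddings=[[0.1, 0.2], [0.3, 0.4]],
    instruction_tokens=["make", "it", "red"],
    cond_visual_tokens=[17, 4, 92],
)
print(segment_order(sequence), context_length(sequence))
```

In this layout the text-to-image case corresponds to a context containing only the text segment, which is why editing stays structurally close to generation.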

3. Generation and editing objectives

Text-to-image generation uses a masked-token objective in the style of MaskGIT on the discrete token sequence $X^G$ (Tian et al., 18 Nov 2025). For each token index $i$, a binary mask variable $m_i \in \{0,1\}$ is sampled according to a schedule $\gamma(\cdot)$. The masked input $\tilde{X}^G$ is formed by replacing token $X^G_i$ with a special $[\mathrm{MASK}]$ token when $m_i = 1$. The training objective is

$$\mathcal{L}_{\mathrm{gen}} = -\,\mathbb{E}\left[\sum_{i:\, m_i = 1} \log p_\theta\!\left(X^G_i \mid \tilde{X}^G, T\right)\right],$$

where $T$ denotes the text prompt and $\theta$ the model parameters.
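As a minimal illustration of a masked-token objective of this kind, the sketch below sums the negative log-likelihood of the true tokens over masked positions only. It assumes the model's per-position output distributions are already available as plain probability lists; the function name and toy numbers are hypothetical.

```python
import math
from typing import List, Sequence


def masked_token_nll(token_probs: Sequence[Sequence[float]],
                     targets: Sequence[int],
                     mask: Sequence[int]) -> float:
    """Sum of -log p(target token) over positions where mask == 1.

    token_probs[i][k] is the predicted probability of vocabulary entry k
    at position i; unmasked positions do not contribute to the loss.
    """
    total = 0.0
    for probs, target, m in zip(token_probs, targets, mask):
        if m == 1:
            total -= math.log(probs[target])
    return total


# Toy example: vocabulary of size 2, three positions, positions 0 and 2 masked.
probs: List[List[float]] = [[0.5, 0.5], [0.25, 0.75], [0.9, 0.1]]
loss = masked_token_nll(probs, targets=[0, 1, 0], mask=[1, 0, 1])
print(loss)  # equals -log(0.5) - log(0.9)
```

Position 1 is unmasked, so its prediction is ignored even though the model also assigns it a probability.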

At inference time, generation iteratively unmasks tokens over 50 steps using a cosine masking schedule and classifier-free guidance with a scalar guidance scale. This places UniGen-1.5 within the family of discrete-token iterative decoders rather than one-shot autoregressive raster models.
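A cosine masking schedule of the kind used by MaskGIT-style decoders can be sketched as follows. The 50-step count comes from the text; the sequence length of 576 tokens and the exact rounding rule are assumptions chosen for illustration, not values reported for UniGen-1.5.

```python
import math
from typing import List


def cosine_mask_ratio(progress: float) -> float:
    """Fraction of tokens still masked at a given decoding progress in [0, 1]."""
    return math.cos(0.5 * math.pi * progress)


def masked_counts(num_tokens: int, num_steps: int) -> List[int]:
    """Number of tokens left masked after each of the num_steps decoding steps."""
    counts = []
    for step in range(1, num_steps + 1):
        ratio = cosine_mask_ratio(step / num_steps)
        counts.append(int(math.floor(ratio * num_tokens)))
    return counts


# Hypothetical sequence length; only the step count (50) is taken from the text.
counts = masked_counts(num_tokens=576, num_steps=50)
print(counts[:3], counts[-3:])
```

The schedule reveals few tokens in the early steps and many in the late steps, so early predictions are made with little context but commit to only a small part of the image.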

Image editing uses the same masked-token prediction principle, but the conditioning changes. The model is given the continuous embedding of the condition image $X_C^U$, the text instruction $T_C$, and the discrete tokens of the condition image $X_C^G$. The LLM then predicts the edited discrete tokens corresponding to the target image. Reconstruction back to pixels is performed with the MAGViTv2 decoder.

The significance of this formulation is methodological rather than merely architectural. Generation and editing are not separated into heterogeneous learning problems; both are reduced to masked discrete-token prediction under different conditioning regimes. This suggests that much of the paper’s performance gain derives from keeping the predictive target space shared across tasks while varying only the context.

4. Unified reinforcement learning and shared rewards

The reinforcement learning stage is based on Group Relative Policy Optimization (GRPO) (Tian et al., 18 Nov 2025). The policy $\pi_\theta$ generates $N$ candidate images $\{o_i\}_{i=1}^{N}$, conditioned either on text $T$ for generation or on $(X_C, T_C)$ for editing. Each candidate receives a scalar reward $r_i$ from a shared reward function $R(\cdot)$.

The group-normalized advantage is

$$A_i = \frac{r_i - \mu_r}{\sigma_r},$$

where $\mu_r$ is the mean of the group rewards and $\sigma_r$ is the corresponding standard deviation. The GRPO objective, with a KL penalty to the initial policy $\pi_{\mathrm{ref}}$, is

$$\mathcal{J}(\theta) = \mathbb{E}\left[\frac{1}{N}\sum_{i=1}^{N} \min\!\big(\rho_i A_i,\ \mathrm{clip}(\rho_i,\, 1-\epsilon,\, 1+\epsilon)\, A_i\big) - \beta\, D_{\mathrm{KL}}\!\left(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\right)\right],$$

where

$$\rho_i = \frac{\pi_\theta(o_i \mid c)}{\pi_{\theta_{\mathrm{old}}}(o_i \mid c)},$$

$c$ denotes the conditioning input ($T$ for generation or $(X_C, T_C)$ for editing), $\epsilon$ is a small clipping hyper-parameter, and $\beta$ weights the KL penalty.
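The group-relative advantage and the clipped surrogate term in GRPO can be illustrated with a few lines of Python. This is a sketch under stated assumptions: the population standard deviation, the small stabilizer `eps`, and the default `clip_eps=0.2` are illustrative choices, not values reported for UniGen-1.5.

```python
import statistics
from typing import List, Sequence


def group_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Normalize rewards within one group: (r_i - mean) / (std + eps).

    eps is an assumed stabilizer that avoids division by zero when all
    candidates in the group receive the same reward.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def clipped_term(ratio: float, advantage: float, clip_eps: float = 0.2) -> float:
    """PPO/GRPO-style clipped surrogate for a single candidate."""
    clipped_ratio = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    return min(ratio * advantage, clipped_ratio * advantage)


advantages = group_advantages([1.0, 2.0, 3.0])
flat_group = group_advantages([0.5, 0.5, 0.5])
print(advantages, flat_group)
print(clipped_term(1.5, 1.0), clipped_term(0.5, -1.0))
```

The `flat_group` case shows why low reward variance is a problem: when every candidate scores the same, all advantages are zero and the policy gradient carries no signal.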

The reward model is an ensemble of four off-the-shelf vision experts:

  • CLIP-H similarity
  • HPSv2 aesthetic/alignment score
  • Unified-Reward-7B fine-grained consistency
  • ORM “Yes/No” outcome model

The final scalar reward for each candidate combines the four expert scores.

For generation, the target text used for scoring is the ground-truth prompt. For editing, it is a synthesized caption of the desired edit produced using Qwen-72B. This is the key unification step: editing is scored against a textual description of the intended output rather than through a dedicated edit-specific reward network. The paper reports that joint RL outperforms generation-only RL and edit-only RL in the combined metric profile, with the unified variant achieving GenEval 0.89 and ImgEdit 4.31.
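The sketch below illustrates the shared-reward idea: both tasks reduce to scoring an image against a target text, so the same scorers serve generation and editing. Everything here is hypothetical scaffolding. The scorer callables are constant-valued stand-ins for the real reward models, and the unweighted mean is an assumed combination rule used only for illustration.

```python
from statistics import fmean
from typing import Callable, Dict, Optional

# A scorer maps (image identifier, target text) to a scalar score.
Scorer = Callable[[str, str], float]


def target_text(task: str,
                prompt: Optional[str] = None,
                edit_caption: Optional[str] = None) -> str:
    """Pick the text the candidate image is scored against."""
    if task == "generation":
        if prompt is None:
            raise ValueError("generation requires the ground-truth prompt")
        return prompt
    if task == "editing":
        if edit_caption is None:
            raise ValueError("editing requires a caption of the desired output")
        return edit_caption
    raise ValueError(f"unknown task: {task}")


def shared_reward(image: str, text: str, scorers: Dict[str, Scorer]) -> float:
    """Assumed aggregation: unweighted mean of all scorer outputs."""
    return fmean(scorer(image, text) for scorer in scorers.values())


# Constant-valued stand-ins for the four reward experts named in the text.
dummy_scorers: Dict[str, Scorer] = {
    "clip_h": lambda image, text: 0.8,
    "hpsv2": lambda image, text: 0.6,
    "unified_reward": lambda image, text: 1.0,
    "orm": lambda image, text: 0.0,
}

text = target_text("editing", edit_caption="a red car parked on a beach")
reward = shared_reward("candidate_0.png", text, dummy_scorers)
print(text, reward)
```

Because `shared_reward` never inspects the task type, adding editing to the RL stage needs only a caption of the desired output, not a new reward model.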

5. Edit Instruction Alignment and staged training

Before reinforcement learning, the paper reports that the model struggled to satisfy diverse editing instructions and produced low reward variance among candidates, leading to a weak RL signal (Tian et al., 18 Nov 2025). The proposed remedy is a lightweight Post-SFT Edit Instruction Alignment stage.

This stage trains the model to map the condition-image embedding and instruction, $(X_C^U, T_C)$, to a textual description of the intended edited image. The dataset contains 17,663 triplets of condition image, edit instruction, and output description. Given $(X_C^U, T_C)$, the model generates an output caption $Y = (y_1, \dots, y_L)$ with the standard cross-entropy loss

$$\mathcal{L}_{\mathrm{align}} = -\sum_{t=1}^{L} \log p_\theta\!\left(y_t \mid y_{<t},\, X_C^U,\, T_C\right).$$

The alignment stage runs for 500 steps with learning rate $1 \times 10^{-5}$ and batch size 64, immediately after SFT and before RL. The paper states that this stage maintains understanding performance, with no drop on VQA and hallucination benchmarks, while sharpening edit-instruction comprehension.
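A quick calculation shows how light this stage is. The sketch assumes that the batch size counts individual triplets and that sampling is roughly uniform over the dataset; neither assumption is stated in the source.

```python
# Figures quoted in the text for the Edit Instruction Alignment stage.
ALIGN_STEPS = 500
ALIGN_BATCH_SIZE = 64
ALIGN_DATASET_SIZE = 17_663

# Assumption: each step consumes ALIGN_BATCH_SIZE distinct triplets.
samples_seen = ALIGN_STEPS * ALIGN_BATCH_SIZE
approx_epochs = samples_seen / ALIGN_DATASET_SIZE

print(samples_seen, round(approx_epochs, 2))
```

Under these assumptions the stage makes fewer than two passes over the alignment data, which fits its description as lightweight.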

The full four-stage training pipeline is as follows:

| Stage | Steps | Batch size | Learning rate | Data ratio (Und:Txt:Gen:Edit) |
|---|---|---|---|---|
| Pre-train | 300K | 576 | 1e-4 | 2:1:3:- |
| SFT | 73K | 128 | 2e-5 | 8:2:3:3 |
| Edit-Align | 0.5K | 64 | 1e-5 | 1:-:-:- |
| RL | 1.5K | 32 | 3e-6 | -:-:1:1 |
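The stage table can be read as a small configuration. The sketch below transcribes it and derives two quantities: the per-task sampling fractions implied by each data ratio, and the number of samples processed per stage. Reading the ratios as normalized sampling weights and the batch size as a sample count are assumptions; the dictionary layout and function names are hypothetical.

```python
from typing import Dict, Optional, Tuple

TASKS = ("Und", "Txt", "Gen", "Edit")
Ratio = Tuple[Optional[int], Optional[int], Optional[int], Optional[int]]

# Values transcribed from the table; None stands for a "-" entry.
STAGES: Dict[str, dict] = {
    "Pre-train":  {"steps": 300_000, "batch": 576, "lr": 1e-4, "ratio": (2, 1, 3, None)},
    "SFT":        {"steps": 73_000,  "batch": 128, "lr": 2e-5, "ratio": (8, 2, 3, 3)},
    "Edit-Align": {"steps": 500,     "batch": 64,  "lr": 1e-5, "ratio": (1, None, None, None)},
    "RL":         {"steps": 1_500,   "batch": 32,  "lr": 3e-6, "ratio": (None, None, 1, 1)},
}


def mixture_fractions(ratio: Ratio) -> Dict[str, float]:
    """Normalize a data ratio into per-task fractions (assumed sampling weights)."""
    weights = [0 if w is None else w for w in ratio]
    total = sum(weights)
    return {task: w / total for task, w in zip(TASKS, weights)}


def samples_processed(stage: dict) -> int:
    """Steps times batch size, assuming the batch size counts samples."""
    return stage["steps"] * stage["batch"]


for name, stage in STAGES.items():
    print(name, samples_processed(stage), mixture_fractions(stage["ratio"]))
```

Under this reading, pre-training processes about 172.8 million samples while RL processes 48,000, so the RL stage is a short refinement applied to an already trained model.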

Pre-training mixes ImageNet, CC-3M, CC-12M, SAM-11M, image understanding data, and text-only data. SFT adds BLIP-3o and ShareGPT-4o synthetic T2I samples, GPT-1.5 M edit data, and SlowFast-LLaVA-1.5 understanding data. RL uses T2I-R1 prompts with 6,786 samples and Edit-RL with 10,568 edit triplets. RL decoding uses multiple candidates per input, 16-step mask prediction, and a KL penalty.

6. Empirical performance, ablations, and interpretation

On text-to-image generation, UniGen-1.5 achieves a GenEval overall score of 0.89, compared with BAGEL at 0.82, BLIP3-o at 0.84, and GPT-Image-1 at 0.84 (Tian et al., 18 Nov 2025). The paper reports category scores of 0.93 on Two-object, 0.80 on Counting, 0.92 on Position, and 0.81 on Color Attribution. On DPG-Bench, the overall score is 86.83, compared with 81.60 for BLIP3-o, with Attribute at 90.55 and Entity at 92.64.

On image editing, the ImgEdit overall score is 4.31 on a 5-point scale, compared with Qwen-Image at 4.27, GPT-Image-1 at 4.20, OmniGen2 at 3.44, and BAGEL at 3.20. The reported per-category scores are 4.78 for Replace, 4.57 for Remove, 4.69 for Style, 4.18 for Adjust, and 3.88 for Compose.

On image understanding and hallucination-oriented evaluation, the average across AI2D, GQA, POPE, MMMU, MathVista, ScienceQA, and SEED-Bench is 68.6%. The paper emphasizes stability across training stages, reporting 68.7% after SFT, 68.6% after Edit-Align, and 68.6% after RL. This directly counters the concern that optimizing a unified image generation and editing policy might degrade multimodal understanding.

The ablations isolate two main contributors. First, unified RL provides the best combined boost relative to generation-only or edit-only RL. Second, with Edit-Align applied before RL, ImgEdit rises from 4.08 to 4.31, whereas without Edit-Align, RL raises it only from 4.08 to 4.29. Qualitative examples further indicate sharper fine-grained edits, such as “make the cat sit up,” and improved semantic alignment in object counts and positions.

Taken together, these results position UniGen-1.5 as a model whose main contribution is not only competitive benchmark performance but also a specific systems-level claim: a single MLLM can couple understanding, generation, and editing if the visual interface, masked-token objectives, caption-mediated edit alignment, and shared-reward RL are optimized as a unified training stack.

References (1)
