Unified Image-to-Image Generation (UniGen)

Updated 4 July 2026

UniGen is a unified approach that integrates text-to-image, editing, and controllable generation tasks under one framework using shared diffusion backbones, token conditioning, and expert modulation.
It employs architectural patterns such as shared backbones, unified multimodal encoder spaces, and autoregressive token models to handle heterogeneous image generation tasks with a consistent interface.
Staged training curricula and specialized loss functions enable UniGen systems to balance diverse tasks with high performance, though challenges in spatial control and resource cost remain.

Unified image-to-image generation (UniGen) denotes a research direction, and in some cases a model name, centered on performing multiple visual generation tasks within a single framework rather than maintaining task-specific models. In this literature, a unified system is typically expected to support text-to-image generation, instruction-based editing, controllable generation from structural conditions, personalization, and related image-to-image transformations through shared weights, a shared conditioning space, a shared objective, or a combination of these. The term appears both as a generic objective and as the title of specific systems, including the diffusion-based UniGen framework with Condition Modulated Experts and WeaveNet, the MLLM-based UniGen and UniGen-1.5 models for understanding, generation, and editing, and a broader family of unified generators such as UniVG, UniFusion, VisualCloze, SpatialFusion, CoLoGen, and HiDream-O1-Image (Zhang et al., 24 Aug 2025, Tian et al., 20 May 2025, Tian et al., 18 Nov 2025, Fu et al., 16 Mar 2025, Li et al., 14 Oct 2025, Li et al., 10 Apr 2025, Qiu et al., 29 Apr 2026, Song et al., 25 Feb 2026, Cai et al., 11 May 2026).

1. Historical development and scope

An early formulation of unification in controllable image translation appeared in a GAN setting, where a single generator $G(x, C_y)$ and discriminator family were used for multiple structure-guided translations such as hand gesture translation and cross-view image translation. In that setting, unification meant one generic architecture conditioned on a source image and a controllable structure, supported by color loss, controllable structure guided cycle-consistency loss, controllable structure guided self-content preserving loss, and the pairwise metric Fréchet ResNet Distance (FRD) (Tang et al., 2019).

Recent work broadened the scope from controllable translation to generalist multimodal generation. Diffusion-based systems such as UniVG explicitly frame unification as “a single set of weights” for text-to-image, inpainting, instruction-based editing, identity-preserving generation, layout-guided generation, depth estimation, pose estimation, and referring segmentation, all trained under the same latent flow-matching objective (Fu et al., 16 Mar 2025). UNIC-Adapter extends this idea to a single controllable generation adapter over SD3 medium, supporting pixel-level controls, subject-driven generation, and style-image-based synthesis through a unified image-instruction adapter rather than per-condition specialized models (Duan et al., 2024).

A second line of work shifts unification into the representation space. UniFusion conditions diffusion on a frozen large vision-LLM used as the only multimodal encoder, HiDream-O1-Image maps raw image pixels, text tokens, and task-specific conditions into a single shared token space inside a pixel-space Unified Transformer, and UGen converts both texts and images into discrete token sequences processed by a single autoregressive transformer (Li et al., 14 Oct 2025, Cai et al., 11 May 2026, Tang et al., 27 Mar 2025). A third line emphasizes task abstraction: VisualCloze reduces many image generation tasks to image infilling on a grid of demonstrations, and RealGeneral reformulates diverse image generation tasks as conditional frame prediction in a video model (Li et al., 10 Apr 2025, Lin et al., 13 Mar 2025).

Within this landscape, “UniGen” does not denote a single canonical architecture. The cited works use the term to describe a generic ambition—one model for heterogeneous image generation tasks—and, in specific papers, to name concrete architectures built around diffusion transformers, multimodal LLMs, or unified autoregressive token models (Zhang et al., 24 Aug 2025, Tian et al., 20 May 2025, Tian et al., 18 Nov 2025).

2. Architectural patterns in unified systems

Across the literature, unification is realized through several recurring architectural patterns rather than one fixed blueprint.

Pattern	Representative systems	Core mechanism
Shared diffusion backbone with unified conditions	UniVG, UNIC-Adapter, UniGen	One DiT/MM-DiT plus shared conditioning interface
Unified multimodal encoder space	UniFusion, HiDream-O1-Image	One VLM or one shared token space for text and images
Unified discrete autoregression	UGen, UniGen, UniGen-1.5	One transformer over text and image token sequences
Visual in-context or temporal reformulation	VisualCloze, RealGeneral	Tasks expressed as infilling or frame prediction
Specialized control augmentation	SpatialFusion, CoLoGen, LAC	Geometry, expert routing, or latent actions added to unified backbones

A backbone-centered formulation appears in UniVG, where the same latent diffusion transformer receives prompt embeddings, visual conditions, and masks through channel concatenation, with task identity controlled by special tokens such as <t2i>, <ie>, <depth>, <pose>, <seg>, and <lg> (Fu et al., 16 Mar 2025). The UniGen framework in “Condition Weaving Meets Expert Modulation” follows the same high-level logic but replaces per-condition branches with Condition Modulated Experts and introduces WeaveNet, a “snake-like” interaction between the backbone and control branch to bridge text-guided global features and condition-guided local features (Zhang et al., 24 Aug 2025). UNIC-Adapter instead adds a parallel image-instruction branch over SD3 medium and injects its outputs into MM-DiT through cross-attention enhanced by Rotary Position Embedding (Duan et al., 2024).

Encoder-centered formulations relocate unification to the conditioning space. UniFusion uses a frozen decoder-only transformer VLM as the only conditioning encoder and extracts multi-layer conditioning through Layerwise Attention Pooling and a bidirectional refiner, with VERIFI using VLM-generated rewritten text tokens during inference (Li et al., 14 Oct 2025). HiDream-O1-Image removes both the external VAE and the separate frozen text encoder, and treats raw image patches, text tokens, and condition tokens as one sequence inside a pixel-space decoder-only Transformer with hybrid attention, causal over condition and text tokens and full over generation tokens (Cai et al., 11 May 2026).
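The hybrid attention pattern described for HiDream-O1-Image can be illustrated with a small mask builder. This is a minimal sketch rather than the paper's implementation: the function name `hybrid_attention_mask`, the boolean-matrix representation, and the choice that generation tokens see the entire prefix are assumptions made for illustration.

```python
from typing import List


def hybrid_attention_mask(num_prefix: int, num_gen: int) -> List[List[bool]]:
    """Return mask[i][j] == True when query token i may attend to key token j.

    Assumed layout (hypothetical): the first `num_prefix` positions hold
    condition and text tokens, the remaining `num_gen` positions hold
    generation tokens.
    """
    total = num_prefix + num_gen
    mask = [[False] * total for _ in range(total)]
    for i in range(total):
        for j in range(total):
            if i < num_prefix:
                # Prefix tokens: causal attention (self and earlier tokens only).
                mask[i][j] = j <= i
            else:
                # Generation tokens: full attention over the prefix and each other.
                mask[i][j] = True
    return mask


if __name__ == "__main__":
    # Print a small mask; "x" marks an allowed query/key pair.
    for row in hybrid_attention_mask(num_prefix=3, num_gen=2):
        print("".join("x" if allowed else "." for allowed in row))
```

Under these assumptions, text-to-image and editing differ only in what fills the prefix, while the mask rule stays the same.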

Autoregressive token unification is exemplified by UGen and the UniGen MLLM family. UGen uses a single TinyLlama-based decoder with a unified vocabulary over 32k text tokens, 16,384 visual tokens, and special symbols such as $[SOS]$, $[EOS]$, $[SOI]$, $[EOI]$, and $[MASK]$ (Tang et al., 27 Mar 2025). UniGen and UniGen-1.5 retain a shared LLM core but decouple continuous visual understanding from discrete image generation, using SigLIP-style or SigLIP2 encoders for understanding and MAGVIT-v2 or MAGViTv2 for discrete image tokens (Tian et al., 20 May 2025, Tian et al., 18 Nov 2025).
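One way to picture a unified vocabulary of this kind is as a single integer ID space with offsets per modality. The sketch below is illustrative only: reading "32k" as 32,000, the ordering text, then visual, then special symbols, the sequence layout in `build_sequence`, and all helper names are assumptions, not details confirmed by the UGen paper.

```python
from typing import Dict, List

TEXT_VOCAB_SIZE = 32_000    # assumption: "32k" read as 32,000
VISUAL_VOCAB_SIZE = 16_384  # visual codebook size stated in the text
SPECIAL_TOKENS = ["[SOS]", "[EOS]", "[SOI]", "[EOI]", "[MASK]"]

# Assumed layout: text IDs first, then visual IDs, then special symbols.
VISUAL_OFFSET = TEXT_VOCAB_SIZE
SPECIAL_OFFSET = TEXT_VOCAB_SIZE + VISUAL_VOCAB_SIZE
SPECIAL_IDS: Dict[str, int] = {
    tok: SPECIAL_OFFSET + i for i, tok in enumerate(SPECIAL_TOKENS)
}
UNIFIED_VOCAB_SIZE = SPECIAL_OFFSET + len(SPECIAL_TOKENS)


def visual_to_unified(code: int) -> int:
    """Map a visual codebook index into the unified ID space."""
    if not 0 <= code < VISUAL_VOCAB_SIZE:
        raise ValueError(f"visual code {code} out of range")
    return VISUAL_OFFSET + code


def build_sequence(text_ids: List[int], visual_codes: List[int]) -> List[int]:
    """Hypothetical layout: [SOS] text [SOI] image [EOI] [EOS]."""
    image_ids = [visual_to_unified(c) for c in visual_codes]
    return (
        [SPECIAL_IDS["[SOS]"]]
        + list(text_ids)
        + [SPECIAL_IDS["[SOI]"]]
        + image_ids
        + [SPECIAL_IDS["[EOI]"], SPECIAL_IDS["[EOS]"]]
    )


if __name__ == "__main__":
    print(UNIFIED_VOCAB_SIZE)
    print(build_sequence([5, 7], [0, 1]))
```

With a shared ID space, a single output head can predict text and image tokens alike, which is what lets one autoregressive transformer serve both directions.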

A plausible implication is that “unified” increasingly refers less to architectural minimalism than to interface consistency: one model may still contain multiple encoders, adapters, experts, or routing modules, provided task variation is handled without separate end-to-end models.

3. Conditioning and control mechanisms

The central technical challenge in UniGen systems is not merely joint training, but how heterogeneous controls are represented and injected. The cited works offer several distinct solutions.

One family uses direct latent or channel concatenation. UniVG defines

$d = [ z_t \oplus \mathrm{VAE}_{\mathrm{Enc}}(\mathcal V) \oplus \mathrm{Resize}(\mathcal M) ],$

so the noisy target latent $z_t$, the VAE-encoded visual condition $\mathcal V$, and the resized binary mask $\mathcal M$ are concatenated along the channel dimension before the MM-DiT denoiser. This keeps sequence length fixed across text-to-image and editing and allows the same model to perform inpainting, editing, layout-guided generation, and auxiliary perception tasks (Fu et al., 16 Mar 2025). The specific UniGen framework of Zhang et al. also keeps a single SD3.5 Medium backbone but processes condition features through CoMoE, where patch features are routed by expert scores $S_e = \mathrm{Linear}(F_n + F_c)$, modulated by condition-type embeddings $E_c$, and then re-injected through WeaveNet at every layer (Zhang et al., 24 Aug 2025).
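The channel-concatenation input defined for UniVG can be sketched with plain nested lists standing in for tensors. This is a simplified illustration under stated assumptions: the VAE encoder is not modeled (the condition is passed in as an already-encoded latent), nearest-neighbor resizing is an assumed choice for $\mathrm{Resize}$, and the names `resize_nearest` and `build_denoiser_input` are hypothetical.

```python
from typing import List

Plane = List[List[float]]   # [height][width]
Tensor3D = List[Plane]      # [channels][height][width]


def resize_nearest(mask: Plane, out_h: int, out_w: int) -> Plane:
    """Nearest-neighbor resize of a 2D mask (assumed stand-in for Resize)."""
    in_h, in_w = len(mask), len(mask[0])
    return [
        [mask[(y * in_h) // out_h][(x * in_w) // out_w] for x in range(out_w)]
        for y in range(out_h)
    ]


def build_denoiser_input(z_t: Tensor3D, cond_latent: Tensor3D, mask: Plane) -> Tensor3D:
    """Concatenate noisy latent, condition latent, and resized mask by channel."""
    h, w = len(z_t[0]), len(z_t[0][0])
    if len(cond_latent[0]) != h or len(cond_latent[0][0]) != w:
        raise ValueError("condition latent must match the spatial size of z_t")
    mask_latent = resize_nearest(mask, h, w)
    # List concatenation along the outer axis is concatenation along channels.
    return z_t + cond_latent + [mask_latent]


if __name__ == "__main__":
    z_t = [[[0.1, 0.2], [0.3, 0.4]] for _ in range(4)]   # 4 x 2 x 2 noisy latent
    cond = [[[1.0, 1.0], [1.0, 1.0]] for _ in range(4)]  # 4 x 2 x 2 condition latent
    mask = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]  # pixel-space mask
    d = build_denoiser_input(z_t, cond, mask)
    print(len(d), len(d[0]), len(d[0][0]))
```

Because the extra inputs widen the channel dimension rather than lengthen the token sequence, the spatial size of the denoiser input stays the same for every task.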

A second family uses token-prepending or unified token streams. UniFusion prepends LAP-refined VLM conditioning tokens to the latent token sequence of the diffusion transformer and relies on full self-attention rather than cross-attention, while VERIFI uses only the rewritten text tokens generated by the VLM during in-model prompt rewriting for conditioning (Li et al., 14 Oct 2025). HiDream-O1-Image generalizes this further by mapping text, condition images, and noisy pixel patches into one shared token space; image-to-image editing is then simply a different sequence composition in which condition image tokens and instruction text precede generation tokens (Cai et al., 11 May 2026). UniGen-1.5 applies the same principle to editing by concatenating semantic condition tokens $\mathcal X_C^U$, the edit instruction tokens, and discrete condition-image tokens before masked prediction of the output image tokens (Tian et al., 18 Nov 2025).

A third family emphasizes visual demonstrations rather than explicit task labels. VisualCloze represents the query and a bounded number of in-context examples as a spatial grid of images, masks out a target cell, and lets a pre-trained infilling model recover the missing image. Because training randomly masks non-terminal cells as well, the same mechanism supports unseen task composition and reverse generation (Li et al., 10 Apr 2025). RealGeneral reaches a similar abstraction through time rather than space: a condition image becomes frame 1, the target image becomes frame 2, and image-to-image generation becomes next-frame prediction inside a video diffusion transformer, with a Unified Conditional Embedding and a Unified Stream DiT Block controlling modality interactions (Lin et al., 13 Mar 2025).
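The grid-infilling formulation can be sketched at the level of cells rather than pixels. The example below is a hedged illustration, not VisualCloze's code: cells are represented by string identifiers, the `"<masked>"` marker and the function name `build_cloze_grid` are hypothetical, and one row per task instance is an assumed layout.

```python
from typing import List, Tuple

Grid = List[List[str]]


def build_cloze_grid(
    demos: List[List[str]], query: List[str], masked_col: int
) -> Tuple[Grid, List[List[int]]]:
    """Stack demonstration rows above the query row and mask one query cell.

    Each row lists image identifiers for one task instance, for example
    ["depth_map", "photo"]. `masked_col` selects the query cell that the
    infilling model is asked to recover.
    """
    width = len(query)
    if any(len(row) != width for row in demos):
        raise ValueError("all rows must have the same number of cells")
    if not 0 <= masked_col < width:
        raise ValueError("masked_col out of range")
    grid = [list(row) for row in demos] + [list(query)]
    mask = [[0] * width for _ in grid]
    mask[-1][masked_col] = 1          # 1 marks the cell to be infilled
    grid[-1][masked_col] = "<masked>"
    return grid, mask


if __name__ == "__main__":
    demos = [["depth_a", "photo_a"], ["depth_b", "photo_b"]]
    # Forward task: recover the photo from the depth map.
    print(build_cloze_grid(demos, ["depth_q", "photo_q"], masked_col=1))
    # Reverse task: same grid, different masked cell.
    print(build_cloze_grid(demos, ["depth_q", "photo_q"], masked_col=0))
```

Changing only `masked_col` switches between forward and reverse generation, which mirrors how a single infilling objective can cover both directions.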

Several systems address a recurring weakness of unified generators: unreliable spatial control. SpatialFusion argues that unified image generation is limited by “geometry-deficient semantic representations in the MLLM” and “geometry-unconstrained diffusion synthesis,” and remedies this by adding a Mixture-of-Transformers spatial branch that predicts metric depth and a depth adapter that injects the predicted depth into the diffusion latents (Qiu et al., 29 Apr 2026). LAC addresses a different form of control failure, in which understanding does not become actionable, by introducing role-structured latent actions for planning, internal drafting, diagnosis, and refinement, all written back into the hidden stream that conditions generation (Zhai et al., 16 May 2026).

Taken together, these mechanisms show that unified generation requires a control pathway, but the literature disagrees on where that pathway should live: in channels, in tokens, in expert-routed condition branches, in geometry scaffolds, in visual demonstrations, or in latent reasoning trajectories.

4. Training curricula and optimization strategies

A common theme in UniGen research is that architectural unification alone is insufficient; most systems rely on staged curricula, specialized losses, or post-training alignment to prevent interference across tasks.

UGen addresses the difficulty of fitting a large unified visual vocabulary through progressive vocabulary learning. During multimodal pretraining, the active vocabulary is gradually expanded from the text vocabulary to include visual tokens, while inactive visual token IDs are replaced by a special placeholder symbol. This curriculum reportedly yields a significant overall performance improvement of 13.3% compared to the vanilla unified autoregressive method (Tang et al., 27 Mar 2025). VisualCloze likewise frames training as curriculum by task density rather than vocabulary: its Graph200K graph-structured dataset supplies 49 annotation types around each central image and supports up to 134 tasks, which the model learns through a single infilling objective (Li et al., 10 Apr 2025).
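Progressive vocabulary learning can be pictured as a schedule that decides how many visual IDs are active at each training step, with all other visual IDs collapsed to a placeholder. The sketch below rests on several assumptions that the source does not specify: a linear schedule, activation in increasing ID order, "32k" read as 32,000, and a hypothetical `PLACEHOLDER_ID`.

```python
from typing import List

TEXT_VOCAB_SIZE = 32_000     # assumption: "32k" read as 32,000
VISUAL_VOCAB_SIZE = 16_384   # visual codebook size stated in the text
PLACEHOLDER_ID = TEXT_VOCAB_SIZE + VISUAL_VOCAB_SIZE  # hypothetical placeholder ID


def active_visual_count(step: int, total_steps: int) -> int:
    """Linear schedule (an assumption): no visual IDs at step 0, all at the end."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    fraction = min(max(step / total_steps, 0.0), 1.0)
    return int(fraction * VISUAL_VOCAB_SIZE)


def mask_inactive(token_ids: List[int], step: int, total_steps: int) -> List[int]:
    """Replace visual IDs that are not yet active with the placeholder."""
    limit = TEXT_VOCAB_SIZE + active_visual_count(step, total_steps)
    visual_end = TEXT_VOCAB_SIZE + VISUAL_VOCAB_SIZE
    out = []
    for t in token_ids:
        is_visual = TEXT_VOCAB_SIZE <= t < visual_end
        out.append(PLACEHOLDER_ID if is_visual and t >= limit else t)
    return out


if __name__ == "__main__":
    sample = [10, 32_000, 40_191, 40_192, 48_383]
    print(mask_inactive(sample, step=50, total_steps=100))
```

Text IDs pass through unchanged at every step, so the language side of the model is never disturbed while the visual vocabulary is phased in.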

Diffusion-based unified systems often use explicit stage separation. UniVG trains in three stages: pure text-to-image foundation training, multi-task joint training over text-to-image, inpainting, outpainting, instruction editing, auxiliary tasks, and layout-guided generation, and then a final stage adding identity-preserving generation at a 1:1 ratio against all other tasks combined (Fu et al., 16 Mar 2025). SpatialFusion uses two stages: geometric-aware pretraining of the Spatial Transformer under metric depth supervision from VGGT, followed by geometry-guided joint training of depth prediction and diffusion under a weighted depth loss, with the loss weight chosen as the best balance between the two objectives (Qiu et al., 29 Apr 2026). CoLoGen makes staging itself the main representational device: it first learns concept-heavy behavior through mask inpainting, then localization through image grounding, then control injection, then customized generation and instruction editing, while progressively adding and freezing experts under veteran gate routing supervision (Song et al., 25 Feb 2026).
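The 1:1 stage-three mixing ratio reported for UniVG can be read as a sampling distribution over tasks. The following sketch shows one such reading; the uniform split among the earlier tasks, the task labels, and the function name `stage_three_weights` are assumptions, since the source gives only the overall ratio.

```python
from typing import Dict, List


def stage_three_weights(new_task: str, existing_tasks: List[str]) -> Dict[str, float]:
    """Sampling probabilities giving `new_task` a 1:1 ratio against the rest.

    Assumption: the existing tasks share their half of the budget uniformly.
    """
    if not existing_tasks:
        raise ValueError("need at least one existing task")
    if new_task in existing_tasks:
        raise ValueError("new_task must not already be in existing_tasks")
    weights = {task: 0.5 / len(existing_tasks) for task in existing_tasks}
    weights[new_task] = 0.5
    return weights


if __name__ == "__main__":
    earlier = [
        "text_to_image", "inpainting", "outpainting",
        "instruction_editing", "auxiliary", "layout_guided",
    ]
    for task, p in stage_three_weights("identity_preserving", earlier).items():
        print(f"{task}: {p:.4f}")
```

A mixture of this shape introduces the new task quickly while still replaying every earlier task, which is one plausible way to limit forgetting during the final stage.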

The UniGen MLLM line adds post-training alignment on top of joint pretraining and SFT. UniGen studies pre-training, supervised fine-tuning, direct preference optimization, and then Chain-of-Thought Verification at test time, where the model generates multiple images and uses its own multimodal reasoning to select better ones (Tian et al., 20 May 2025). UniGen-1.5 replaces test-time verification with unified reinforcement learning through GRPO and shared reward models, and inserts a light Edit Instruction Alignment stage to improve instruction comprehension before RL (Tian et al., 18 Nov 2025). In that system, both text-to-image and editing are optimized with the same ensemble reward, averaged over CLIP-H, HPSv2, UnifiedReward-7B, and ORM (Tian et al., 18 Nov 2025).
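The combination of an averaged ensemble reward with GRPO can be sketched in a few lines. This is an illustrative outline rather than UniGen-1.5's training code: it assumes the per-model scores have already been normalized to a comparable range, uses the population standard deviation with a small `eps` for stability, and the function names and example scores are hypothetical.

```python
from statistics import mean, pstdev
from typing import Dict, List


def ensemble_reward(scores: Dict[str, float]) -> float:
    """Average the per-model scores for one generated sample."""
    return mean(list(scores.values()))


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """GRPO-style advantages: standardize rewards within one prompt's group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Made-up, pre-normalized scores for three samples of the same prompt.
    group = [
        {"CLIP-H": 0.60, "HPSv2": 0.55, "UnifiedReward-7B": 0.70, "ORM": 0.50},
        {"CLIP-H": 0.40, "HPSv2": 0.45, "UnifiedReward-7B": 0.30, "ORM": 0.50},
        {"CLIP-H": 0.80, "HPSv2": 0.75, "UnifiedReward-7B": 0.90, "ORM": 0.70},
    ]
    rewards = [ensemble_reward(s) for s in group]
    print(rewards)
    print(group_relative_advantages(rewards))
```

Because advantages are computed relative to the group, the same reward ensemble can score text-to-image samples and edited images without a separate value model for each task.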

A broader inference from these curricula is that unification is usually learned progressively: concept acquisition, localization, control transfer, instruction following, and preference alignment are rarely solved in one homogeneous optimization stage.

5. Empirical performance and benchmark behavior

Reported results indicate that unified generators are no longer merely convenience models; several now match or exceed specialized baselines on standard generation and editing benchmarks.

On general text-to-image and editing evaluation, UniGen-1.5 reports 0.89 on GenEval and 4.31 on ImgEdit, surpassing BAGEL and reaching performance comparable to GPT-Image-1, while also reporting 86.83 on DPG-Bench (Tian et al., 18 Nov 2025). The earlier UniGen reports a final score of 0.78 on GenEval and 85.19 on DPG-Bench, with gains attributed to its multi-stage pipeline and CoT-V-based Best-of-N selection (Tian et al., 20 May 2025). HiDream-O1-Image reports 0.90 on GenEval for the 8B model and 0.92 for HiDream-O1-Image-Pro, with the 200B+ version also reporting 4.51 on ImgEdit and leading results on CVTG-2K and LongText-Bench (Cai et al., 11 May 2026).

In diffusion-based unified control and editing, UniVG reports GenEval 0.70, T2I-CompBench 0.48, DSG 0.75, and HPSv2 28.2, while also reporting MagicBrush scores of CLIP-T 29.5 and CLIP-I 86.3, EmuEdit scores of CLIP-T 25.9 and CLIP-I 84.7, and Unsplash-50 identity similarity 0.329 (Fu et al., 16 Mar 2025). SpatialFusion reports a best overall average score of 46.33 on GenSpace text-to-image, compared with GPT-4o 43.22 and OmniGen2 31.78, and an editing average score of 35.40, compared with GPT-4o 31.91 and OmniGen2 26.18 (Qiu et al., 29 Apr 2026). These numbers matter because they target explicitly spatial tasks—pose, relation, and measurement—where unified generators had been especially weak.

The specific UniGen framework based on CoMoE and WeaveNet reports state-of-the-art performance across Subjects-200K and MultiGen-20M, including mean FID 12.15 on Subjects-200K over Depth, Canny, and OpenPose, and mean FID 10.57 over nine additional MultiGen-20M conditions, while using 4.69B parameters and 13.96 inference-time units for 12 conditions, compared with 17.38B and 59.16 for ControlNet-style scaling (Zhang et al., 24 Aug 2025). RealGeneral reports a 14.5% improvement in subject similarity for customized generation and a 10% enhancement in image quality for canny-to-image, supporting the claim that video backbones can unify several image generation tasks efficiently (Lin et al., 13 Mar 2025).
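The efficiency comparison is easier to interpret as ratios. The short calculation below uses only the four figures quoted above (4.69B versus 17.38B parameters, 13.96 versus 59.16 inference-time units for 12 conditions); the helper names are illustrative.

```python
def ratio(baseline: float, unified: float) -> float:
    """How many times larger the baseline figure is than the unified one."""
    return baseline / unified


def reduction(baseline: float, unified: float) -> float:
    """Fractional saving of the unified model relative to the baseline."""
    return 1.0 - unified / baseline


# Figures reported for 12 conditions (Zhang et al., 24 Aug 2025).
UNIGEN_PARAMS_B, CONTROLNET_PARAMS_B = 4.69, 17.38
UNIGEN_TIME, CONTROLNET_TIME = 13.96, 59.16

if __name__ == "__main__":
    print(f"parameter ratio: {ratio(CONTROLNET_PARAMS_B, UNIGEN_PARAMS_B):.2f}x")
    print(f"parameter reduction: {reduction(CONTROLNET_PARAMS_B, UNIGEN_PARAMS_B):.1%}")
    print(f"time ratio: {ratio(CONTROLNET_TIME, UNIGEN_TIME):.2f}x")
    print(f"time reduction: {reduction(CONTROLNET_TIME, UNIGEN_TIME):.1%}")
```

On these reported numbers, the unified model uses roughly 3.7 times fewer parameters and about 4.2 times less inference time than ControlNet-style scaling, that is, savings of about 73% and 76% respectively.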

Not all performance advances come from explicit control branches. UniFusion reports that fine-tuning on editing improves text-image alignment for generation and zero-shot generalization to multiple image references, which the paper interprets as cross-modality knowledge transfer from a unified encoder (Li et al., 14 Oct 2025). VisualCloze shows that adding in-context examples improves both seen tasks and unseen task compositions, and that the same infilling model can solve reverse generation problems by masking different grid cells (Li et al., 10 Apr 2025). LAC reports consistent improvements over BAGEL-7B-MoT on GenEval, WISE, and T2I-CompBench, with the largest gains on spatial relations, attribute binding, and world-knowledge-sensitive prompts (Zhai et al., 16 May 2026).

6. Limitations, conceptual tensions, and open directions

The literature also makes clear that unification remains incomplete. One recurring limitation is cost: UniFusion notes that running an 8B VLM with autoregressive rewriting on every generation is costly; HiDream-O1-Image scales the paradigm to over 200B parameters and thereby highlights the resource intensity of pixel-space unified models; and CoLoGen explicitly identifies memory growth as more experts are added across tasks (Li et al., 14 Oct 2025, Cai et al., 11 May 2026, Song et al., 25 Feb 2026).

A second limitation concerns the meaning of “unified.” Some systems truly use one set of weights for many tasks, as in UniVG, while others retain condition-specific LoRAs or stage-specific experts, as in RealGeneral and CoLoGen (Fu et al., 16 Mar 2025, Lin et al., 13 Mar 2025, Song et al., 25 Feb 2026). This suggests that unification is better understood as a spectrum. At one end are monolithic token-space models such as HiDream-O1-Image; at the other are shared-backbone systems that still require modular routing, adapters, or auxiliary encoders.

A third tension is the trade-off between instruction adherence, geometric fidelity, and preservation of source content. SpatialFusion shows that increasing the depth-loss weight improves depth prediction but can hurt image quality if too high (Qiu et al., 29 Apr 2026). The UniGen condition-weaving framework reports slightly lower CLIP-T on some conditions because WeaveNet reduces over-reliance on prompts in favor of condition consistency (Zhang et al., 24 Aug 2025). UniGen-1.5 identifies visual consistency in editing as an open problem because its unified reward emphasizes text-image alignment but does not explicitly reward preservation of unchanged regions (Tian et al., 18 Nov 2025).

Several future directions recur across papers. One is richer spatial representation: SpatialFusion suggests moving beyond single-view metric depth toward normals, camera poses, point clouds, or NeRF-like signals (Qiu et al., 29 Apr 2026). Another is broader multimodal unification: HiDream-O1-Image points toward larger natively unified architectures, and LAC suggests that unified generation benefits when understanding is made actionable through internal control states rather than only encoded passively (Cai et al., 11 May 2026, Zhai et al., 16 May 2026). VisualCloze points to retrieval or better selection of in-context examples for more stable unseen-task generalization (Li et al., 10 Apr 2025). UniGen-1.5, finally, points toward dedicated editing rewards and more explicit source-image consistency objectives in unified RL (Tian et al., 18 Nov 2025).

A common misconception is that UniGen simply means “one model for many prompts.” The technical record suggests a stricter definition: unified image-to-image generation is the attempt to convert heterogeneous conditions, tasks, and objectives into a shared generative process without collapsing control fidelity, spatial precision, or semantic reasoning. The main unresolved question is not whether such unification is possible, but which internal abstraction—shared token space, unified encoder, expert modulation, geometric scaffold, visual in-context demonstration, or latent action policy—best preserves performance as the task set expands.