Skywork UniPic: Unified Multimodal Models

Updated 16 March 2026
  • Skywork UniPic is a family of unified multimodal models that integrate image understanding, text-to-image synthesis, editing, and multi-image composition.
  • The architecture evolves from autoregressive to diffusion-based transformers with innovative data-centric pipelines and unified sequence modeling techniques.
  • Empirical benchmarks show competitive performance at lower parameter counts and faster 8-step inference, enabling deployment on commodity hardware.

Skywork UniPic refers to a family of unified multimodal models developed by Skywork AI that systematically advance the state of the art in visual understanding, text-to-image generation, single-image editing, and (with UniPic 3.0) multi-image composition. Across three major generations, Skywork UniPic transitions from autoregressive architectures to diffusion-based transformers, introduces both architectural and data-centric innovations, and establishes new empirical benchmarks for multimodal models with tractable parameter counts and compute usage (Wei et al., 22 Jan 2026, Wei et al., 4 Sep 2025, Wang et al., 5 Aug 2025).

1. Evolution and Foundational Principles

The original Skywork UniPic (1.5B) (Wang et al., 5 Aug 2025) demonstrated that a single model could unify image understanding, high-fidelity text-to-image synthesis, and precise image editing within a compact framework. It was followed by UniPic 2.0 (Wei et al., 4 Sep 2025), which transitioned to a diffusion transformer (DiT) backbone and brought reinforcement learning (RL) into unified multimodal training. UniPic 3.0 (Wei et al., 22 Jan 2026) marks a further leap, explicitly addressing multi-image composition, especially Human-Object Interaction (HOI), through a unified sequence-modeling paradigm for both editing and composition. Across the series, the model's operational scope is extended to arbitrary resolutions, dynamic input counts, and integrated inference acceleration.

2. Architectural Innovations

UniPic 1.0

The first Skywork UniPic employs a decoupled encoding strategy to balance modality-specific and unified representation learning. It features:

  • A Masked Autoregressive (MAR) encoder–decoder pair for synthesis, targeting pixel fidelity.
  • A SigLIP2 encoder for semantic (understanding) tasks.
  • A shared Qwen2.5-1.5B-Instruct decoder, acting autoregressively on fused multimodal tokens.
  • Lightweight MLP projections that connect the encoders to the decoder.

This design enables left-to-right token generation for all tasks, with active parameters confined to the LLM and projections after initial pretraining (Wang et al., 5 Aug 2025).

UniPic 2.0

Skywork UniPic 2.0 replaces the autoregressive decoder with a 2B-parameter DiT backbone. The reference image VAE latent is injected into every self-attention layer, and instruction text is processed by a frozen T5 encoder. A connector module aligns a frozen Qwen2.5-VL-7B MLLM with the DiT for joint unified training. The architecture supports both text-to-image generation and image editing, operational at arbitrary input aspect ratios and resolutions (Wei et al., 4 Sep 2025).

UniPic 3.0

UniPic 3.0 pioneers a fully unified diffusion-based approach capable of handling both text-guided single-image editing and multi-image HOI-centric composition tasks. Major architectural elements include:

  • Conditional encoding of text instructions via Qwen2.5-VL.
  • VAE encoding of all input images to latents, deterministically packed into sequences of patch tokens, with shape descriptors retaining spatial structure.
  • An MMDiT backbone that processes the sequence X = [packed latents | shape descriptors | condition tokens].
  • The entire multi-image composition task is cast as sequence-to-sequence denoising, governed by continuous-time flow matching and consistency objectives (Wei et al., 22 Jan 2026).
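The packing step listed above can be illustrated with a small, dependency-free sketch. It is not the UniPic 3.0 implementation: the patch size of 2, the channel-major token layout, and the helper names (`pack_latent`, `pack_sequence`, `split_sequence`, `make_latent`) are all assumptions chosen for illustration. It only shows how latents of different sizes can be flattened into one token sequence while per-image shape descriptors keep enough information to separate them again.

```python
from typing import List, Tuple

Latent = List[List[List[float]]]  # indexed as [channel][row][column]
Token = List[float]
Shape = Tuple[int, int]           # (patch rows, patch columns) of one image


def pack_latent(latent: Latent, patch: int = 2) -> Tuple[List[Token], Shape]:
    """Split a C x H x W latent into row-major patch tokens of length C*patch*patch."""
    channels, height, width = len(latent), len(latent[0]), len(latent[0][0])
    if height % patch or width % patch:
        raise ValueError("latent height and width must be divisible by the patch size")
    rows, cols = height // patch, width // patch
    tokens: List[Token] = []
    for r in range(rows):
        for c in range(cols):
            tokens.append([
                latent[ch][r * patch + dy][c * patch + dx]
                for ch in range(channels)
                for dy in range(patch)
                for dx in range(patch)
            ])
    return tokens, (rows, cols)


def pack_sequence(latents: List[Latent], patch: int = 2) -> Tuple[List[Token], List[Shape]]:
    """Concatenate the tokens of several latents; keep one shape descriptor per image."""
    sequence: List[Token] = []
    shapes: List[Shape] = []
    for latent in latents:
        tokens, shape = pack_latent(latent, patch)
        sequence.extend(tokens)
        shapes.append(shape)
    return sequence, shapes


def split_sequence(sequence: List[Token], shapes: List[Shape]) -> List[List[Token]]:
    """Recover per-image token groups from the flat sequence via the shape descriptors."""
    groups, start = [], 0
    for rows, cols in shapes:
        count = rows * cols
        groups.append(sequence[start:start + count])
        start += count
    return groups


def make_latent(channels: int, height: int, width: int, fill: float) -> Latent:
    """Build a constant-valued toy latent (stand-in for a real VAE output)."""
    return [[[float(fill)] * width for _ in range(height)] for _ in range(channels)]


# Two toy reference latents with different spatial sizes.
refs = [make_latent(4, 4, 6, 1.0), make_latent(4, 8, 4, 2.0)]
sequence, shapes = pack_sequence(refs, patch=2)
groups = split_sequence(sequence, shapes)
```

Because the shape descriptors travel with the sequence, images with different aspect ratios can share one sequence without padding to a common size, which is one plausible reading of how arbitrary resolutions and input counts are supported.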

3. Training Methodologies and Data Pipelines

Progressive Training and Dynamic Unfreezing (UniPic 1.0)

Training proceeds through masked autoregressive pretraining, alignment via MLPs, joint cross-modal fine-tuning, and final reward-augmented supervised fine-tuning. The curriculum increases resolution and unfreezes parameters in staged fashion, with task-specific reward models (Skywork-ImgReward, Skywork-EditReward) guiding sample selection and RL objectives (Wang et al., 5 Aug 2025).

Progressive Dual-Task Reinforcement (UniPic 2.0)

A two-phase RL regime, Progressive Dual-Task Reinforcement (PDTR), trains the diffusion model for both editing and generation. Flow-GRPO is extended to accommodate both tasks, with separate, staged reward functions (Skywork-EditReward for editing; GenEval plus automated detectors for generation). Empirically, editing RL is reported not to harm generation metrics, and the two tasks may benefit each other (Wei et al., 4 Sep 2025).
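The group-relative advantage used by GRPO-style methods can be sketched in a few lines. This is a minimal illustration, not the authors' Flow-GRPO extension: the reward values are invented, `group_relative_advantages` is a hypothetical helper, and the use of the population standard deviation is an assumption (implementations differ on this point).

```python
from statistics import mean, pstdev
from typing import Dict, List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one group of samples generated for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; some implementations use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Invented rewards for one prompt per task; each task is scored by its own reward function.
rewards_by_task: Dict[str, List[float]] = {
    "editing": [6.1, 7.4, 5.2, 6.9],         # e.g. scores from an editing reward model
    "generation": [0.75, 0.90, 0.60, 0.85],  # e.g. scores from a generation benchmark metric
}

advantages_by_task = {
    task: group_relative_advantages(rewards)
    for task, rewards in rewards_by_task.items()
}
```

Normalizing within each group puts the two tasks' rewards on a comparable scale even though their raw ranges differ, which is one reason a single policy update rule can serve both editing and generation.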

Data Collection, Filtering, and Synthesis (UniPic 3.0)

UniPic 3.0 achieves sample efficiency through a multi-stage pipeline:

  • Collection: 18K person images (CC12M, auto-captioned), 150K HOI objects (prompted, synthesized via Qwen-Image).
  • Filtering: Quality (InternVL3.5 score >75), visibility (>90% face), minimum size (objects >768×768, CLIPScore >0.3).
  • Synthesis: Hybrid composition with K∈[2,6] input images, guided by manually defined HOI conflict matrices and composition prompts from InternVL3.5. Composites with 2–3 inputs are synthesized with Nano-Banana and those with 4–6 inputs with Seedream 4.0, with per-composite verification and re-synthesis as needed.
  • Dataset composition: 215K internal HOI composites + 150K MICo-150K + 381K editing samples yield ≈746K triplets for training (Wei et al., 22 Jan 2026).
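The filtering thresholds and the dataset total above can be written down as a short sketch. The record types and field names are hypothetical, and the assignment of thresholds to image types (quality for both, face visibility for persons, size and CLIPScore for objects) is one reading of the list, not a confirmed detail of the pipeline.

```python
from dataclasses import dataclass


@dataclass
class PersonImage:
    quality_score: float    # InternVL3.5 quality score (0-100 scale assumed)
    face_visibility: float  # visible fraction of the face, 0.0-1.0


@dataclass
class ObjectImage:
    quality_score: float
    width: int
    height: int
    clip_score: float


def keep_person(img: PersonImage, min_quality: float = 75.0, min_face: float = 0.90) -> bool:
    """Keep person images that pass the quality and face-visibility thresholds."""
    return img.quality_score > min_quality and img.face_visibility > min_face


def keep_object(img: ObjectImage, min_quality: float = 75.0,
                min_side: int = 768, min_clip: float = 0.3) -> bool:
    """Keep object images that pass the quality, size, and CLIPScore thresholds."""
    return (
        img.quality_score > min_quality
        and img.width > min_side
        and img.height > min_side
        and img.clip_score > min_clip
    )


# Training set composition as listed above (counts in samples).
TRAINING_SOURCES = {
    "internal_hoi_composites": 215_000,
    "mico_150k": 150_000,
    "editing_samples": 381_000,
}
total_triplets = sum(TRAINING_SOURCES.values())
```

The three sources sum to 746,000, which matches the ≈746K triplets quoted for training.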

4. Unified Sequence Modeling and Sampling Acceleration

The hallmark of UniPic 3.0 is its reformulation of multi-image composition (and by extension editing) as a unified sequence modeling problem in the latent space:

  • For K reference images and one target image, each VAE latent is deterministically "packed" into patch tokens.
  • The model $F_\theta(S_t, t; H, \text{text})$ predicts clean latent sequences from noisy inputs $S_t$, minimizing a loss against the ODE vector field (flow matching).
  • Training comprises three phases: flow-matching pretraining, continuous-time consistency tuning, and distribution matching distillation (minimizing the reverse KL divergence between the student's and the teacher's few-step samplers).
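For orientation, one common way to write a flow-matching objective is the linear (rectified-flow) form below. This is an assumed, generic formulation and not necessarily the parameterization used in UniPic 3.0: here $S_0$ denotes the clean packed latent sequence, $\epsilon$ is Gaussian noise, and the network output is regressed onto the velocity of the interpolation path. The last line shows how a clean-latent estimate follows from a velocity prediction under this assumption.

```latex
% Assumed linear interpolation between clean tokens S_0 and noise \epsilon
S_t = (1 - t)\, S_0 + t\, \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I), \quad t \in [0, 1]

% Flow-matching loss: regress the network output onto the path velocity
\mathcal{L}_{\mathrm{FM}} = \mathbb{E}_{S_0,\, \epsilon,\, t}
  \left\| F_\theta(S_t, t; H, \text{text}) - (\epsilon - S_0) \right\|_2^2

% Clean-latent estimate implied by a velocity prediction \hat{v}
\hat{S}_0 = S_t - t\, \hat{v}
```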

Inference acceleration is achieved by post-training distillation into an 8-step sampler, using trajectory mapping for ODE alignment followed by $D_{\mathrm{KL}}$ refinement. The reported result is a 12.5x speedup over naïve sampling (8 vs. 100+ steps) with no perceptible degradation in sample quality (Wei et al., 22 Jan 2026).
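The step-count arithmetic, and the reason straight sampling trajectories tolerate few steps, can be seen in a toy Euler integrator. This is a sketch under strong assumptions: `straight_velocity` is a hand-written constant velocity field standing in for a trained network, and nothing here reproduces the actual distillation procedure.

```python
from typing import Callable, List

Vector = List[float]


def euler_sample(velocity: Callable[[Vector, float], Vector],
                 start: Vector, steps: int) -> Vector:
    """Integrate dx/dt = velocity(x, t) from t=1 (noise) down to t=0 (data)."""
    x = list(start)
    dt = 1.0 / steps
    for i in range(steps):
        t = 1.0 - i * dt
        v = velocity(x, t)
        x = [xi - dt * vi for xi, vi in zip(x, v)]  # one network call per step
    return x


# Toy "data" and "noise" endpoints of a straight path x_t = (1 - t) * data + t * noise.
data = [0.5, -1.0, 2.0]
noise = [1.5, 0.25, -0.75]


def straight_velocity(x: Vector, t: float) -> Vector:
    """Velocity of the straight path above; constant in x and t (toy assumption)."""
    return [n - d for n, d in zip(noise, data)]


few = euler_sample(straight_velocity, noise, steps=8)
many = euler_sample(straight_velocity, noise, steps=100)

# Cost is proportional to the number of velocity evaluations.
speedup = 100 / 8
```

With a perfectly straight path, 8 and 100 Euler steps land on the same endpoint, so the 12.5x reduction in evaluations costs nothing in accuracy. Real models have curved trajectories, which is why consistency tuning and distillation are needed to approach this ideal.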

5. Empirical Performance and Benchmarking

UniPic 1.0 (1.5B)

  • GenEval: 0.86 (compositional generation)
  • DPG-Bench: 85.5 (complex generation)
  • GEditBench-EN: 5.83 (editing)
  • ImgEdit-Bench: 3.49
  • Comparable or superior to 7B–19B models at 1.5B parameters, with GPU memory requirements (<15GB for 1024×1024 inference) enabling commodity deployment (Wang et al., 5 Aug 2025).

UniPic 2.0 (2B+7B Metaquery)

  • GenEval: 0.90 (with Metaquery connector), 0.89 for UniPic2-SD3.5M-Kontext
  • DPG: 83.79 (Metaquery), 84.23 (Kontext)
  • GEdit-EN: 6.87 (Metaquery), 6.59 (Kontext)
  • ImgEdit: 4.03 (Metaquery), 4.00 (Kontext)
  • Outperforms much larger models (BAGEL, Flux-Kontext) on both generation and editing at lower parameter counts and supports multimodal tasks in a unified paradigm (Wei et al., 4 Sep 2025).

UniPic 3.0

  • ImgEdit-Bench: 4.35 (vs. Seedream 4.0: 4.11, Qwen-Image-Edit-2509: 4.31)
  • GEdit-Bench: 7.55 (Seedream 4.0: 7.66, Qwen-Image-Edit-2509: 7.61)
  • MultiCom-Bench (VIEScore, 2–3 inputs): 0.8214 (Nano-Banana: 0.7982, Seedream 4.0: 0.7997)
  • MultiCom-Bench (4–6 inputs): 0.6296 (Nano-Banana: 0.6466, Seedream 4.0: 0.6197)
  • Overall MultiCom-Bench: 0.7255 (Nano-Banana: 0.7224, Seedream 4.0: 0.7088)
  • Reported strengths include spatial consistency, occlusion handling, and instruction adherence for multi-image HOI scenes. Supports 1–6 input images at arbitrary resolutions up to a 1024×1024 pixel budget (Wei et al., 22 Jan 2026).
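As a sanity check on the MultiCom-Bench numbers above, the overall score can be compared with the unweighted mean of the two subset scores. The source does not state how the overall score is aggregated, so the unweighted mean is an assumption; the sketch only reports how far each listed overall value is from it.

```python
# Scores copied from the list above; the aggregation rule is an assumption.
scores = {
    "UniPic 3.0": {"2-3": 0.8214, "4-6": 0.6296, "overall": 0.7255},
    "Nano-Banana": {"2-3": 0.7982, "4-6": 0.6466, "overall": 0.7224},
    "Seedream 4.0": {"2-3": 0.7997, "4-6": 0.6197, "overall": 0.7088},
}


def unweighted_mean(entry: dict) -> float:
    """Average of the 2-3 input and 4-6 input subset scores."""
    return (entry["2-3"] + entry["4-6"]) / 2


# Difference between the unweighted mean and the listed overall score.
gaps = {
    name: round(unweighted_mean(entry) - entry["overall"], 4)
    for name, entry in scores.items()
}

for name, gap in gaps.items():
    print(f"{name}: mean of subsets minus listed overall = {gap:+.4f}")
```

The listed overall values for UniPic 3.0 and Nano-Banana coincide with the unweighted mean, while Seedream 4.0 differs by 0.0009, so the benchmark may weight subsets by sample count or the figure may be rounded differently.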

6. Comparative Summary of Key Features

| Generation | Backbone | Core Tasks Supported | Sequence Modeling | RL Training | Sample Efficiency/Acceleration | Public Availability |
|---|---|---|---|---|---|---|
| UniPic 1.0 | Autoregressive | Understanding, Gen, Edit | No | Yes (SFT RL) | Progressive unfreezing | Yes |
| UniPic 2.0 | DiT (2B, 7B) | Gen, Edit, Multimodal (Metaquery) | No | Yes (PDTR) | Efficient via connector | Yes |
| UniPic 3.0 | MMDiT-Diffusion | Gen, Edit, HOI Multi-Composition | Yes | Implicit (flow/consist.) | 8-step sampler (12.5x faster) | Yes |

This synthesis highlights the evolution of architectural, training, and data-centric methodologies across Skywork UniPic’s three generations.

7. Significance and Practical Implications

Skywork UniPic’s sequence-to-sequence approach to multi-image composition represents a foundational advance in conditional generative modeling. The unified model architecture, sample-efficient HOI datasets, and hybrid inference acceleration enable deployment on commodity hardware while attaining performance competitive with, or surpassing, much larger models. Furthermore, the explicit support for arbitrary input counts and resolutions, consistent handling of occlusions and composite spatial structure, and strong text-guided editing and composition suggest practical applicability to a wide spectrum of vision-language tasks, including professional design workflows and forensic compositing (Wei et al., 22 Jan 2026, Wei et al., 4 Sep 2025, Wang et al., 5 Aug 2025).
