
Ming-Lite-Uni: Multimodal Open Framework

Updated 30 June 2025
  • Ming-Lite-Uni is an open-source multimodal framework that unifies text and image processing using a diffusion-based generator integrated with autoregressive models.
  • It features multi-scale learnable tokens and representation alignment mechanisms to ensure semantic consistency and high-fidelity visual outputs.
  • The system supports versatile applications, including instruction-based image editing and text-to-image synthesis, serving as a core module in unified multimodal architectures.

Ming-Lite-Uni is an open-source multimodal framework designed to unify visual generation and multimodal autoregressive modeling, targeting seamless integration of vision and language within a single, extensible architecture. The system combines a frozen multimodal large language model (MLLM), a learnable diffusion-based visual generator, and architectural innovations such as multi-scale learnable tokens, multi-scale representation alignment, and connector modules that bridge latent representations. Ming-Lite-Uni enables fluid, instruction-driven interaction, including high-fidelity text-to-image synthesis and multi-turn image editing, and serves both as an independent research platform and as a critical component of larger unified models such as Ming-Omni.

1. Architectural Design: Unified Visual Generator and Autoregressive Model

Ming-Lite-Uni centers on a unified architecture that tightly couples a diffusion-based image generator with a multimodal autoregressive (AR) backbone. The diffusion generator is responsible for producing synthetic images at high resolution, guided by semantically rich latent variables derived from the AR model. The AR backbone, based on state-of-the-art multimodal LLMs, operates natively on text and image tokens, learning the alignment and optimal transition dynamics between modalities.
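
The coupling described above can be pictured as a thin composition of three modules: a frozen AR backbone, a trainable connector, and a trainable diffusion generator. The PyTorch sketch below is illustrative only; the class names (`Connector`, `MingLiteUniSketch`), dimensions, and interfaces are assumptions, not the actual Ming-Lite-Uni API.

```python
import torch
import torch.nn as nn

# Minimal sketch of the unified architecture: a frozen AR backbone produces
# semantically rich latents; a trainable connector projects them into the
# conditioning space of a trainable diffusion-based image generator.
# All class names and dimensions here are illustrative assumptions.

class Connector(nn.Module):
    def __init__(self, ar_dim: int = 2048, cond_dim: int = 1152):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(ar_dim, cond_dim), nn.GELU(), nn.Linear(cond_dim, cond_dim)
        )

    def forward(self, ar_hidden: torch.Tensor) -> torch.Tensor:
        return self.proj(ar_hidden)           # (B, T, cond_dim)

class MingLiteUniSketch(nn.Module):
    def __init__(self, mllm: nn.Module, generator: nn.Module):
        super().__init__()
        self.mllm = mllm                       # frozen multimodal AR backbone
        self.connector = Connector()           # trainable bridge
        self.generator = generator             # trainable diffusion generator
        for p in self.mllm.parameters():       # keep understanding capabilities intact
            p.requires_grad_(False)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():
            ar_hidden = self.mllm(tokens)      # (B, T, ar_dim) semantic latents
        cond = self.connector(ar_hidden)       # project into generator space
        return self.generator(cond)            # synthesize the image
```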

The visual generator employs diffusion models (e.g., SANA, DiT variants) capable of conditioning on multi-scale tokenized representations. Training utilizes an integrated FlowMatching loss:

$$\mathcal{L}_{\text{FlowMatching}} = \mathbb{E}\left[\, \| \phi(\mathbf{x}) - \phi(\mathbf{z}) \|^2 \,\right]$$

where $\phi$ extracts hidden features at various scales, ensuring alignment between intermediary semantic states and final outputs.
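
A minimal sketch of this feature-matching objective is given below, assuming `phi` is a callable that returns a list of per-scale feature tensors; the exact feature extractor and scale weighting used by Ming-Lite-Uni are not specified here and are treated as assumptions.

```python
import torch

def flow_matching_feature_loss(phi, x: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
    """Squared-error matching between features of the target image x and the
    generator output z, accumulated over scales. `phi` is assumed to return a
    list of per-scale feature tensors; this is an illustrative stand-in, not
    the exact loss implementation used by Ming-Lite-Uni."""
    feats_x = phi(x)
    feats_z = phi(z)
    loss = torch.zeros((), device=x.device)
    for fx, fz in zip(feats_x, feats_z):
        loss = loss + (fx - fz).pow(2).mean()
    return loss
```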

In this design, the AR backbone is frozen while the diffusion and multi-scale token modules are fine-tuned, preserving the original capabilities for visual and textual understanding while extending generation and editing performance.
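
This decoupled recipe amounts to a simple parameter partition: freeze the AR backbone and optimize only the generator-side modules. The sketch below is schematic, assuming the `MingLiteUniSketch` layout from the earlier example rather than the project's actual training script.

```python
import torch

# Schematic optimizer setup: only generator-side parameters receive gradients;
# the AR backbone stays frozen so its understanding capabilities are preserved.
# `model` is assumed to follow the MingLiteUniSketch layout sketched earlier.
def build_optimizer(model, lr: float = 1e-4) -> torch.optim.Optimizer:
    for p in model.mllm.parameters():
        p.requires_grad_(False)
    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.AdamW(trainable, lr=lr)
```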

2. Integration of MetaQueries and M2-omni Frameworks

Ming-Lite-Uni implements and extends the MetaQueries and M2-omni frameworks. MetaQueries introduce dynamically adaptable “query tokens” that facilitate cross-modal information transfer and flexible bridging between modalities. M2-omni equips the AR model with distinct text/image branches and incorporates multi-modal positional encoding schemas (notably M-RoPE) for reliable integration and scaling.
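
As a rough illustration of the query-token idea, a fixed set of learnable embeddings can be appended to the MLLM input sequence so that their output states carry generation-relevant conditioning. The class below is a hedged sketch; `num_queries` and the embedding dimension are arbitrary assumptions rather than Ming-Lite-Uni's configuration.

```python
import torch
import torch.nn as nn

class LearnableQueryTokens(nn.Module):
    """Sketch of MetaQueries-style learnable query tokens: fixed-count
    embeddings appended to the multimodal token sequence so the frozen MLLM
    writes generation-relevant information into their hidden states."""
    def __init__(self, num_queries: int = 64, dim: int = 2048):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)

    def forward(self, token_embeds: torch.Tensor) -> torch.Tensor:
        b = token_embeds.size(0)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)   # (B, Q, D)
        return torch.cat([token_embeds, q], dim=1)        # append query tokens
```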

By freezing the MLLM during image generator training, Ming-Lite-Uni ensures stability and prevents catastrophic forgetting of previously learned capabilities, allowing seamless bi-directional transitions between understanding and generation. This also enables a broader set of advanced tasks including conversational image generation, description, and editing—all within a unified token and loss framework.

3. Multi-Scale Learnable Tokens and Representation Alignment

A distinctive core feature of Ming-Lite-Uni is its systematic use of multi-scale learnable tokens for visual representation. For an image input $\mathbf{x}$, token sequences are constructed at several scales $\mathcal{S} = \{s_1, s_2, \ldots, s_K\}$, with each scale $s_k$ associated with its own set of learnable tokens $Q_{s_k} \in \mathbb{R}^{N_{s_k} \times d}$.

Each sequence is bracketed by scale-delineated start/end tokens and augmented via scale-wise positional encodings:

$$Q_{s_k, \text{final}} = [\text{Start}_{s_k},\, Q_{s_k},\, \text{End}_{s_k}] + \text{PE}_{s_k}$$

All scale-augmented token sequences are concatenated and input into a transformer encoder, permitting joint modeling of dependencies across scales. The output hidden states are then used by the diffusion generator for coarse-to-fine image synthesis.
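
A compact sketch of this construction is shown below: per-scale learnable tokens are wrapped with start/end markers, given scale-wise positional encodings, and concatenated before entering a transformer encoder. Scale sizes, dimensions, and the class name are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class MultiScaleTokens(nn.Module):
    """Illustrative multi-scale learnable tokens: each scale s_k owns N_{s_k}
    tokens plus start/end markers and a scale-wise positional encoding; all
    scales are concatenated for joint transformer encoding."""
    def __init__(self, scale_sizes=(4, 16, 64), dim: int = 1024):
        super().__init__()
        self.tokens = nn.ParameterList(
            [nn.Parameter(torch.randn(n, dim) * 0.02) for n in scale_sizes]
        )
        self.start = nn.ParameterList(
            [nn.Parameter(torch.randn(1, dim) * 0.02) for _ in scale_sizes]
        )
        self.end = nn.ParameterList(
            [nn.Parameter(torch.randn(1, dim) * 0.02) for _ in scale_sizes]
        )
        self.pos = nn.ParameterList(
            [nn.Parameter(torch.randn(n + 2, dim) * 0.02) for n in scale_sizes]
        )

    def forward(self, batch_size: int) -> torch.Tensor:
        seqs = []
        for q, s, e, pe in zip(self.tokens, self.start, self.end, self.pos):
            seq = torch.cat([s, q, e], dim=0) + pe           # bracket + scale-wise PE
            seqs.append(seq)
        full = torch.cat(seqs, dim=0)                        # concatenate all scales
        return full.unsqueeze(0).expand(batch_size, -1, -1)  # (B, sum(N_k)+2K, D)
```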

To enforce semantic and visual consistency, a multi-scale representation alignment loss is introduced:

$$\mathcal{L}_{\text{align}} = \sum_{k} \left\| H_{s_k} - H_{\text{final}}^{s_k} \right\|_2^2$$

where $H_{s_k}$ denotes the hidden state at scale $s_k$. This alignment is empirically shown to improve reconstruction quality (gains of more than 2 dB PSNR) and to close the gap between semantic AR guidance and visual fidelity.
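
The alignment term can be sketched as a sum of squared distances between each scale's hidden states and a reference derived from the final scale. The snippet below assumes the per-scale states have already been pooled or interpolated to matching shapes, which is a simplification of whatever projection the actual implementation uses.

```python
import torch

def multiscale_alignment_loss(hidden_by_scale, final_reference) -> torch.Tensor:
    """Sum of squared L2 distances between each scale's hidden state and a
    reference representation derived from the final scale. Both arguments are
    lists of shape-matched tensors; this is a simplified illustration only."""
    loss = torch.zeros((), device=final_reference[0].device)
    for h_k, h_ref in zip(hidden_by_scale, final_reference):
        loss = loss + (h_k - h_ref).pow(2).sum(dim=-1).mean()
    return loss
```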

4. Text-to-Image Generation and Instruction-Based Image Editing

Ming-Lite-Uni supports a range of native tasks including text-to-image synthesis, instruction-based image editing, and compositional or multi-turn vision-language interactions. In typical use:

  1. The AR MLLM encodes multimodal context (text, images).
  2. Contextual information is transformed into multi-scale token sequences, encoding layout, structure, and detailed cues.
  3. A connector module projects these tokens/latents into the input space for the diffusion generator.
  4. The diffusion process samples the output image, guided via multi-scale alignment losses for enhanced semantic consistency.

Instruction-based editing is performed by providing an image and a text instruction; the AR model and diffusion generator then produce faithful modifications (e.g., object addition/removal, fine-grained attribute adjustment) while preserving details not targeted by the instruction.
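
The four-step flow above, together with the editing variant, can be summarized as a schematic pipeline. Every function and module name here (`mllm`, `multiscale`, `connector`, `generator.sample`) is a hypothetical placeholder for illustration and does not correspond to the published code.

```python
import torch

def generate_image(mllm, multiscale, connector, generator,
                   text_tokens: torch.Tensor,
                   ref_image: torch.Tensor | None = None) -> torch.Tensor:
    """Schematic text-to-image / instruction-editing flow.
    Steps mirror the numbered list above; all module interfaces are assumptions."""
    # 1. Encode the multimodal context (text, and optionally a reference image to edit).
    with torch.no_grad():
        context = mllm(text_tokens, image=ref_image)
    # 2. Expand the context into multi-scale token hidden states.
    ms_hidden = multiscale(context)
    # 3. Project the latents into the diffusion generator's conditioning space.
    cond = connector(ms_hidden)
    # 4. Sample the output image from the diffusion process.
    return generator.sample(cond)
```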

5. Empirical Evaluation and Benchmark Performance

Ming-Lite-Uni demonstrates strong or leading performance across textual, visual, and multimodal benchmarks:

  • Understanding: Outperforms or matches closed and open models (GPT-4o, Gemini-1.5-Pro, LLaVA, Qwen2-VL, Chameleon, Janus) on MMB, MMS, MMMU, AI2D, MathVista, MM-Vet, and HallusionBench.
  • Generation: Achieves a GenEval score of 0.62, outperforming MetaQueries (0.61), Show-o (0.53), TokenFlow-XL (0.55), and SDXL (0.55), and remaining competitive with DALL-E 3 (0.67).
  • Image Editing: Delivers high precision and fluency in complex instruction-based editing with competitive results on MagicBrush, SEED, UltraEdit, and style transfer datasets.

A summary table extracted from the data:

| Model | GenEval (Overall) | MMBench (VQA) | Instruction Editing | Open/Closed |
|---|---|---|---|---|
| Ming-Lite-Uni (Ours) | 0.62 | 80.7 | ✓ (High fluency) | Open |
| MetaQueries | 0.61 | -- | -- | Open |
| Janus-Pro-1B | 0.73 | 79.2 | -- | Open |
| DALL-E 3 | 0.67 | -- | -- | Closed |
| SDXL | 0.55 | -- | -- | Open |

Ming-Lite-Uni is noted for maintaining strong comprehension alongside strong generation/editing capabilities, in contrast to most models, which trade off capability between the two.

6. System Integration and Open-Source Significance

Ming-Lite-Uni is released with all code and model weights open-sourced (GitHub link), promoting reproducibility, extensibility, and experimental innovation. The system forms a foundational module in larger unified models (e.g., Ming-Omni, which relies on Ming-Lite-Uni for high-fidelity visual generation alongside advanced audio and text capabilities).

Within Ming-Omni, Ming-Lite-Uni operates as a distinct module receiving context from the MoE-based “Ling” core. Decoupling the generation module from the frozen MLLM avoids negative transfer and supports dynamic, context-grounded visual tasks. This modular design promotes scalability, task generality, and robust community engagement.

7. Technical Distinctions and Practical Impact

Ming-Lite-Uni is characterized by:

  • Decoupled training, preserving AR/LLM understanding while advancing generation quality.
  • Systematic fusion of multi-scale, learnable token representations, facilitating coarse-to-fine, semantically controlled generation.
  • Explicit representation alignment between AR latent states and diffusion-generated images for improved controllability and fidelity.
  • Support for a unified conversational pipeline that encompasses both image and language generation/editing and is extensible for inclusion in broader multimodal systems.

These features position Ming-Lite-Uni as a key open architecture for natural multimodal AI, both as a standalone research testbed and as a core building block toward unified AGI frameworks.