
X-Omni: Unified Image–Language Generation

Updated 30 July 2025
  • X-Omni is a unified image–language generative framework that integrates semantic image tokenization and a unified autoregressive model to process both visual and textual data.
  • It employs Group Relative Policy Optimization (GRPO) to directly optimize token generation, enhancing image fidelity, instruction adherence, and accurate text rendering.
  • The framework achieves state-of-the-art performance on multimodal benchmarks, demonstrating robust capabilities in complex text-to-image and long text rendering tasks.

X-Omni refers to a unified image–language generative framework that leverages discrete autoregressive modeling, semantic image tokenization, and reinforcement learning (RL) optimization to address major barriers in integrating high-fidelity image and language generation. This approach is characterized by its ability to process and generate both words and visual content using a single large-scale LLM, significantly improving image quality, instruction adherence, and text rendering compared to previous autoregressive methods. X-Omni stands out for employing RL—specifically Group Relative Policy Optimization (GRPO)—to directly optimize the autoregressive policy for final image quality as judged by sophisticated, composite reward functions, overcoming the cumulative error and distribution mismatch endemic to prior unified models (Geng et al., 29 Jul 2025).

1. Architectural Components

Semantic Image Tokenization (SigLIP-VQ)

X-Omni employs SigLIP-VQ, a semantic image tokenizer that discretizes 2D images into sequences of tokens. This component uses a pre-trained SigLIP2-g vision transformer to extract rich semantic features and applies a vector quantizer with a codebook size of 16,384 (embedding dimension 2,048), yielding highly informative but compressible discrete image tokens. This process ensures the preservation of semantic content necessary for subsequent autoregressive modeling and aligns image token semantics with those of language tokens.
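
The quantization step itself is a nearest-neighbour lookup against the learned codebook. The following is a minimal sketch of that lookup, assuming patch features already extracted by the SigLIP2-g backbone; the random tensors stand in for the trained codebook and real features.

```python
import torch

# Minimal sketch of the nearest-neighbour vector-quantization lookup.
# Codebook size and embedding dimension follow the figures quoted above.
CODEBOOK_SIZE, EMBED_DIM = 16_384, 2_048
codebook = torch.randn(CODEBOOK_SIZE, EMBED_DIM)  # learned in practice

def quantize(features: torch.Tensor) -> torch.Tensor:
    """features: (num_patches, EMBED_DIM) -> discrete ids: (num_patches,)"""
    dists = torch.cdist(features, codebook)   # (num_patches, CODEBOOK_SIZE)
    return dists.argmin(dim=-1)               # index of the nearest code

patch_features = torch.randn(256, EMBED_DIM)   # e.g. a 16x16 patch grid
image_tokens = quantize(patch_features)        # values in [0, 16384)
```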

Unified Autoregressive Model

The model employs the Qwen2.5-7B LLM backbone, modified with vision-specific transformer blocks inserted before and after the main language layers. Both text and image tokens are concatenated into a single sequence. A special resolution prefix, denoted as “<SOM> height width <Image> … <EOM>”, specifies the boundaries and spatial properties of the embedded image, supporting correct multi-turn and multi-modal processing. Processing is performed using unified 1D RoPE positional encodings across all modalities, allowing consistent sequence modeling and compatibility with distributed training.
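
To make the sequence layout concrete, here is a hedged sketch of how text and image tokens could be interleaved under the "<SOM> height width <Image> … <EOM>" convention; the exact token spellings and id encoding are illustrative assumptions, not the model's actual vocabulary.

```python
# Illustrative layout of one interleaved training/inference sequence.
SOM, IMG, EOM = "<SOM>", "<Image>", "<EOM>"

def build_sequence(prefix_text, image_token_ids, height, width, suffix_text):
    return (
        list(prefix_text)
        + [SOM, str(height), str(width), IMG]    # resolution prefix
        + [f"<img_{i}>" for i in image_token_ids]  # discrete SigLIP-VQ ids
        + [EOM]
        + list(suffix_text)
    )

seq = build_sequence(["Draw", "a", "cat:"], [17, 4096, 905], 512, 512, ["Done."])
# Unified 1D RoPE then assigns positions 0..len(seq)-1 across both modalities.
```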

Offline Diffusion Decoder

After the model generates discrete semantic image tokens, an offline diffusion decoder—conditioned on projected embedding tokens (e.g., using FLUX.1-dev)—reconstructs high-fidelity images from the token stream. The mapping layer translates the semantic embeddings to the correct feature dimensions required by the diffusion model. This two-stage approach decouples high-level reasoning and compression (autoregressive model) from fine detail synthesis (diffusion model).
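
A minimal sketch of this two-stage decode path follows, assuming the frozen diffusion decoder is exposed as a callable; `DIFFUSION_DIM` and the `diffusion_decode` interface are placeholder assumptions, not FLUX.1-dev's actual API.

```python
import torch
import torch.nn as nn

# Sketch: embed the generated ids, project them to the diffusion model's
# conditioning width via the mapping layer, then run the frozen decoder.
EMBED_DIM, DIFFUSION_DIM = 2_048, 4_096
token_embed = nn.Embedding(16_384, EMBED_DIM)        # lookup for generated ids
mapping_layer = nn.Linear(EMBED_DIM, DIFFUSION_DIM)  # the "mapping layer"

def decode_image(token_ids: torch.Tensor, diffusion_decode):
    cond = mapping_layer(token_embed(token_ids))  # (num_tokens, DIFFUSION_DIM)
    return diffusion_decode(cond)                 # pixel synthesis, run offline
```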

2. Reinforcement Learning Optimization: Group Relative Policy Optimization (GRPO)

Autoregressive image generation historically suffers from token-level error accumulation and from a mismatch between the sequence distributions encountered during training and those encountered at inference (the "exposure bias" problem). In X-Omni, RL is applied to align the generated image token stream with the distribution expected by the diffusion decoder, evaluated directly on final image quality.

The GRPO algorithm is employed as follows:

  • For each prompt $p$ sampled from the prompt distribution $\mathcal{D}$, $G$ trajectories (candidate token sequences) are sampled from the current policy.
  • Each resulting sequence is decoded by the fixed diffusion network into an image and assigned a scalar reward reflecting a composite of criteria: aesthetic quality, instruction fidelity, and text rendering accuracy (evaluated, for instance, by vision-language models such as Qwen2.5-VL-32B).
  • Advantages $A_i$ are computed by normalizing rewards within the group, and the policy is updated using the clipped objective:

$$
\mathcal{J}_{\mathrm{GRPO}}(\theta) = \mathbb{E}_{p \sim \mathcal{D},\ \{o_i\}_{i=1}^{G} \sim \pi_{\theta_{\mathrm{old}}}(\cdot \mid p)} \left[ \frac{1}{G}\sum_{i=1}^{G} \left( \min\!\left( \frac{\pi_\theta(o_i \mid p)}{\pi_{\theta_{\mathrm{old}}}(o_i \mid p)}\, A_i,\ \mathrm{clip}\!\left(\frac{\pi_\theta(o_i \mid p)}{\pi_{\theta_{\mathrm{old}}}(o_i \mid p)},\, 1-\epsilon,\, 1+\epsilon\right) A_i \right) - \beta\, \mathbb{D}_{\mathrm{KL}}\!\left(\pi_\theta \,\|\, \pi_{\theta_{\mathrm{ref}}}\right) \right) \right]
$$

where $\pi_{\theta_{\mathrm{ref}}}$ is a reference policy and $\beta$ is a KL regularization weight. This approach bypasses the need for training a critic network, reducing computation.

This RL-based objective directly optimizes the autoregressive token generator for the quality of the final decoded image, guiding it toward token distributions that yield visually coherent, instruction-aligned outputs.
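
A minimal sketch of this objective follows, simplified to sequence-level log-probabilities with an externally supplied KL estimate; the hyperparameter values are illustrative, not the paper's settings.

```python
import torch

def grpo_loss(logp_new, logp_old, rewards, kl_to_ref, eps=0.2, beta=0.01):
    """Clipped GRPO objective over a group of G sequences from one prompt.

    logp_new, logp_old: (G,) sequence log-probs under the current and
    sampling policies; rewards: (G,) composite scalar rewards; kl_to_ref:
    (G,) per-sequence KL estimates against the frozen reference policy.
    """
    # Group-relative advantages: normalize rewards within the group,
    # which replaces a learned critic/value baseline.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    ratio = torch.exp(logp_new - logp_old)
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)
    # Take the pessimistic (min) surrogate, subtract the KL penalty,
    # and negate so that minimizing this loss maximizes the objective.
    surrogate = torch.min(ratio * adv, clipped * adv) - beta * kl_to_ref
    return -surrogate.mean()
```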

3. Unified Image–Language Generative Modeling

X-Omni’s core paradigm treats both image and text tokens as elements in a shared autoregressive sequence. By prepending explicit cues about the resolution and employing a consistent positional encoding, the model avoids the heterogeneity issues that previously precluded single-architecture solutions. Dedicated transformer blocks for vision tokens minimize interference with pure language processing, while unified next-token prediction enables both modalities to benefit from generalization.

Images, after discretization by the semantic tokenizer, can be placed anywhere within the token stream and freely interleaved with natural language, accommodating complex multi-turn and cross-modal scenarios.

This arrangement allows the model to perform both generation (rendering a new image guided by text) and understanding (classifying or reasoning about visual content in tokenized form) within a single unified architecture.

4. Performance Outcomes and Evaluation

X-Omni achieves state-of-the-art results on various standardized image generation and multimodal benchmarks:

  • On OneIG-Bench’s text rendering (generation of images with readable, accurate embedded text), X-Omni achieves scores of ~0.901 (English) and ~0.895 (Chinese), setting a new standard for unified autoregressive models.
  • On text-to-image generation datasets such as DPG-Bench and GenEval, X-Omni demonstrates superior or competitive overall scores compared to leading open and closed models (e.g., OmniGen2, Janus-Pro).
  • The model’s ability to faithfully render long and complex texts as visual elements is confirmed by best-in-class results on the LongText-Bench, a benchmark specifically targeting long text rendering within generated images.

These results demonstrate that X-Omni’s RL approach substantially mitigates cumulative error and alignment issues, yielding high-fidelity, instruction-following image generation with strong text rendering—capabilities that were previously elusive in discrete autoregressive frameworks.

5. Instruction Following and Advanced Text Rendering

X-Omni exhibits strong instruction following due to reward-based fine-tuning during RL. Human preference scores and image-text alignment metrics are tightly integrated into the reward function, driving the model to render images that accurately reflect both the style and semantics of the input prompt.
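
A hedged sketch of such a composite reward is shown below; the scorer callables (a preference model, a VLM judge such as Qwen2.5-VL-32B, an OCR-style text check) and the equal default weights are illustrative assumptions, not the paper's exact recipe.

```python
# Illustrative composite reward combining the criteria named above.
def composite_reward(image, prompt, scorers, weights=(1.0, 1.0, 1.0)):
    aesthetic = scorers["preference"](image)         # human-preference score
    alignment = scorers["vlm_judge"](image, prompt)  # image-text alignment
    text_acc = scorers["ocr_match"](image, prompt)   # rendered-text accuracy
    w_a, w_l, w_t = weights
    return w_a * aesthetic + w_l * alignment + w_t * text_acc
```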

Long text rendering in images, a historically challenging task (especially for token-based autoregressive models), is specifically enhanced via the RL stage’s focus on text rendering accuracy. This allows X-Omni to generate images such as posters, infographics, and complex layouts where visual structure and textual information must be tightly coupled.

A sample full sequence format utilized during training and inference is:

  • “language tokens <SOM> height width <Image> visual tokens <EOM> language tokens”

This convention, combined with explicit RL rewards on both global visual attributes and detailed textual rendering, enables the model to excel at tasks that require intricate multi-modal coordination.
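
At inference time, recovering the image span from a generated stream reduces to locating these markers. A minimal parsing sketch, assuming well-formed output represented as a list of token strings:

```python
# Recover (height, width, visual tokens) from one generated sequence.
def extract_image(tokens):
    start = tokens.index("<SOM>")
    height, width = int(tokens[start + 1]), int(tokens[start + 2])
    body = tokens.index("<Image>", start) + 1
    end = tokens.index("<EOM>", body)
    return height, width, tokens[body:end]  # visual tokens for the decoder
```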

6. Impact, Limitations, and Prospects

The X-Omni approach demonstrates that with proper semantic tokenization and RL-based sequence optimization, discrete autoregressive models can achieve high visual fidelity, strong alignment to complex instructions, and precise text rendering, overcoming longstanding limitations.

A plausible implication is that this framework revives credible interest in unified token-based multimodal generation, an approach that had largely been set aside in favor of hybrid diffusion/autoregressive schemes because of quality constraints.

X-Omni’s design and methods are positioned to catalyze further work on discrete token learning for vision-language models, facilitate robust multi-turn interactive systems that require deep cross-modal coherence, and inspire architectural and optimization strategies for future unified generative models.

All claims, empirical results, and architectural descriptions are sourced from (Geng et al., 29 Jul 2025).

References

  • Geng et al., 29 Jul 2025.