Rex-Omni: 3B-Scale Multimodal Model

Updated 15 October 2025
  • Rex-Omni is a 3B-scale multimodal model that reformulates visual tasks as token sequences to streamline detection, OCR, and object referencing.
  • It leverages the Qwen2.5-VL backbone and discrete coordinate tokens to convert regression tasks into structured language modeling outputs.
  • With robust zero-shot performance on benchmarks like COCO and LVIS, it sets new standards for unified cross-modal visual reasoning.

Rex-Omni refers to a 3B-scale multimodal LLM (MLLM) designed to unify and excel in a broad spectrum of visual perception tasks—including object detection, object referring, visual prompting, OCR, and spatial grounding—by leveraging next-point prediction via discrete coordinate tokens mapped to the model’s vocabulary. Built on the Qwen2.5-VL backbone, Rex-Omni reformulates conventional regression-based visual tasks into structured language-model-style token sequences, achieving performance comparable or superior to state-of-the-art detectors in zero-shot settings while maintaining robust semantic understanding.

1. Architectural Overview and Coordinate Prediction Paradigm

Rex-Omni adopts the Qwen2.5-VL architecture in its 3B-parameter configuration, introducing minimal but impactful modifications. The core innovation is the repurposing of 1,000 vocabulary tokens to represent quantized image coordinates in the range [0, 999]. Each spatial prediction (bounding box or keypoint) is emitted autoregressively as a sequence of tokens encoding discrete coordinates:

  • For bounding boxes: <box_start> ⟨x₀⟩ ⟨y₀⟩ ⟨x₁⟩ ⟨y₁⟩ <box_end>
  • For points: <point_start> ⟨x⟩ ⟨y⟩ <point_end>

This quantization-based approach simplifies the mapping between visual locations and the model’s output space, reducing sequence length and learning complexity relative to digit-tokenized or continuous regression approaches. All tasks—including complex ones like polygonal OCR outputs—are handled under this coordinate prediction framework.
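
The quantization step can be made concrete with a short sketch. This is a minimal illustration assuming the [0, 999] binning described above, not the released implementation; the token surface forms (e.g. "<104>") and helper names are assumptions.

```python
def quantize(value: float, extent: float, n_bins: int = 1000) -> int:
    """Map a pixel coordinate to a discrete bin index in [0, n_bins - 1]."""
    bin_idx = int(value / extent * n_bins)
    return min(max(bin_idx, 0), n_bins - 1)

def box_to_tokens(x0: float, y0: float, x1: float, y1: float,
                  img_w: int, img_h: int) -> str:
    """Render a pixel-space bounding box as a coordinate-token sequence."""
    coords = [
        quantize(x0, img_w), quantize(y0, img_h),
        quantize(x1, img_w), quantize(y1, img_h),
    ]
    return "<box_start> " + " ".join(f"<{c}>" for c in coords) + " <box_end>"

# Example: a 200x100 box at the top-left of a 1920x1080 image.
print(box_to_tokens(0, 0, 200, 100, 1920, 1080))
# -> <box_start> <0> <0> <104> <92> <box_end>
```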

2. Task Formulation, Data Engines, and Training Protocols

Rex-Omni replaces standard regression modeling with sequence generation in all vision tasks. The use of special coordinate tokens allows for efficient and unified prediction over detection, referring, pointing, and annotation tasks.

  • Data Engines: Multiple specialized engines synthesize supervision data for grounding, referring, pointing, and text-region annotation. This high-volume, semantically diverse data (22 million instances for SFT) ensures strong cross-modal alignment and supports context-dependent reasoning.
  • Training Pipeline (two stages):
    1. Supervised Fine-Tuning (SFT):
      • Cross-entropy loss over coordinate tokens.
      • Teacher forcing; outputs are guided by ground-truth coordinate sequences.
    2. Reinforcement Post-Training (GRPO):
      • Geometry-aware rewards (IoU, duplicate suppression) refine the model’s outputs.
      • The GRPO objective incorporates the group advantage:

      A_i = \frac{r_i - \operatorname{mean}(r_1,\ldots,r_G)}{\operatorname{std}(r_1,\ldots,r_G)}

      \mathcal{J}_{\text{GRPO}}(\theta) = \frac{1}{G}\sum_{i=1}^{G}\frac{1}{|o_i|}\sum_{t=1}^{|o_i|} \Biggl[\min\bigl(\rho_{i,t}\,\hat{A}_{i,t},\ \operatorname{clip}(\rho_{i,t}, 1-\epsilon, 1+\epsilon)\,\hat{A}_{i,t}\bigr) - \beta\, \mathbb{D}_{\text{KL}}\bigl[\pi_\theta(\cdot \mid o_{<t})\,\|\,\pi_{\text{ref}}(\cdot \mid o_{<t})\bigr]\Biggr]

      • Policy gradient with KL regularization prevents deviation from pretrained behavior.

GRPO post-training is especially effective at correcting SFT-induced errors, including duplicate predictions and size misalignment in dense scenarios.
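
As a concrete illustration of the group advantage and geometry-aware reward described above, the sketch below normalizes rewards within one sampled group. The reward shape (an IoU term minus a duplicate-prediction penalty) and its weight are illustrative assumptions, not the paper's exact reward definition.

```python
import numpy as np

def geometry_reward(iou: float, n_duplicates: int, dup_penalty: float = 0.2) -> float:
    """Illustrative geometry-aware reward: localization quality minus a duplicate penalty."""
    return iou - dup_penalty * n_duplicates

def group_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """A_i = (r_i - mean(r_1..r_G)) / std(r_1..r_G), computed within one group."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example group of G = 4 rollouts sampled for the same prompt.
rewards = np.array([geometry_reward(0.90, 0),
                    geometry_reward(0.60, 1),
                    geometry_reward(0.30, 0),
                    geometry_reward(0.80, 2)])
print(group_advantages(rewards))  # above-average rollouts receive positive advantage
```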

3. Benchmarking and Performance Metrics

Rex-Omni demonstrates competitive and in some cases superior zero-shot performance on COCO and LVIS benchmarks, without direct benchmark-targeted training.

  • COCO: F1 scores at IoU thresholds of 0.5 and 0.95, together with mIoU, rival or surpass those of regression-based detectors (DETR, YOLO) and open-vocabulary models (Grounding DINO).
  • LVIS: Strong language understanding enables robust detection of rare or long-tail categories through flexible text queries.
  • The two-stage pipeline notably enhances coordinate precision and reduces duplicate or large-box errors, verified by quantitative increases in mIoU and recall under zero-shot conditions.

A plausible implication is that discrete coordinate prediction via tokenization, combined with RL finetuning, can meet or exceed the localization fidelity of continuous regression for complex multimodal tasks.
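
The quantitative claims above rest on F1-style matching of predicted boxes to ground truth at fixed IoU thresholds. The following sketch shows one standard greedy-matching way to compute such a score; it is illustrative and not the official COCO/LVIS evaluation code.

```python
def iou(a, b):
    """IoU of two boxes given as (x0, y0, x1, y1)."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / (union + 1e-9)

def f1_at_iou(preds, gts, thr=0.5):
    """Greedily match predictions to ground truth at an IoU threshold, then compute F1."""
    matched, tp = set(), 0
    for p in preds:
        best_j, best_score = None, thr
        for j, g in enumerate(gts):
            score = iou(p, g)
            if j not in matched and score >= best_score:
                best_j, best_score = j, score
        if best_j is not None:
            matched.add(best_j)
            tp += 1
    precision = tp / max(len(preds), 1)
    recall = tp / max(len(gts), 1)
    return 2 * precision * recall / max(precision + recall, 1e-9)
```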

4. Unified Cross-Modal Reasoning and Task Versatility

Rex-Omni’s design supports diverse tasks beyond standard detection:

  • Object Referring: Interprets complex referring expressions, outperforming open-set detectors in handling context and relationship-based localization.
  • Visual Prompting: Receives image-region cues in token format to guide the detector; capable of discovering objects similar to those in the prompt elsewhere in the image.
  • OCR and Key-Pointing: Generates polygonal or multi-point token sequences for text regions, achieving performance on par with specialized OCR systems on dedicated tasks.
  • GUI Grounding and Spatial Referring: Handles multi-modal queries combining spatial and semantic language, with data engines providing matched supervision.

The framing of all visual tasks as token sequence completion enables deep integration of language and spatial understanding, enhancing generalization to language-rich or context-sensitive scenarios.
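
To illustrate how such token-sequence completions map back to image space, the following hedged sketch decodes a generated bounding-box span into pixel coordinates. The token surface form and the regular expression are assumptions consistent with the format shown in Section 1.

```python
import re

def parse_boxes(text: str, img_w: int, img_h: int, n_bins: int = 1000):
    """Extract <box_start> ... <box_end> spans and rescale bin indices to pixels."""
    boxes = []
    for m in re.finditer(r"<box_start>\s*((?:<\d+>\s*){4})<box_end>", text):
        x0, y0, x1, y1 = [int(t) for t in re.findall(r"<(\d+)>", m.group(1))]
        boxes.append((x0 / n_bins * img_w, y0 / n_bins * img_h,
                      x1 / n_bins * img_w, y1 / n_bins * img_h))
    return boxes

# Example: one box predicted for a referring query on a 1920x1080 image.
output = "the red mug <box_start> <104> <92> <520> <480> <box_end>"
print(parse_boxes(output, img_w=1920, img_h=1080))
```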

5. Comparative Analysis with Regression-Based and Open-Vocabulary Detectors

Compared with traditional detectors (YOLO, DETR):

  • Rex-Omni’s discrete token prediction does not rely on engineered feature pyramids or proposal networks.
  • Detection quality and recall match or exceed open-vocabulary approaches such as Grounding DINO, particularly in zero-shot settings.
  • The unified text-based interface supports broader semantic and spatial generalization, as opposed to category- or region-level modularity in specialized detectors.
  • No auxiliary region proposal or external encoders are needed; all outputs are generated autoregressively by the LLM.

These findings support the use of MLLMs for multimodal detection and annotation tasks without the typical regression bottlenecks or modular architectures.

6. Mathematical Formalization of Coordinate Tokenization and GRPO

The model’s output space is formalized as:

  • Discrete coordinate tokens: \text{Coord} \in \{0, 1, \ldots, 999\}.
  • Group advantage for RL: A_i = \frac{r_i - \operatorname{mean}(r_1,\ldots,r_G)}{\operatorname{std}(r_1,\ldots,r_G)}
  • GRPO objective: combines geometry-aware policy gradients and KL-regularization, maintaining output diversity while suppressing deviations from pretrained behavior:

\mathcal{J}_{\text{GRPO}}(\theta) = \frac{1}{G}\sum_{i=1}^{G}\frac{1}{|o_i|}\sum_{t=1}^{|o_i|} \Biggl[\min\bigl(\rho_{i,t}\,\hat{A}_{i,t},\ \operatorname{clip}(\rho_{i,t}, 1-\epsilon, 1+\epsilon)\,\hat{A}_{i,t}\bigr) - \beta\, \mathbb{D}_{\text{KL}}\bigl[\pi_\theta(\cdot \mid o_{<t})\,\|\,\pi_{\text{ref}}(\cdot \mid o_{<t})\bigr]\Biggr]

This framework systematizes the alignment between discrete spatial modeling and language generation, leveraging RL for behavior regularization and precision.
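
A schematic rendering of this objective may make the clipping and KL terms concrete. It is a hedged sketch, not the paper's implementation: tensor names and shapes are assumptions, the KL term is a simple per-token estimate, and padding masks plus the per-sequence 1/|o_i| normalization are omitted for brevity.

```python
import torch

def grpo_loss(logp: torch.Tensor, logp_old: torch.Tensor, logp_ref: torch.Tensor,
              advantages: torch.Tensor, eps: float = 0.2, beta: float = 0.01) -> torch.Tensor:
    """Schematic per-token GRPO objective, negated so it can be minimized.

    logp, logp_old, logp_ref: [G, T] log-probs of the sampled tokens under the
    current, behavior, and frozen reference policies; advantages: [G, T]."""
    rho = torch.exp(logp - logp_old)                      # importance ratio rho_{i,t}
    clipped = torch.clamp(rho, 1.0 - eps, 1.0 + eps)
    surrogate = torch.minimum(rho * advantages, clipped * advantages)
    kl = logp - logp_ref                                  # simple per-token KL estimate
    return -(surrogate - beta * kl).mean()
```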

7. Significance and Implications

Rex-Omni embodies a comprehensive fusion of large-scale language modeling and visual perception, setting a precedent for unified multimodal understanding under a next-point prediction paradigm. Its performance across standard detection, rare-category discovery, OCR, and context-sensitive tasks, achieved via a quantized coordinate token interface and geometry-aware RL, highlights its practical and technical robustness.

  • The model’s versatility enables direct application to vision-language challenges previously dominated by domain-specific regression architectures.
  • A plausible implication is that further scaling and refinement of the coordinate token approach, combined with improved RL techniques and more diverse data engines, could extend Rex-Omni’s capabilities across even broader perceptual domains.

These findings collectively indicate that the Rex-Omni framework is poised to facilitate more flexible, language-aware, and spatially precise multimodal systems in both research and applied contexts (Jiang et al., 14 Oct 2025).
