
OmniVerifier-7B: Universal Visual Verifier

Updated 21 January 2026
  • OmniVerifier-7B is a 7B-parameter generative visual verifier, trained with reinforcement learning, that produces binary judgments, chain-of-thought reasoning, and detailed explanations from image-prompt pairs.
  • It leverages a unified vision–language transformer architecture with a frozen ViT-style encoder and a transformer decoder to process and verify multimodal inputs.
  • Its sequential test-time scaling refines initial outputs iteratively, significantly boosting performance on explicit alignment and relational verification tasks.

OmniVerifier-7B is a 7 billion-parameter generative visual verifier designed to function as a universal visual verification engine for unified multimodal reasoning and generation. Trained on a balanced corpus of true/false image-prompt pairs with reinforcement learning (RL), it produces structured verification outputs—binary judgments, chain-of-thought, detailed explanations, and, in sequential usage, edit instructions—enabling reflection, refinement, and increased reliability in complex multimodal models. It constitutes the first omni-capable, generative verifier designed for robust, end-to-end evaluation and optimization of visual outcomes, supporting both reasoning and stepwise improvement in contemporary vision-language systems (Zhang et al., 15 Oct 2025).

1. Model Architecture and Modalities

OmniVerifier-7B is based on Qwen2.5-VL-7B, a unified vision–language transformer architecture comprising 7 billion parameters. Its vision encoder is a frozen ViT-style module, projecting input RGB images into embedding tokens without further in-model adaptation. The language decoder stack (transformer layers with cross-attention to visual embeddings) processes concatenated image and prompt inputs.

Accepted inputs per instance are:

  • An RGB image,
  • A natural-language prompt specifying the verification query.

The model generates output in the following structured format:

  • a binary “true”/“false” judgment,
  • a short chain-of-thought justification,
  • when the answer is “false,” a natural-language explanation, and
  • in sequential test-time scaling (TTS) usage, an edit instruction derived from the failed case.

Distinctively, OmniVerifier-7B introduces no new architectural components or adapters beyond the Qwen2.5-VL-7B backbone. The key innovation is RL-based fine-tuning to endow the backbone with generative verification ability on image-prompt pairs.
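The structured output described above can be illustrated as a single JSON record. This is a hypothetical sketch: the field names and exact schema are assumptions for illustration, not the paper's actual output format.

```python
import json

# Hypothetical example of the structured verification output; field names
# are illustrative assumptions, not the paper's actual schema.
verification = {
    "judgment": "false",  # binary true/false decision
    "chain_of_thought": "The prompt asks for three trees; only two are visible.",
    "explanation": "The image contains two trees, not three.",  # only when "false"
    "edit_instruction": "Add a third tree beside the house.",   # TTS usage only
}

output = json.dumps(verification, indent=2)
print(output)
```

In TTS usage, the `edit_instruction` field is what gets handed back to the generator as an edit prompt; in plain verification it would be absent.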

2. Data Curation and Automated Training Pipelines

Large-scale, high-quality verification data—specifically, complex image-prompt true/false pairs—were curated using two automated pipelines applied to a dataset comprising 20,000 LVIS natural images and 20,000 Seedream 3.0 synthetic renderings.

Automated Verification Pipelines

  • Image-Fixed, Prompt-Modified Pipeline:
  1. Generate a faithful prompt referencing only visually verifiable elements using GPT-5, labeling the resulting (image, prompt) as a true instance.
  2. Modify the prompt automatically using GPT-5 (object/attribute/relation edits), generating a matching explanation to yield false-instance pairs.
  • Prompt-Fixed, Image-Inpainting Pipeline:
  1. Use SAM 2.1 to segment images by object masks, selecting segmentation difficulty via object area.
  2. Maintain the original prompt, flagging focus bounding boxes in text.
  3. Utilize FLUX.1-dev to inpaint or remove objects, yielding a false image while keeping the prompt fixed.

All generated pairs pass a filtering stage in which the Seed 1.5-VL model votes on each sample in a Best-of-10 scheme; only samples verified correctly in at least 60% of votes are retained. This produces a final, finely balanced dataset of approximately 28,000 true/false pairs covering explicit alignment and relational verification scenarios.
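The 60%-retention filter reduces to a simple vote-counting rule. The sketch below is an illustrative assumption of how such a filter could look, not the paper's implementation; `keep_sample` and its signature are invented for this example.

```python
def keep_sample(votes, threshold=0.6):
    """Best-of-N filtering sketch: keep an (image, prompt, label) pair only
    if at least `threshold` of the verifier's votes agree with the intended
    label. `votes` is a list of booleans (True = vote matches the label).
    Illustrative assumption, not the paper's actual filtering code."""
    return sum(votes) / len(votes) >= threshold

# With Best-of-10 voting, 6 agreeing votes clears the 60% bar; 5 does not.
passes = keep_sample([True] * 6 + [False] * 4)
fails = keep_sample([True] * 5 + [False] * 5)
```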

3. Training Objectives and Reinforcement Learning

OmniVerifier-7B is trained with the DAPO reinforcement learning algorithm, with rewards tailored for verification accuracy and output formatting:

  • Rule-based reward $R_{\mathrm{rule}}$: set to 1 if the binary answer matches the ground truth, else 0.
  • Format reward $R_{\mathrm{fmt}}$: set to 1 if the model output matches the prescribed (JSON + explanation) format.

Total reward is a weighted combination:

$$R_{\mathrm{total}}(\tau) = 0.9\,R_{\mathrm{rule}}(\tau) + 0.1\,R_{\mathrm{fmt}}(\tau)$$

The objective is to maximize the expected total reward under the policy $\pi_\theta$:

$$J(\theta) = \mathbb{E}_{(I,P)\sim \mathcal{D}}\;\mathbb{E}_{\tau\sim\pi_\theta(\cdot\mid I,P)}\left[ R_{\mathrm{total}}(\tau) \right]$$

Training stability is ensured using the standard PPO (Proximal Policy Optimization) clipped surrogate loss:

$$L^{\mathrm{PPO}}(\theta) = -\mathbb{E}_t\left[\min\Big( r_t(\theta)\,\hat{A}_t,\ \mathrm{clip}\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_t \Big)\right]$$

where $r_t(\theta)$ is the importance-sampling ratio and $\hat{A}_t$ is an advantage estimate based on the combined reward. This setup tightly couples verification correctness with output structure during learning.
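The reward combination and the clipped surrogate can be written out directly from the formulas above. This is a generic per-step sketch of those equations, not the actual training code; the function names and scalar (single-step) form are assumptions.

```python
def total_reward(answer_correct: bool, format_ok: bool) -> float:
    """R_total = 0.9 * R_rule + 0.1 * R_fmt, per the weighting above."""
    r_rule = 1.0 if answer_correct else 0.0
    r_fmt = 1.0 if format_ok else 0.0
    return 0.9 * r_rule + 0.1 * r_fmt

def ppo_clip_loss(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """Single-step PPO clipped surrogate (negated, for minimization).
    `ratio` is the importance-sampling ratio r_t, `advantage` is A-hat_t."""
    unclipped = ratio * advantage
    clipped = max(min(ratio, 1.0 + eps), 1.0 - eps) * advantage
    return -min(unclipped, clipped)
```

Note how the 0.9/0.1 weighting means a correctly formatted but wrong answer earns only 0.1, so verification correctness dominates the gradient signal.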

4. Core Verification Capabilities

Systematic ablation studies on object, attribute, spatial, and maze-style data identified three core, or “atomic,” capabilities:

  • Explicit Alignment: Matching between text and directly perceivable image entities (e.g., “red ball”, “three trees”).
  • Relational Verification: Reasoning over object-level or set-level relationships (e.g., “ball above box”, counting).
  • Integrative Reasoning: Holistic, multi-step evaluation over complex scenes, including tasks such as maze-solving or robotics stacking.

Bi-directional transfer was observed between explicit alignment and relational verification via RL training. Integrative reasoning, however, displayed significant domain specificity, necessitating targeted in-domain data for effective generalization.

5. Empirical Results on ViVerBench

ViVerBench is a multimodal verification benchmark comprising 16 subtasks (3,594 samples) encompassing Concept Existence, Object Relations, World Dynamics, Image Annotation, State Evaluation, and STEM-relevant reasoning. Evaluation is performed on two axes:

  • Rule-based accuracy ($\mathrm{Acc}_{\mathrm{rule}}$): binary decision correctness.
  • Model-based accuracy ($\mathrm{Acc}_{\mathrm{model}}$): incorporates judgment validity and explanation quality.

Key results:

| Model | Rule-based | Model-based |
|---|---|---|
| Qwen2.5-VL-7B | 0.570 | 0.523 |
| GPT-4o | 0.645 | 0.578 |
| OmniVerifier-7B | 0.653 | 0.559 |
| Human ceiling | 0.932 | 0.932 |

OmniVerifier-7B demonstrates an 8.3-point gain in rule-based accuracy over its backbone (Qwen2.5-VL-7B), marginally exceeding GPT-4o on that axis and approaching the performance of models with substantially larger capacity. Per-category improvements are most pronounced on explicit alignment and relational verification tasks.

6. Sequential Test-Time Scaling (OmniVerifier-TTS)

OmniVerifier-TTS is a sequential self-refinement paradigm enabling test-time interleaving of generation and verification. Its workflow recursively improves on initial model outputs:

  1. Generate an initial image $x_0 = \mathrm{UMM}(P)$ from the prompt $P$.
  2. For iterations $t = 0, \ldots, T$:
    • Obtain the verification $(y_t, e_t) = \text{OmniVerifier-7B}(P, x_t)$.
    • If $y_t = \text{true}$ or $t = T$, return $x_t$.
    • Else, translate the explanation $e_t$ into an edit prompt $d_t$ and perform the edit $x_{t+1} = \mathrm{UMM\_edit}(x_t, d_t)$.

Iterative formulation:

$$x_{t+1} = G_\phi\big(x_t,\, f_\theta(P, x_t)\big), \qquad f_\theta(P, x_t) = (y_t, e_t, d_t)$$
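The sequential loop above can be sketched in a few lines. Here `generate`, `verify`, and `edit` stand in for the UMM generator, OmniVerifier-7B, and the UMM's editing interface; all three are placeholder callables assumed for illustration, not real APIs.

```python
def omniverifier_tts(prompt, generate, verify, edit, max_iters=5):
    """Sketch of the sequential test-time scaling loop. The three callables
    are placeholders for the UMM generator, the verifier, and the UMM's
    editing interface; this is not the paper's implementation."""
    x = generate(prompt)                        # x_0 = UMM(P)
    for t in range(max_iters + 1):              # t = 0, ..., T
        judgment, explanation, edit_prompt = verify(prompt, x)
        if judgment or t == max_iters:          # accept, or stop at t = T
            return x
        x = edit(x, edit_prompt)                # x_{t+1} = UMM_edit(x_t, d_t)
```

Each iteration costs one verifier call plus at most one edit call, which is what lets the sequential scheme undercut parallel Best-of-N in total generative calls.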

Empirical gains:

| Benchmark | Model | Baseline | TTS | Δ |
|---|---|---|---|---|
| T2I-ReasonBench | Qwen-Image | 55.5 | 59.2 | +3.7 |
| T2I-ReasonBench | GPT-Image-1 | 76.8 | 79.3 | +2.5 |
| GenEval++ | Qwen-Image | 0.675 | 0.718 | +4.3 |
| GenEval++ | GPT-Image-1 | 0.689 | 0.721 | +3.2 |

OmniVerifier-TTS outperforms parallel Best-of-N selection (N = 10), attaining a higher performance upper bound while using roughly 47% as many generative calls.
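For contrast, the parallel Best-of-N baseline mentioned above amounts to independent sampling plus verifier-based selection. The sketch below assumes placeholder `generate` and `score` callables (the generator and the verifier used as a scorer), not real APIs.

```python
def best_of_n(prompt, generate, score, n=10):
    """Parallel Best-of-N sketch: draw n candidates independently and keep
    the one the verifier scores highest. Unlike sequential TTS, no candidate
    benefits from feedback on the others, and all n generations are paid
    up front."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=score)
```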

7. Limitations and Prospects

OmniVerifier-7B exhibits residual generalization gaps, particularly in integrative reasoning domains like maze solving and robotics, which require tailored in-domain data for effective adaptation. Universal verification across all conceivable multimodal tasks remains an open problem. In TTS pipelines, backbone UMMs (e.g., GPT-Image-1) may exhibit style drift over extended edit sequences—primarily color or stylistic artifacts—though verification accuracy persists.

Anticipated future directions include extending the verification framework to additional domains such as video and 3D data, enhancing data augmentation strategies for integrative reasoning, and scaling both model and dataset to further improve generalization and performance (Zhang et al., 15 Oct 2025).
