OmniVerifier-7B: Universal Visual Verifier
- OmniVerifier-7B is a 7B-parameter generative visual verifier, trained with reinforcement learning, that produces binary judgments, chain-of-thought justifications, and detailed explanations from image-prompt pairs.
- It leverages a unified vision–language transformer architecture with a frozen ViT-style encoder and a transformer decoder to process and verify multimodal inputs.
- Its sequential test-time scaling refines initial outputs iteratively, significantly boosting performance on explicit alignment and relational verification tasks.
OmniVerifier-7B is a 7 billion-parameter generative visual verifier designed to function as a universal visual verification engine for unified multimodal reasoning and generation. Trained on a balanced corpus of true/false image-prompt pairs with reinforcement learning (RL), it produces structured verification outputs—binary judgments, chain-of-thought, detailed explanations, and, in sequential usage, edit instructions—enabling reflection, refinement, and increased reliability in complex multimodal models. It constitutes the first omni-capable, generative verifier designed for robust, end-to-end evaluation and optimization of visual outcomes, supporting both reasoning and stepwise improvement in contemporary vision-language systems (Zhang et al., 15 Oct 2025).
1. Model Architecture and Modalities
OmniVerifier-7B is based on Qwen2.5-VL-7B, a unified vision–language transformer architecture comprising 7 billion parameters. Its vision encoder is a frozen ViT-style module, projecting input RGB images into embedding tokens without further in-model adaptation. The language decoder stack (transformer layers with cross-attention to visual embeddings) processes concatenated image and prompt inputs.
Accepted inputs per instance are:
- An RGB image,
- A natural-language prompt specifying the verification query.
The model generates output in the following structured format:
- a binary “true”/“false” judgment,
- a short chain-of-thought justification,
- when the answer is “false,” a natural-language explanation, and
- in test-time sequential (TTS) usage, an edit instruction derived from the failed case.
Distinctively, OmniVerifier-7B introduces no new architectural components or adapters beyond the Qwen2.5-VL-7B backbone. The key innovation is RL-based fine-tuning to endow the backbone with generative verification ability on image-prompt pairs.
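The structured verification output described above can be sketched as follows. The JSON field names are illustrative assumptions for exposition; the paper does not publish an exact schema.

```python
import json

# Hypothetical structured verdict from the verifier; field names are
# illustrative, not the paper's published schema.
example_output = json.dumps({
    "judgment": "false",
    "chain_of_thought": "The prompt asks for three trees; only two are visible.",
    "explanation": "The image shows two trees, but the prompt requires three.",
    "edit_instruction": "Add a third tree on the right side of the scene.",
})

def parse_verdict(raw: str) -> dict:
    """Parse and minimally validate a verifier response."""
    out = json.loads(raw)
    assert out["judgment"] in ("true", "false")
    # Explanation and edit instruction are only expected on "false" verdicts.
    if out["judgment"] == "false":
        assert out["explanation"]
    return out

verdict = parse_verdict(example_output)
```

In sequential (TTS) usage, the `edit_instruction` field is what gets handed back to the generator as the next edit prompt.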
2. Data Curation and Automated Training Pipelines
Large-scale, high-quality verification data—specifically, complex image-prompt true/false pairs—were curated using two automated pipelines applied to a dataset comprising 20,000 LVIS natural images and 20,000 Seedream 3.0 synthetic renderings.
Automated Verification Pipelines
- Image-Fixed, Prompt-Modified Pipeline:
- Generate a faithful prompt referencing only visually verifiable elements using GPT-5, labeling the resulting (image, prompt) as a true instance.
- Modify the prompt automatically using GPT-5 (object/attribute/relation edits), generating a matching explanation to yield false-instance pairs.
- Prompt-Fixed, Image-Inpainting Pipeline:
- Use SAM 2.1 to segment images by object masks, selecting segmentation difficulty via object area.
- Maintain the original prompt, flagging focus bounding boxes in text.
- Utilize FLUX.1-dev to inpaint or remove objects, yielding a false image while keeping the prompt fixed.
All generated pairs are subjected to a filtering stage in which the Seed1.5-VL model votes in a Best-of-10 scheme; only samples verified correctly in at least 60% of votes are retained. This produces a final, balanced dataset of approximately 28,000 true/false pairs covering explicit alignment and relational verification scenarios.
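The Best-of-10 filtering step reduces to a vote-agreement check. A minimal sketch, in which the function name and the string-label vote representation are illustrative assumptions:

```python
def bo10_filter(votes: list[str], label: str, threshold: float = 0.6) -> bool:
    """Keep a candidate (image, prompt) pair only if enough verifier votes
    agree with its ground-truth true/false label (>= 60% by default)."""
    agree = sum(v == label for v in votes)
    return agree / len(votes) >= threshold

# A pair labeled "false" with 7/10 agreeing votes passes the filter;
# a 5/10 split falls below the 60% threshold and is discarded.
keep = bo10_filter(["false"] * 7 + ["true"] * 3, "false")
drop = bo10_filter(["false"] * 5 + ["true"] * 5, "false")
```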
3. Training Objectives and Reinforcement Learning
OmniVerifier-7B is trained with the DAPO reinforcement learning algorithm, with rewards tailored for verification accuracy and output formatting:
- Rule-based reward $R_{\text{acc}}$: set to 1 if the binary answer matches ground truth, else 0.
- Format reward $R_{\text{fmt}}$: set to 1 if the model output matches the prescribed (JSON + explanation) format, else 0.
Total reward is a weighted combination, $R = \lambda_{\text{acc}} R_{\text{acc}} + \lambda_{\text{fmt}} R_{\text{fmt}}$.
The objective is to maximize the expected total reward under the policy $\pi_\theta$: $\max_\theta \; \mathbb{E}_{o \sim \pi_\theta(\cdot \mid x)}\big[R(o)\big]$.
Training stability is ensured using the standard PPO (Proximal Policy Optimization) clipped surrogate loss, $\mathcal{L}^{\text{clip}}(\theta) = \mathbb{E}_t\!\left[\min\!\big(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\,\hat{A}_t\big)\right]$, where $r_t(\theta) = \pi_\theta(a_t \mid s_t)/\pi_{\theta_{\text{old}}}(a_t \mid s_t)$ is the importance sampling ratio and $\hat{A}_t$ is an advantage estimator based on the combined reward. This setup tightly couples verification correctness and output structure during learning.
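The reward combination and the clipped surrogate can be sketched in a few lines. The weights `w_acc`/`w_fmt` and the clip range `eps` below are placeholder values, not the paper's settings:

```python
def total_reward(answer_correct: bool, format_ok: bool,
                 w_acc: float = 1.0, w_fmt: float = 0.5) -> float:
    """Weighted combination of the rule-based and format rewards.
    Weights are illustrative placeholders."""
    return w_acc * float(answer_correct) + w_fmt * float(format_ok)

def ppo_clip_objective(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """Clipped PPO surrogate for a single step: the min of the unclipped
    and clipped terms, which bounds the size of each policy update."""
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)

# A correct, well-formatted answer earns the full combined reward (1.5 here).
r = total_reward(answer_correct=True, format_ok=True)
# With positive advantage, a large ratio is clipped at 1 + eps.
obj = ppo_clip_objective(ratio=2.0, advantage=1.0)
```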
4. Core Verification Capabilities
Systematic ablation studies on object, attribute, spatial, and maze-style data identified three core, or “atomic,” capabilities:
- Explicit Alignment: Matching between text and directly perceivable image entities (e.g., “red ball”, “three trees”).
- Relational Verification: Reasoning over object-level or set-level relationships (e.g., “ball above box”, counting).
- Integrative Reasoning: Holistic, multi-step evaluation over complex scenes, including tasks such as maze-solving or robotics stacking.
Bi-directional transfer was observed between explicit alignment and relational verification via RL training. Integrative reasoning, however, displayed significant domain specificity, necessitating targeted in-domain data for effective generalization.
5. Empirical Results on ViVerBench
ViVerBench is a multimodal verification benchmark comprising 16 subtasks (3,594 samples) encompassing Concept Existence, Object Relations, World Dynamics, Image Annotation, State Evaluation, and STEM-relevant reasoning. Evaluation is performed on two axes:
- Rule-based accuracy ($\text{Acc}_{\text{rule}}$): binary decision correctness.
- Model-based accuracy ($\text{Acc}_{\text{model}}$): incorporates judgment validity and explanation quality.
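Rule-based accuracy reduces to exact-match scoring of the binary verdicts; a minimal sketch (model-based accuracy additionally requires a judge model to score explanations, which is not shown here):

```python
def rule_based_accuracy(predictions: list[str], labels: list[str]) -> float:
    """Fraction of binary verdicts that match the ground-truth labels."""
    assert len(predictions) == len(labels)
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

# Three of four verdicts match, so accuracy is 0.75.
acc = rule_based_accuracy(
    ["true", "false", "false", "true"],
    ["true", "false", "true", "true"],
)
```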
Key results:
| Model | Rule-based | Model-based |
|---|---|---|
| Qwen2.5-VL-7B | 0.570 | 0.523 |
| GPT-4o | 0.645 | 0.578 |
| OmniVerifier-7B | 0.653 | 0.559 |
| Human ceiling | 0.932 | 0.932 |
OmniVerifier-7B demonstrates an 8.3-point gain relative to its backbone (Qwen 2.5-VL-7B), marginally exceeding GPT-4o in rule-based accuracy and closely approaching performance observed in models with substantially larger capacity. Per-category improvements are most pronounced on Explicit Alignment and Relational Verification tasks.
6. Sequential Test-Time Scaling (OmniVerifier-TTS)
OmniVerifier-TTS is a sequential self-refinement paradigm that interleaves generation and verification at test time. Its workflow iteratively improves on initial model outputs:
- Generate an initial image $I_0 = G(p)$ from the prompt $p$.
- For iterations $t = 0, 1, \dots, T-1$:
  - Obtain the verification $(v_t, e_t) = V(I_t, p)$ (judgment and explanation).
  - If $v_t = \text{true}$ or $t = T-1$, return $I_t$.
  - Otherwise, translate the explanation $e_t$ into an edit prompt $p_t = \mathcal{T}(e_t)$ and perform the edit $I_{t+1} = G_{\text{edit}}(I_t, p_t)$.
Iterative formulation: $I_{t+1} = G_{\text{edit}}\big(I_t,\ \mathcal{T}(V(I_t, p))\big)$.
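The loop above can be sketched as follows. The callables stand in for the generator, verifier, explanation-to-edit translation, and editor; all names and the string-valued "images" in the toy usage are illustrative.

```python
from typing import Callable, Tuple

def omniverifier_tts(
    prompt: str,
    generate: Callable[[str], str],                  # prompt -> image
    verify: Callable[[str, str], Tuple[bool, str]],  # (image, prompt) -> (ok, explanation)
    to_edit_prompt: Callable[[str], str],            # explanation -> edit instruction
    edit: Callable[[str, str], str],                 # (image, edit prompt) -> image
    max_iters: int = 10,
) -> str:
    """Sequential test-time scaling: generate, then verify and edit until
    the verifier accepts the image or the iteration budget is exhausted."""
    image = generate(prompt)
    for _ in range(max_iters):
        ok, explanation = verify(image, prompt)
        if ok:
            break
        image = edit(image, to_edit_prompt(explanation))
    return image

# Toy stand-ins (strings instead of images) to exercise the loop:
result = omniverifier_tts(
    prompt="three trees",
    generate=lambda p: "2 trees",
    verify=lambda img, p: (img == "3 trees", "only two trees are present"),
    to_edit_prompt=lambda ex: "add one more tree",
    edit=lambda img, ep: "3 trees",
)
```

Because the loop stops as soon as the verifier accepts, the number of generative calls is adaptive rather than fixed, which is what distinguishes this from parallel Best-of-N sampling.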
Empirical gains:
| Benchmark | Model | Baseline | TTS | Δ |
|---|---|---|---|---|
| T2I-ReasonBench | Qwen-Image | 55.5 | 59.2 | +3.7 |
| T2I-ReasonBench | GPT-Image-1 | 76.8 | 79.3 | +2.5 |
| GenEval++ | Qwen-Image | 0.675 | 0.718 | +4.3 |
| GenEval++ | GPT-Image-1 | 0.689 | 0.721 | +3.2 |
OmniVerifier-TTS outperforms parallel Best-of-N selection (N = 10), attaining a higher performance upper bound while using roughly 47% as many generative calls.
7. Limitations and Prospects
OmniVerifier-7B exhibits residual generalization gaps, particularly in integrative reasoning domains like maze solving and robotics, which require tailored in-domain data for effective adaptation. Universal verification across all conceivable multimodal tasks remains an open problem. In TTS pipelines, backbone UMMs (e.g., GPT-Image-1) may exhibit style drift over extended edit sequences—primarily color or stylistic artifacts—though verification accuracy persists.
Anticipated future directions include extending the verification framework to additional domains such as video and 3D data, enhancing data augmentation strategies for integrative reasoning, and scaling both model and dataset to further improve generalization and performance (Zhang et al., 15 Oct 2025).