VALOR: Annotation-Free Visual Reasoning
- VALOR is an annotation-free training framework for visual reasoning that uses dual LLM and VLM verifiers to optimize logical spatial reasoning and object grounding.
- It employs a modular architecture in which an LLM verifier supplies rewards for reinforcement fine-tuning while a VLM verifier filters detections to generate pseudo-labels automatically.
- The framework achieves strong empirical performance across spatial benchmarks by decoupling reinforcement learning for planning from supervised fine-tuning for precise detection.
VALOR is an annotation-free training framework for visual reasoning that leverages multimodal verifiers to simultaneously optimize language-based spatial reasoning and visual grounding. It is specifically designed to address the entangled challenges of precise object grounding and robust logical reasoning in spatial tasks, achieving strong empirical performance without reliance on manually annotated supervision. The methodology utilizes LLMs and vision-language models (VLMs) as frozen verifiers to generate automated rewards and pseudo-labels, thereby enabling efficient reinforcement learning and supervised detector tuning. This paradigm enables performance and data scaling, modular specialization, and systematic error analysis across a wide spectrum of spatial reasoning benchmarks (Marsili et al., 9 Dec 2025).
1. Rationale and Paradigm Shift
Traditional visual reasoning pipelines depend on extensive datasets of labeled (image, query, answer) tuples. Such datasets require intensive expert effort to annotate spatial ground truth, bounding boxes, and logic traces. Prior language-only chain-of-thought methods rely on annotated answer supervision, while program-synthesis approaches use frozen models but suffer from flawed logic and weak grounding.
VALOR introduces an annotation-free regime with two key innovations:
- LLM Verifier: A frozen, high-capacity LLM acts as a reward generator for RL-based fine-tuning, providing binary feedback across several aspects of reasoning (decomposition, API tool calls, logical steps).
- VLM Verifier: A frozen VLM automates hard-negative mining and validates object grounding by filtering over-predicted detections, generating pseudo-labels for the detector without manual box annotations.
This hybrid verifier strategy targets the main failure modes—logic errors and misgrounded objects—thereby obviating the need for ground-truth supervision in both reasoning and grounding.
2. System Architecture
VALOR comprises four principal modules:
| Module | Role | Model(s) |
|---|---|---|
| Reasoning LLM | Policy over plans and Python code tool calls | Qwen3-8B (fine-tuned) |
| Vision Specialists | API suite for object detection, depth, appearance queries | GroundingDINO-T, MoGe2, GPT-5-mini |
| LLM Verifier | RL reward model giving multi-headed binary feedback | Gemini-2.5-Flash |
| VLM Verifier | Visual grounding critic; filters and pseudo-labels detections | GPT-5-mini |
- Reasoning Module: Accepts a query and outputs a natural-language plan plus corresponding Python code invoking vision APIs (gd_detect for object detection, depth for metric depth, vqa for appearance/existence).
- Vision Specialists: Enable modular decomposition of complex spatial tasks through explicit API calls.
- LLM Verifier: Supplies six binary reward heads—format, syntax, logic, attribute, spatial, adherence—aggregated into a scalar RL reward. Program traces and tool calls are checked for coherence, correctness, and consistency.
- VLM Verifier: Ingests aggressive object proposals from a low-threshold detector. Employs a three-stage filtering (coarse, crop-level, de-duplication) to derive positive and hard-negative boxes as detector pseudo-labels.
The architecture is modular. Specialists and verifiers are replaceable and extensible.
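To make the tool-calling interface concrete, the following is a minimal sketch of one (plan, code) trajectory the reasoning module might emit. The tool names gd_detect, depth, and vqa come from the paper's API description; their exact signatures, return types, and the example query are illustrative assumptions.

```python
# Illustrative (plan, code) trajectory for a query such as
# "Is the mug to the left of and closer than the laptop?"
# gd_detect/depth/vqa are the environment-provided vision tools;
# the signatures used here are assumptions.

# PLAN:
# 1. Detect the mug and the laptop with gd_detect.
# 2. Query metric depth at each box center with depth.
# 3. Compare horizontal positions (2D) and depths (3D) to answer.

def answer(image):
    mugs = gd_detect(image, "mug")            # list of boxes [x1, y1, x2, y2]
    laptops = gd_detect(image, "laptop")
    if not mugs or not laptops:
        # Fall back to a direct appearance/existence query if detection fails.
        return vqa(image, "Is the mug to the left of and closer than the laptop?")

    mug, laptop = mugs[0], laptops[0]
    mug_cx = (mug[0] + mug[2]) / 2
    laptop_cx = (laptop[0] + laptop[2]) / 2

    depth_map = depth(image)                  # metric depth, HxW array
    mug_z = depth_map[int((mug[1] + mug[3]) / 2), int(mug_cx)]
    laptop_z = depth_map[int((laptop[1] + laptop[3]) / 2), int(laptop_cx)]

    # Left in image coordinates and smaller metric depth means closer.
    return (mug_cx < laptop_cx) and (mug_z < laptop_z)
```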
3. Training Algorithm and Objectives
VALOR utilizes a two-phase, largely decoupled training schedule:
Phase I: LLM RL Fine-Tuning—Group Relative Policy Optimization (GRPO)
- Policy generates multiple (plan, code) trajectories per query.
- Verifier reward aggregates the six binary heads into a scalar, $R = \sum_{k=1}^{6} w_k\, r_k$ with $r_k \in \{0,1\}$ and per-head weights $w_k$.
- Policy update via GRPO (see the sketch after this list):

$$\mathcal{J}_{\mathrm{GRPO}}(\theta) = \mathbb{E}_{q,\ \{o_i\}\sim\pi_{\theta_{\mathrm{old}}}}\!\left[\frac{1}{G}\sum_{i=1}^{G}\min\!\Big(\rho_i(\theta)\,\hat{A}_i,\ \mathrm{clip}\big(\rho_i(\theta),\,1-\epsilon,\,1+\epsilon\big)\,\hat{A}_i\Big) - \beta\, D_{\mathrm{KL}}\!\big(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\big)\right]$$

with relative advantages $\hat{A}_i = \big(R_i - \mathrm{mean}(\{R_j\}_{j=1}^{G})\big)/\mathrm{std}(\{R_j\}_{j=1}^{G})$ and importance weights $\rho_i(\theta) = \pi_\theta(o_i \mid q)/\pi_{\theta_{\mathrm{old}}}(o_i \mid q)$.
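As a concrete illustration of Phase I, below is a minimal numerical sketch of how six binary verifier heads could be aggregated into a scalar reward and turned into group-relative advantages. Uniform head weights and the helper names are assumptions, not the paper's implementation.

```python
import numpy as np

# Minimal sketch of the Phase I reward signal and GRPO-style advantages.
HEADS = ["format", "syntax", "logic", "attribute", "spatial", "adherence"]

def scalar_reward(head_scores: dict, weights: dict | None = None) -> float:
    """Aggregate binary verifier heads (0/1) into one scalar reward."""
    weights = weights or {h: 1.0 / len(HEADS) for h in HEADS}  # uniform by assumption
    return sum(weights[h] * head_scores[h] for h in HEADS)

def group_relative_advantages(rewards: list[float]) -> np.ndarray:
    """GRPO: normalize rewards within the group of trajectories sampled for one query."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + 1e-8)

# Example: 4 sampled (plan, code) trajectories for a single query.
group = [
    {"format": 1, "syntax": 1, "logic": 1, "attribute": 1, "spatial": 1, "adherence": 1},
    {"format": 1, "syntax": 1, "logic": 0, "attribute": 1, "spatial": 0, "adherence": 1},
    {"format": 1, "syntax": 0, "logic": 0, "attribute": 0, "spatial": 0, "adherence": 0},
    {"format": 1, "syntax": 1, "logic": 1, "attribute": 0, "spatial": 1, "adherence": 1},
]
advantages = group_relative_advantages([scalar_reward(g) for g in group])
# Trajectories with above-average verifier scores receive positive advantages,
# which weight the clipped policy-gradient update on their tokens.
```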
Phase II: Detector SFT via VLM Verifier
- The final (or an intermediate) reasoning policy is queried on grounding prompts.
- Over-predicted detector boxes are filtered by the VLM verifier.
- Pseudo-labels (positives, hard negatives) are used for supervised fine-tuning of GroundingDINO with mixed regression and entropy loss.
Sequential alternation ensures that reasoning improvements cascade into grounding quality; the authors propose tighter interleaving or closed-loop RL/detection cycles as future work. A minimal sketch of the pseudo-label filtering appears below.
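The following sketch illustrates the Phase II pseudo-label generation step, assuming hypothetical detector and VLM callables and placeholder thresholds; only the three-stage structure (coarse image-level check, crop-level check, de-duplication) follows the paper's description.

```python
def iou(a, b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area = lambda box: (box[2] - box[0]) * (box[3] - box[1])
    return inter / (area(a) + area(b) - inter + 1e-8)

def pseudo_label(image, prompt, detector, vlm, dedup_iou=0.7):
    # Stage 0: over-predict with a low confidence threshold for high recall.
    proposals = detector(image, prompt, score_threshold=0.1)

    # Stage 1 (coarse): does the category appear in the image at all?
    if not vlm(image, f"Is there a {prompt} in this image?"):
        return [], proposals            # everything becomes a hard negative

    positives, negatives = [], []
    for box in proposals:
        # Stage 2 (crop-level): verify each proposal on its cropped region.
        crop = image.crop(box)
        (positives if vlm(crop, f"Is this a {prompt}?") else negatives).append(box)

    # Stage 3 (de-duplication): keep one box per highly overlapping cluster.
    deduped = []
    for box in positives:
        if all(iou(box, kept) < dedup_iou for kept in deduped):
            deduped.append(box)
    return deduped, negatives           # positives and hard negatives for detector SFT
```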
4. Implementation Specifics
- Base models standardized: Qwen3-8B for reasoning, GroundingDINO-T (Objects365+GLIP) for detection, MoGe2 for depth, Gemini-2.5-Flash for reward modeling.
- Prompting: System and reward prompts enforce plan/answer tags, 3D reasoning, and API signature templates. Verifier prompts specify logic consistency, spatial relationships, and adherence.
- Hard-Negative Mining: Lowered detector confidence thresholds yield high recall, and the VLM verifier prunes false positives based on semantic agreement. Hard negatives form the backbone of supervised correction in grounding (a loss sketch follows this list).
- Hyperparameters: GRPO optimization settings, batch sizes, reward head weights, and detection parameters are given in §A.13.
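To illustrate how the verified positives and hard negatives enter detector fine-tuning, here is a minimal sketch of a mixed regression-plus-classification objective. The specific loss form (plain L1 and binary cross-entropy), the weights, and the function signature are assumptions; GroundingDINO's actual objective additionally involves GIoU and contrastive alignment terms.

```python
import torch
import torch.nn.functional as F

def detector_sft_loss(pred_boxes, pred_logits, pos_boxes, pos_idx, neg_idx,
                      w_reg=1.0, w_cls=1.0):
    """
    pred_boxes:  (N, 4) predicted boxes for N queries
    pred_logits: (N,)   objectness/alignment logits
    pos_boxes:   (P, 4) pseudo-label boxes for the P positive matches
    pos_idx:     (P,)   indices of predictions matched to verified positives
    neg_idx:     (H,)   indices of predictions matched to hard negatives
    """
    # Regression term: L1 between matched predictions and pseudo-label boxes.
    reg = F.l1_loss(pred_boxes[pos_idx], pos_boxes)

    # Classification term: push positives toward 1 and hard negatives toward 0.
    logits = torch.cat([pred_logits[pos_idx], pred_logits[neg_idx]])
    targets = torch.cat([torch.ones(len(pos_idx)), torch.zeros(len(neg_idx))])
    cls = F.binary_cross_entropy_with_logits(logits, targets)

    return w_reg * reg + w_cls * cls
```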
5. Empirical Evaluation and Quantitative Analysis
VALOR is tested on a suite of spatial benchmarks:
| Task | Modalities | Benchmark Datasets |
|---|---|---|
| 3D reasoning | Vision, language | Omni3D-Bench, RoboSpatial, BLINK, VSR, RealWorldQA |
| 2D appearance | Vision, language | GQA, TallyQA, CountBenchQA |
- Comparative Baselines: LLM tool-users (GPT-4o, Gemini, Llama3), RL-tuned VLMs (GRIT, ViGoRL), program synthesis (VisProg, ViperGPT), direct-answer VLMs.
- Results: The LLM RL phase yields +6.4% on Omni3D-Bench and +3.4% on BLINK over the Qwen3-8B baseline; the full pipeline further improves RoboSpatial (+7.7%) and CountBenchQA (+8.3%). VALOR outperforms RL-tuned VLMs and program-synthesis baselines on reasoning-dominated tasks.
- Ablations: Removing the spatial reward head degrades VSR by 10%, and removing the logic head degrades Omni3D-Bench by 8%. Grounding SFT without preceding RL is clearly inferior, and performance scales with the number of queries and pseudo-labels.
6. Qualitative Error Analysis and Model Scalability
- Success Case: Hypothetical 3D queries (e.g., inferring the height of a coffee table) are solved by VALOR through explicit metric-depth computation and rescaling, outperforming 2D-only models (a minimal geometric sketch follows this list).
- Error Modes:
- Reasoning: Simplification of directional spatial relations may result in under-specification.
- Grounding: Small object detection fails under aggressive VLM filtering, leading to under-prediction or duplication.
- Verifier Reliability: Empirical verifier disagreement ~13% (LLM), VLM verifier precision ~75%. Verifier error is a principal bottleneck.
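As an illustration of the kind of metric computation behind such success cases, the sketch below estimates an object's real-world height from a 2D box and metric depth using a pinhole-camera approximation; the function, its inputs, and the example numbers are illustrative assumptions rather than VALOR's actual tool code.

```python
def metric_height(box, depth_map, focal_length_px):
    """
    box:             [x1, y1, x2, y2] in pixels
    depth_map:       HxW metric depth (meters)
    focal_length_px: camera focal length in pixels
    """
    x1, y1, x2, y2 = box
    cx, cy = int((x1 + x2) / 2), int((y1 + y2) / 2)
    z = float(depth_map[cy, cx])               # object distance in meters
    pixel_height = y2 - y1
    # Pinhole model: real height ~ pixel height * depth / focal length.
    return pixel_height * z / focal_length_px

# e.g., a 160 px tall box at 2.5 m with f = 800 px -> 160 * 2.5 / 800 = 0.5 m
```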
VALOR demonstrates robust data scalability: accuracy correlates positively with the volume of RL queries and tasks and with the number of pseudo-labeled detection boxes. The annotation-free paradigm allows reward and supervision to scale without manual labeling.
7. Framework Extensibility and Limitations
Strengths:
- Zero manual annotation requirement for both logic and grounding.
- Modular design—verifiers and specialists are swappable to accommodate future advances.
- Empirical scaling with input queries and pseudo-labels.
Limitations:
- Verifier-induced errors and bias (especially VLM precision, LLM reward disagreement).
- Policy LLM scaling—complex hypothetical or indirect queries remain a challenge.
- Bias from pseudo-label seeds; rare object categories may be omitted by hard-negative mining.
Proposed Directions:
- Joint/interleaved RL and grounding updates for more expressive training dynamics.
- Reasoning-centric hard-negative mining to systematically probe RL model weaknesses.
- Expansion to non-VLM verifiers (e.g., physics engines, geometric simulators) for richer reward signals.
VALOR offers a minimal-label, scalable, modular, and empirically validated pipeline for advanced spatial visual reasoning, pushing the boundaries of program synthesis and combined LLM-VLM architectures (Marsili et al., 9 Dec 2025).