VALOR: Annotation-Free Visual Reasoning
- VALOR is an annotation-free training framework for visual reasoning that uses dual LLM and VLM verifiers to optimize logical spatial reasoning and object grounding.
- It employs a modular architecture in which an LLM verifier supplies rewards for reinforcement fine-tuning while a VLM verifier filters detections to generate pseudo-labels automatically.
- The framework achieves strong empirical performance across spatial benchmarks by decoupling reinforcement learning for planning from supervised fine-tuning for precise detection.
VALOR is an annotation-free training framework for visual reasoning that leverages multimodal verifiers to simultaneously optimize language-based spatial reasoning and visual grounding. It is specifically designed to address the entangled challenges of precise object grounding and robust logical reasoning in spatial tasks, achieving strong empirical performance without reliance on manually annotated supervision. The methodology utilizes LLMs and vision-language models (VLMs) as frozen verifiers to generate automated rewards and pseudo-labels, thereby enabling efficient reinforcement learning and supervised detector tuning. This paradigm enables performance and data scaling, modular specialization, and systematic error analysis across a wide spectrum of spatial reasoning benchmarks (Marsili et al., 9 Dec 2025).
1. Rationale and Paradigm Shift
Traditional visual reasoning pipelines depend on extensive datasets of labeled (image, query, answer) tuples. Such datasets require intensive expert effort to annotate spatial ground truth, bounding boxes, and logic traces. Prior language-only chain-of-thought methods rely on annotated answer supervision, while program-synthesis approaches use frozen models but suffer from flawed logic and weak grounding.
VALOR introduces an annotation-free regime with two key innovations:
- LLM Verifier: A frozen, high-capacity LLM acts as a reward generator for RL-based fine-tuning, providing binary feedback across several aspects of reasoning (decomposition, API tool calls, logical steps).
- VLM Verifier: A frozen VLM automates hard-negative mining and validates object grounding by filtering over-predicted detections, generating pseudo-labels for the detector without manual box annotations.
This hybrid verifier strategy targets the main failure modes—logic errors and misgrounded objects—thereby obviating the need for ground-truth supervision in both reasoning and grounding.
2. System Architecture
VALOR comprises four principal modules:
| Module | Role | Model(s) |
|---|---|---|
| Reasoning LLM | Policy over plans and Python code tool calls | Qwen3-8B (fine-tuned) |
| Vision Specialists | API suite for object detection, depth, appearance queries | GroundingDINO-T, MoGe2, GPT-5-mini |
| LLM Verifier | RL reward model giving multi-headed binary feedback | Gemini-2.5-Flash |
| VLM Verifier | Visual grounding critic; filters and pseudo-labels detections | GPT-5-mini |
- Reasoning Module: Accepts a query and outputs a natural-language plan plus corresponding Python code invoking vision APIs (gd_detect for object detection, depth for metric depth, vqa for appearance/existence).
- Vision Specialists: Enable modular decomposition of complex spatial tasks through explicit API calls.
- LLM Verifier: Supplies six binary reward heads—format, syntax, logic, attribute, spatial, adherence—aggregated into a scalar RL reward. Program traces and tool calls are checked for coherence, correctness, and consistency.
- VLM Verifier: Ingests aggressive object proposals from a low-threshold detector. Employs a three-stage filtering (coarse, crop-level, de-duplication) to derive positive and hard-negative boxes as detector pseudo-labels.
The architecture is modular. Specialists and verifiers are replaceable and extensible.
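To make the tool-calling interface concrete, the following is a minimal sketch of one (plan, code) trajectory the reasoning module might emit. The tool names gd_detect, depth, and vqa come from the paper's API description; their exact signatures, return types, and the example query are illustrative assumptions.

```python
# Illustrative (plan, code) trajectory for a query such as
# "Is the mug to the left of and closer than the laptop?"
# gd_detect/depth/vqa are the environment-provided vision tools;
# the signatures used here are assumptions.

# PLAN:
# 1. Detect the mug and the laptop with gd_detect.
# 2. Query metric depth at each box center with depth.
# 3. Compare horizontal positions (2D) and depths (3D) to answer.

def answer(image):
    mugs = gd_detect(image, "mug")            # list of boxes [x1, y1, x2, y2]
    laptops = gd_detect(image, "laptop")
    if not mugs or not laptops:
        # Fall back to a direct appearance/existence query if detection fails.
        return vqa(image, "Is the mug to the left of and closer than the laptop?")

    mug, laptop = mugs[0], laptops[0]
    mug_cx = (mug[0] + mug[2]) / 2
    laptop_cx = (laptop[0] + laptop[2]) / 2

    depth_map = depth(image)                  # metric depth, HxW array
    mug_z = depth_map[int((mug[1] + mug[3]) / 2), int(mug_cx)]
    laptop_z = depth_map[int((laptop[1] + laptop[3]) / 2), int(laptop_cx)]

    # Left in image coordinates and smaller metric depth means closer.
    return (mug_cx < laptop_cx) and (mug_z < laptop_z)
```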
3. Training Algorithm and Objectives
VALOR utilizes a two-phase, largely decoupled training schedule:
Phase I: LLM RL Fine-Tuning—Group Relative Policy Optimization (GRPO)
- Policy generates multiple (plan, code) trajectories per query.
- Verifier reward aggregates the six binary heads into a scalar, $R = \sum_{k=1}^{6} w_k\, r_k$ with $r_k \in \{0,1\}$ and per-head weights $w_k$.
- Policy update via GRPO (see the sketch after this list):

$$\mathcal{J}_{\mathrm{GRPO}}(\theta) = \mathbb{E}_{q,\ \{o_i\}\sim\pi_{\theta_{\mathrm{old}}}}\!\left[\frac{1}{G}\sum_{i=1}^{G}\min\!\Big(\rho_i(\theta)\,\hat{A}_i,\ \mathrm{clip}\big(\rho_i(\theta),\,1-\epsilon,\,1+\epsilon\big)\,\hat{A}_i\Big) - \beta\, D_{\mathrm{KL}}\!\big(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\big)\right]$$

with relative advantages $\hat{A}_i = \big(R_i - \mathrm{mean}(\{R_j\}_{j=1}^{G})\big)/\mathrm{std}(\{R_j\}_{j=1}^{G})$ and importance weights $\rho_i(\theta) = \pi_\theta(o_i \mid q)/\pi_{\theta_{\mathrm{old}}}(o_i \mid q)$.
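As a concrete illustration of Phase I, below is a minimal numerical sketch of how six binary verifier heads could be aggregated into a scalar reward and turned into group-relative advantages. Uniform head weights and the helper names are assumptions, not the paper's implementation.

```python
import numpy as np

# Minimal sketch of the Phase I reward signal and GRPO-style advantages.
HEADS = ["format", "syntax", "logic", "attribute", "spatial", "adherence"]

def scalar_reward(head_scores: dict, weights: dict | None = None) -> float:
    """Aggregate binary verifier heads (0/1) into one scalar reward."""
    weights = weights or {h: 1.0 / len(HEADS) for h in HEADS}  # uniform by assumption
    return sum(weights[h] * head_scores[h] for h in HEADS)

def group_relative_advantages(rewards: list[float]) -> np.ndarray:
    """GRPO: normalize rewards within the group of trajectories sampled for one query."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + 1e-8)

# Example: 4 sampled (plan, code) trajectories for a single query.
group = [
    {"format": 1, "syntax": 1, "logic": 1, "attribute": 1, "spatial": 1, "adherence": 1},
    {"format": 1, "syntax": 1, "logic": 0, "attribute": 1, "spatial": 0, "adherence": 1},
    {"format": 1, "syntax": 0, "logic": 0, "attribute": 0, "spatial": 0, "adherence": 0},
    {"format": 1, "syntax": 1, "logic": 1, "attribute": 0, "spatial": 1, "adherence": 1},
]
advantages = group_relative_advantages([scalar_reward(g) for g in group])
# Trajectories with above-average verifier scores receive positive advantages,
# which weight the clipped policy-gradient update on their tokens.
```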
Phase II: Detector SFT via VLM Verifier
- The final (or an intermediate) reasoning policy is queried on grounding prompts.
- Over-predicted detector boxes are filtered by the VLM verifier.
- Pseudo-labels (positives, hard negatives) are used for supervised fine-tuning of GroundingDINO with mixed regression and entropy loss.
Sequential alternation ensures that reasoning improvements cascade into grounding quality; the authors propose tighter interleaving or closed-loop RL/detection cycles as future work. A minimal sketch of the pseudo-label filtering appears below.
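The following sketch illustrates the Phase II pseudo-label generation step, assuming hypothetical detector and VLM callables and placeholder thresholds; only the three-stage structure (coarse image-level check, crop-level check, de-duplication) follows the paper's description.

```python
def iou(a, b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area = lambda box: (box[2] - box[0]) * (box[3] - box[1])
    return inter / (area(a) + area(b) - inter + 1e-8)

def pseudo_label(image, prompt, detector, vlm, dedup_iou=0.7):
    # Stage 0: over-predict with a low confidence threshold for high recall.
    proposals = detector(image, prompt, score_threshold=0.1)

    # Stage 1 (coarse): does the category appear in the image at all?
    if not vlm(image, f"Is there a {prompt} in this image?"):
        return [], proposals            # everything becomes a hard negative

    positives, negatives = [], []
    for box in proposals:
        # Stage 2 (crop-level): verify each proposal on its cropped region.
        crop = image.crop(box)
        (positives if vlm(crop, f"Is this a {prompt}?") else negatives).append(box)

    # Stage 3 (de-duplication): keep one box per highly overlapping cluster.
    deduped = []
    for box in positives:
        if all(iou(box, kept) < dedup_iou for kept in deduped):
            deduped.append(box)
    return deduped, negatives           # positives and hard negatives for detector SFT
```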
4. Implementation Specifics
- Base models standardized: Qwen3-8B for reasoning, GroundingDINO-T (Objects365+GLIP) for detection, MoGe2 for depth, Gemini-2.5-Flash for reward modeling.
- Prompting: System and reward prompts enforce plan/answer tags, 3D reasoning, and API signature templates. Verifier prompts specify logic consistency, spatial relationships, and adherence.
- Hard-Negative Mining: Lowered detector confidence thresholds yield high recall, and the VLM verifier prunes false positives based on semantic agreement. Hard negatives form the backbone of supervised correction in grounding (a loss sketch follows this list).
- Hyperparameters: GRPO optimization settings, batch sizes, reward head weights, and detection parameters are given in §A.13.
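To illustrate how the verified positives and hard negatives enter detector fine-tuning, here is a minimal sketch of a mixed regression-plus-classification objective. The specific loss form (plain L1 and binary cross-entropy), the weights, and the function signature are assumptions; GroundingDINO's actual objective additionally involves GIoU and contrastive alignment terms.

```python
import torch
import torch.nn.functional as F

def detector_sft_loss(pred_boxes, pred_logits, pos_boxes, pos_idx, neg_idx,
                      w_reg=1.0, w_cls=1.0):
    """
    pred_boxes:  (N, 4) predicted boxes for N queries
    pred_logits: (N,)   objectness/alignment logits
    pos_boxes:   (P, 4) pseudo-label boxes for the P positive matches
    pos_idx:     (P,)   indices of predictions matched to verified positives
    neg_idx:     (H,)   indices of predictions matched to hard negatives
    """
    # Regression term: L1 between matched predictions and pseudo-label boxes.
    reg = F.l1_loss(pred_boxes[pos_idx], pos_boxes)

    # Classification term: push positives toward 1 and hard negatives toward 0.
    logits = torch.cat([pred_logits[pos_idx], pred_logits[neg_idx]])
    targets = torch.cat([torch.ones(len(pos_idx)), torch.zeros(len(neg_idx))])
    cls = F.binary_cross_entropy_with_logits(logits, targets)

    return w_reg * reg + w_cls * cls
```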
5. Empirical Evaluation and Quantitative Analysis
VALOR is tested on a suite of spatial benchmarks:
| Task | Modalities | Benchmark Datasets |
|---|---|---|
| 3D reasoning | Vision, language | Omni3D-Bench, RoboSpatial, BLINK, VSR, RealWorldQA |
| 2D appearance | Vision, language | GQA, TallyQA, CountBenchQA |
- Comparative Baselines: LLM tool-users (GPT-4o, Gemini, Llama3), RL-tuned VLMs (GRIT, ViGoRL), program synthesis (VisProg, ViperGPT), direct-answer VLMs.
- Results: The LLM RL phase yields +6.4% on Omni3D-Bench and +3.4% on BLINK over the Qwen3-8B baseline; the full pipeline further improves RoboSpatial (+7.7%) and CountBenchQA (+8.3%). VALOR outperforms RL-tuned VLMs and program-synthesis baselines on reasoning-dominated tasks.
- Ablations: Removing the spatial reward head degrades VSR by 10%, and removing the logic head degrades Omni3D-Bench by 8%. Grounding SFT without preceding RL is clearly inferior, and performance scales with the number of queries and pseudo-labels.
6. Qualitative Error Analysis and Model Scalability
- Success Case: Hypothetical 3D queries (e.g., inferring the height of a coffee table) are solved by VALOR through explicit metric-depth computation and rescaling, outperforming 2D-only models (a minimal geometric sketch follows this list).
- Error Modes:
- Reasoning: Simplification of directional spatial relations may result in under-specification.
- Grounding: Small object detection fails under aggressive VLM filtering, leading to under-prediction or duplication.
- Verifier Reliability: Empirical verifier disagreement ~13% (LLM), VLM verifier precision ~75%. Verifier error is a principal bottleneck.
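As an illustration of the kind of metric computation behind such success cases, the sketch below estimates an object's real-world height from a 2D box and metric depth using a pinhole-camera approximation; the function, its inputs, and the example numbers are illustrative assumptions rather than VALOR's actual tool code.

```python
def metric_height(box, depth_map, focal_length_px):
    """
    box:             [x1, y1, x2, y2] in pixels
    depth_map:       HxW metric depth (meters)
    focal_length_px: camera focal length in pixels
    """
    x1, y1, x2, y2 = box
    cx, cy = int((x1 + x2) / 2), int((y1 + y2) / 2)
    z = float(depth_map[cy, cx])               # object distance in meters
    pixel_height = y2 - y1
    # Pinhole model: real height ~ pixel height * depth / focal length.
    return pixel_height * z / focal_length_px

# e.g., a 160 px tall box at 2.5 m with f = 800 px -> 160 * 2.5 / 800 = 0.5 m
```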
VALOR demonstrates robust data scalability: accuracy correlates positively with the volume of RL queries and tasks and with the number of pseudo-labeled detection boxes. The annotation-free paradigm allows reward and supervision to scale without manual labeling.
7. Framework Extensibility and Limitations
Strengths:
- Zero manual annotation requirement for both logic and grounding.
- Modular design—verifiers and specialists are swappable to accommodate future advances.
- Empirical scaling with input queries and pseudo-labels.
Limitations:
- Verifier-induced errors and bias (especially VLM precision, LLM reward disagreement).
- Policy LLM scaling—complex hypothetical or indirect queries remain a challenge.
- Bias from pseudo-label seeds; rare object categories may be omitted by hard-negative mining.
Proposed Directions:
- Joint/interleaved RL and grounding updates for more expressive training dynamics.
- Reasoning-centric hard-negative mining to systematically probe RL model weaknesses.
- Expansion to non-VLM verifiers (e.g., physics engines, geometric simulators) for richer reward signals.
VALOR offers a minimal-label, scalable, modular, and empirically validated pipeline for advanced spatial visual reasoning, pushing the boundaries of program synthesis and combined LLM-VLM architectures (Marsili et al., 9 Dec 2025).