
Self-Improving VLM Judges Without Human Annotations (2512.05145v1)

Published 2 Dec 2025 in cs.CV

Abstract: Effective judges of Vision-Language Models (VLMs) are crucial for model development. Current methods for training VLM judges mainly rely on large-scale human preference annotations. However, such an approach is costly, and the annotations easily become obsolete as models rapidly improve. In this work, we present a framework to self-train a VLM judge model without any human preference annotations, using only self-synthesized data. Our method is iterative and has three stages: (1) generate diverse multimodal instruction-response pairs at varying quality levels, (2) generate reasoning traces and judgments for each pair, removing the ones that do not match our expected quality levels, and (3) train on correct judge answers and their reasoning traces. We evaluate the resulting judge on Multimodal RewardBench and VL-RewardBench across domains: correctness, preference, reasoning, safety, and visual question-answering. Our method improves a Llama-3.2-11B multimodal judge from 0.38 to 0.51 in overall accuracy on VL-RewardBench, often outperforming much larger models including Llama-3.2-90B, GPT-4o, and Claude 3.5 Sonnet, with particularly strong gains in general, hallucination, and reasoning dimensions. The overall strength of these human-annotation-free results suggests the potential for a future self-judge that evolves alongside rapidly improving VLM capabilities.

Summary

  • The paper introduces a self-supervised approach that iteratively refines VLM judges using synthetic preference pairs.
  • The methodology leverages majority voting and synthetic data filtering to achieve significant accuracy gains, with up to a 40.5% relative improvement on VL-RewardBench (VLRB) metrics.
  • The study demonstrates that an 11B parameter VLM judge can rival larger industry models, though challenges remain in safety and generalization.

Self-Improving Vision-Language Model Judges Without Human Annotations

Introduction and Motivation

Vision-Language Models (VLMs) have become central to multimodal content generation and evaluation. Advancing their practical usability and alignment hinges on training robust reward models (judges) capable of automatically assessing model outputs. Existing paradigms for building VLM judges depend heavily on large-scale human preference annotation or on distillation from proprietary, closed-source models, a process that is both expensive and susceptible to rapid obsolescence given the pace of model advancement. "Self-Improving VLM Judges Without Human Annotations" (2512.05145) introduces an alternative, data-centric paradigm: self-supervised judge training via strategically constructed synthetic preference pairs, obviating the need for any human preference supervision.

Methodology: Iterative Synthetic Self-Improvement

The proposed framework consists of a three-stage iterative self-improvement loop: synthetic preference pair generation, judge output filtering with reasoning trace sampling, and judge model fine-tuning. This architecture leverages the VLM's own generations and the current judge's evaluations to drive progressive self-refinement without any external annotation or teacher model (Figure 1).

Figure 1: Schematic of the iterative synthetic data generation and judge self-training pipeline for both open- and closed-solution tasks.

Synthetic Data Generation

Task types are split by answer structure: open-ended (e.g., captions, reasoning) and closed-ended (e.g., multiple choice, numerical, short-form responses).

  • Open-ended tasks: The system generates an original response and then synthesizes a "degraded" version by explicitly injecting semantic errors (e.g., altering object attributes, swapping relations, or modifying quantitative details), creating synthetic preference pairs with known label orientation.
  • Closed-ended tasks: The model samples multiple candidate responses; majority voting designates the most common answer as the "preferred" response, paired with a randomly sampled alternative. This exploits the baseline model's own response consistency, avoiding ground-truth label dependence.
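
To make the two pairing strategies concrete, here is a minimal Python sketch. It assumes a hypothetical `vlm.generate(image, prompt, ...)` call for the underlying model and illustrates the pairing logic only; it is not the authors' exact implementation.

```python
from collections import Counter
import random

def build_open_ended_pair(vlm, image, prompt):
    """Open-ended tasks: the original response is preferred over a deliberately degraded copy."""
    original = vlm.generate(image, prompt)  # e.g., a caption or free-form explanation
    degrade_prompt = ("Rewrite the following answer, introducing small factual errors "
                      "(wrong attributes, swapped relations, changed counts):\n" + original)
    degraded = vlm.generate(image, degrade_prompt)
    return {"preferred": original, "rejected": degraded}

def build_closed_ended_pair(vlm, image, question, n_samples=8):
    """Closed-ended tasks: the majority-voted answer is preferred; a disagreeing sample is rejected."""
    answers = [vlm.generate(image, question, temperature=1.0) for _ in range(n_samples)]
    majority, _ = Counter(answers).most_common(1)[0]
    alternatives = [a for a in answers if a != majority]
    if not alternatives:  # all samples agree, so no contrastive pair can be formed
        return None
    return {"preferred": majority, "rejected": random.choice(alternatives)}
```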

Iterative Data Curation and Model Training

After each synthetic batch, the previous-iteration judge model is tasked with evaluating the new preference pairs, generating natural language reasoning traces and binary verdicts. Only cases where the model's verdict aligns with the synthetic preference are retained, and positional bias is mitigated by requiring correct judgments in both response orderings. The curated set of (image, question, pair, reasoning trace, binary label) samples is then used for supervised fine-tuning of the judge model, and the process repeats.
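
A sketch of this curation step, assuming a hypothetical `judge.evaluate(...)` call that returns a reasoning trace and a verdict of "A" or "B"; the key detail is that a sample survives only if the judge is correct under both response orderings:

```python
def keep_if_consistent(judge, image, question, pair):
    """Retain a synthetic pair only if the judge prefers the known-better response in both orderings."""
    trace_ab, verdict_ab = judge.evaluate(image, question, pair["preferred"], pair["rejected"])
    _, verdict_ba = judge.evaluate(image, question, pair["rejected"], pair["preferred"])
    # Requiring correctness with the preferred response shown first and second mitigates position bias.
    if verdict_ab == "A" and verdict_ba == "B":
        return {"image": image, "question": question,
                "response_a": pair["preferred"], "response_b": pair["rejected"],
                "reasoning": trace_ab, "label": "A"}
    return None  # discard incorrect or position-inconsistent judgments
```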

Experimental Results and Quantitative Analysis

Experiments center on Llama-3.2-11B as the base judge, evaluated on VL-RewardBench (Li et al., 26 Nov 2024) and Multimodal RewardBench (Yasunaga et al., 20 Feb 2025). The iterative self-supervised process yields substantial improvements:

  • VLRB overall accuracy increases from 0.38 to 0.54 after 4 iterations, a 40.5% relative gain.
  • MMRB overall accuracy increases from 0.50 to 0.54, a 7.5% gain.
  • On VLRB general instruction following, the 11B judge outperforms both Llama-3.2-90B and Claude-3.5 Sonnet, achieving scores of 0.50+.
  • For hallucination detection and VQA, the model approaches or exceeds much larger systems, despite starting from a much weaker baseline (Figure 2).

    Figure 2: Iterative judge training improves VLRB/MMRB metrics, matching or exceeding large closed models after 4 iterations.

Ablation analyses show that majority voting for synthetic data filtering regularly outperforms ground-truth-based filtering when both are held to equivalent data sizes, suggesting that model-consistency signals are a highly effective form of supervision and may surface reasoning pathologies that correctness-only selection misses (Figure 3).

Figure 3: Majority voting for preference pair construction yields consistent improvement over gold-label filtering in reasoning and VQA.

Dimension-Specific Gains, Limitations, and Failure Modes

Performance improvement is not uniform across metrics. VLRB General and Hallucination detection benefit most from iterative self-training, with VLRB General seeing a 69% jump and Hallucination 41%. However, reasoning, safety, and general evaluation on some benchmarks show minimal or non-monotonic improvement (Figure 4). A dimension-level analysis reveals:

Figure 4: Large improvement is concentrated on VLRB General and Hallucination; MMRB Safety and General show little advance.

  • Synthetic data methodology (detail alteration for general/hallucination; majority voting for closed-form VQA) is most effective in domains structurally similar to the synthetic errors injected.
  • Safety evaluation is limited by a lack of explicit toxic/biased data in self-generated samples—a natural deficit of the generative pipeline.
  • Diminishing returns are observed after 3-4 iterations, highlighting an inherent limit in self-improvement absent external signals or increased task/image diversity.

Additional analysis of reasoning trace quality across iterations demonstrates not only higher verdict accuracy but also a marked improvement in the granularity and correctness of the natural-language rationales, supporting the claim that the gains cannot be attributed solely to overfitting on superficial signals.

Implications and Future Directions

This work demonstrates that VLM judge models can be iteratively improved using only the model's own generative and evaluative capabilities, bypassing human annotation and large-teacher distillation entirely. The finding that an 11B-parameter VLM judge can rival or outperform industry-scale models on established reward benchmarks challenges the common assumption that human or external teacher signals are indispensable for training high-fidelity evaluators.

From a theoretical angle, these findings reinforce the plausibility of self-bootstrapping reward models in multimodal tasks, highlighting the signal sufficiency of model-consistency heuristics (e.g., majority voting) even in the absence of ground-truth supervision. In practical terms, this opens pathways for efficient, scalable training of reward models for novel, data-scarce, or privacy-restricted domains—so long as the underlying generative models possess sufficient base proficiency for meaningful synthetic contrast generation.

Outstanding open problems include robust safety evaluation under adversarial content, domain generalization to high-diversity distribution shifts, and the integration of self-improving frameworks with mixture-of-experts and routing architectures that can address performance variance across judgment dimensions.

Conclusion

The methods introduced in "Self-Improving VLM Judges Without Human Annotations" push forward the frontier in unsupervised, scalable reward model training for multimodal systems. Iterative self-correction via synthetic preference generation and reasoning-based filtering allows compact VLMs to achieve evaluation performance comparable to closed, much larger models without reliance on costly or quickly outdated human preference signals. The results suggest substantial autonomy for future VLM reward model development, though further research is warranted for safety-critical and highly complex generalization tasks.

Citation: "Self-Improving VLM Judges Without Human Annotations" (2512.05145).

Explain it Like I'm 14

Explaining “Self-Improving VLM Judges Without Human Annotations”

Overview

This paper is about teaching a computer model to be a fair judge of other computer models’ answers to questions about images. The special part: it learns to judge without any help from humans. Instead, it creates its own practice data and improves itself step by step.

The model is a “VLM judge,” which means it can read text and look at images (it is a Vision-Language Model), and then decide which of two answers is better. This helps make other AI systems more trustworthy and more aligned with what people want.

Key Objectives

The paper aims to:

  • Show that a VLM can train itself to be a good judge using only its own generated answers—no human labels needed.
  • Build a simple system to create “practice pairs” of answers where one is better than the other.
  • Improve the judge model over several rounds, using its own reasoning to learn what makes a good evaluation.
  • Test how well this self-trained judge works compared to bigger, more expensive models.

Methods (in everyday language)

Think of training a judge like training a referee who needs lots of practice matches with clear winners and losers. The trick here is to create those practice matches automatically.

The approach has three main parts:

1) Making practice answer pairs with a known “better” choice

There are two types of questions:

  • Closed-ended questions: These have short, clear answers (like picking A/B/C/D or a number).
  • Open-ended questions: These need longer answers (like writing a caption or explaining a scene).

To create training examples:

  • Closed-ended (short answers): The model answers the question many times. The most common answer is treated as the “likely better” one. A different, randomly chosen answer becomes the “worse” one. This is like asking the same question to a crowd and trusting the majority.
  • Open-ended (long answers): The model first writes a normal answer (like a caption). Then it creates a second version with deliberate mistakes—changing details like colors, numbers, or object positions. For example, turning “a red car” into “a blue car” when the car is actually red in the image. The original answer is preferred over the altered one. This builds realistic examples where one answer is clearly better.

This way, the team always knows which answer in the pair should be preferred, without needing human judges.

2) Collecting the judge’s reasoning and filtering it

For each pair, the current judge model explains its decision and picks which answer is better. The system keeps only the cases where the judge agrees with the known preferred choice. It also checks for “position bias” (some models just prefer the first answer), so it swaps the order and only keeps cases where the judge is correct both ways. This helps make sure the judge is learning based on solid reasoning, not lucky guessing.

3) Training the judge and repeating

The judge is fine-tuned (trained) on these filtered examples and the reasoning it produced. Then the whole process repeats several times, each time with the improved judge generating better reasoning and better training data. This is like a coach who reviews only the correct plays to teach better habits, over multiple practice rounds.
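
Putting the three parts together, the whole loop might look roughly like the sketch below. It reuses the hypothetical pair-building and filtering helpers sketched in the earlier methodology section and assumes a `judge.finetune(...)` method; it is a simplified illustration, not the authors' exact code.

```python
def self_improve(judge, vlm, tasks, num_iterations=4):
    """Build practice pairs, keep only double-checked correct verdicts, fine-tune, and repeat."""
    for _ in range(num_iterations):
        training_set = []
        for image, question, is_open_ended in tasks:
            pair = (build_open_ended_pair(vlm, image, question) if is_open_ended
                    else build_closed_ended_pair(vlm, image, question))
            if pair is None:
                continue
            sample = keep_if_consistent(judge, image, question, pair)
            if sample is not None:
                training_set.append(sample)
        judge = judge.finetune(training_set)  # train on the kept verdicts and their reasoning
    return judge
```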

Main Findings and Why They Matter

The researchers trained a relatively small model (Llama-3.2-11B Vision) to be a judge using this method and tested it on standard benchmarks:

  • On VL-RewardBench (VLRB), the judge’s accuracy rose from about 0.38 to about 0.54 after four iterations. That’s a big improvement, and it sometimes beat much larger models, including a 90B-parameter version and a popular commercial model (Claude 3.5 Sonnet), especially on general instruction-following.
  • On Multimodal RewardBench (MMRB), the judge went from about 0.50 to about 0.54. It improved notably on VQA (Visual Question Answering) and stayed competitive on other tasks.

What improved most:

  • General instruction-following and hallucination detection (spotting when a model mentions things not in the image) had strong gains. This means the judge got better at preferring answers that are grounded in what’s actually visible.
  • VQA (short answers about images) also improved, showing the “majority vote” trick works well.

Where improvements were smaller:

  • Safety-related judgments (detecting harmful or inappropriate content) improved only a little because the training didn’t include special safety-focused examples.
  • Some reasoning tasks peaked around the third iteration and then leveled off, suggesting too many rounds can lead to overfitting.

Why this matters:

  • It shows you don’t need expensive human labels to train a useful judge.
  • A smaller, cheaper model can learn to evaluate well enough to rival bigger, pricier systems on some tasks.
  • The method works on new image collections where there are no correct answers provided.

Implications and Potential Impact

  • Cheaper and faster training: Because it needs no human annotations, this approach can be used widely to build judges for many tasks.
  • Useful for new domains: It can handle brand-new image sets or topics where ground-truth answers don’t exist.
  • Better alignment: As a judge gets better, it can help improve other models by rewarding accurate, well-grounded answers and discouraging hallucinations.

Future directions:

  • Safety: To truly improve safety judgments, the method should include synthetic examples designed to test for bias, toxicity, or policy violations.
  • Diversity: Using more varied images and tasks could help the judge generalize better across different visual domains.
  • Specialized judges: Building multiple “expert” judges for specific skills (like reasoning, safety, or factual accuracy) and routing tasks to the right expert could boost performance further.

Overall, this paper shows a practical, clever way to train a strong multimodal judge using self-generated data, making high-quality AI evaluation more accessible and scalable.
