Verifying whether majority voting eliminates imperfect reasoning in self-improving VLM judges

Determine whether majority-voting-based filtering of synthetic preference pairs for closed-ended tasks, as used in the self-improving vision-language model (VLM) judge training framework, eliminates all imperfect reasoning in the judge's decisions.

Background

The method constructs preference pairs for closed-ended tasks by sampling multiple responses from the base VLM and selecting the majority answer as the preferred option. It then uses the current judge model to filter the data, training only on reasoning traces that align with these synthetic preferences.
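The majority-vote step can be illustrated with a minimal sketch. This is not the authors' implementation: the function name, the string representation of answers, and the tie-breaking behavior of `Counter.most_common` are all assumptions made for illustration.

```python
from collections import Counter

def build_preference_pair(answers):
    """Given sampled answers (as canonicalized strings) from the base VLM
    for one closed-ended question, treat the majority answer as 'preferred'
    and any dissenting answer as 'rejected'.

    Returns None when all samples agree, since no contrastive pair exists.
    """
    counts = Counter(answers)
    preferred, _ = counts.most_common(1)[0]
    rejected = next((a for a in answers if a != preferred), None)
    if rejected is None:
        return None  # unanimous vote: nothing to contrast against
    return {"preferred": preferred, "rejected": rejected}

# Example: five sampled answers to one question; "B" wins the vote.
pair = build_preference_pair(["B", "B", "C", "B", "A"])
# pair == {"preferred": "B", "rejected": "C"}
```

Note that the majority answer is only a proxy for correctness: when the base VLM is systematically wrong, the vote can prefer an incorrect answer, and even a correct majority answer says nothing about the quality of the accompanying reasoning trace, which is the uncertainty the authors flag.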

In the appendix, the authors present a counterexample in which the judge selects the correct answer but provides flawed reasoning. They explicitly state that they cannot conclusively verify that majority voting eliminates all imperfect reasoning, leaving open a question about the reasoning robustness induced by this filtering strategy.

References

While we cannot conclusively verify that majority voting eliminates all imperfect reasoning, the empirical advantages shown in Section~\ref{sec:analysis_reasoning} suggest that consistency-based filtering may provide more robust supervision than correctness checking alone for learning generalizable judgment criteria.

Self-Improving VLM Judges Without Human Annotations (2512.05145 - Lin et al., 2 Dec 2025) in Appendix: Correctness Filter Negative Example