Safety evaluation in self-improving VLM judge frameworks

Develop principled approaches for safely generating and learning from examples of harmful content within self-improving vision-language model (VLM) judge training frameworks, which construct synthetic preference data and iteratively fine-tune judges, so that judges acquire effective safety evaluation capabilities without the training loop amplifying the risks of handling harmful material.

Background

The paper introduces a self-improving framework for training vision-language model (VLM) judges using synthetic preference pairs and iterative fine-tuning, without human annotations. While the approach improves performance across several benchmarks, the authors report only modest gains on safety evaluation tasks and explicitly avoid generating biased or toxic content during training.
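
To make the training-loop structure concrete, below is a minimal sketch of one self-improvement round with an explicit safety gate on synthetic data. All names (PreferencePair, generate_responses, judge_rank, is_safe_for_training) are hypothetical placeholders, not the paper's implementation; the sketch only illustrates where a filter on harmful generations could sit in the loop described above.

```python
"""Hypothetical sketch: one round of synthetic preference-pair construction
with a safety gate, for a self-improving VLM judge. Not the paper's pipeline."""
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class PreferencePair:
    image_ref: str   # reference to the image input
    prompt: str      # instruction shown to the responding model
    chosen: str      # response the current judge prefers
    rejected: str    # response the current judge disprefers


def self_improvement_round(
    generate_responses: Callable[[str, str], List[str]],     # (image_ref, prompt) -> candidates
    judge_rank: Callable[[str, str, List[str]], List[int]],  # candidate indices, best first
    is_safe_for_training: Callable[[str], bool],             # safety gate on raw text
    seed_tasks: List[Tuple[str, str]],                       # [(image_ref, prompt), ...]
) -> List[PreferencePair]:
    """Build synthetic preference pairs, dropping unsafe generations."""
    pairs: List[PreferencePair] = []
    for image_ref, prompt in seed_tasks:
        candidates = generate_responses(image_ref, prompt)
        # Safety gate: flagged generations never enter the training pool,
        # even as "rejected" examples, so harmful text is not amplified
        # across fine-tuning iterations.
        candidates = [c for c in candidates if is_safe_for_training(c)]
        if len(candidates) < 2:
            continue  # not enough safe candidates to form a pair
        order = judge_rank(image_ref, prompt, candidates)
        pairs.append(PreferencePair(
            image_ref=image_ref,
            prompt=prompt,
            chosen=candidates[order[0]],
            rejected=candidates[order[-1]],
        ))
    return pairs
```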

In the Limitations section, the authors emphasize that robust safety evaluation typically requires specialized infrastructure, such as red-teaming pipelines and adversarial data, and they state that developing effective safety evaluation capabilities within self-improving frameworks remains an open challenge: any solution must handle harmful content safely without amplifying risks.
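
One hedged design direction, not taken from the paper, is to keep red-team or adversarial material quarantined so that harmful text appears only as an input the judge must score, never as a generation target. The schema and function names below are assumptions for illustration.

```python
"""Hypothetical sketch: converting quarantined red-team items into judge
training examples whose targets are safety verdicts, not free-form text."""
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class RedTeamItem:
    image_ref: str
    prompt: str
    response: str                  # potentially harmful response under review
    harm_labels: List[str] = field(default_factory=list)  # e.g. ["toxicity"]
    quarantined: bool = True       # marker: keep out of generation-style SFT data


def build_safety_judge_examples(items: List[RedTeamItem]) -> List[Dict]:
    """Harmful text is placed only in the judge's input; the training target
    is a structured verdict, so the judge is never trained to reproduce it."""
    examples: List[Dict] = []
    for item in items:
        examples.append({
            "judge_input": {
                "image_ref": item.image_ref,
                "prompt": item.prompt,
                "response_under_review": item.response,
            },
            "judge_target": {
                "verdict": "unsafe" if item.harm_labels else "safe",
                "harm_labels": item.harm_labels,
            },
        })
    return examples
```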

References

Developing effective safety evaluation capabilities within self-improving frameworks remains an important open challenge, one that will require principled approaches for safely generating and learning from examples of harmful content without amplifying risks.

Self-Improving VLM Judges Without Human Annotations (2512.05145 - Lin et al., 2 Dec 2025) in Section: Limitations — Safety Evaluation Limitations