Safety evaluation in self-improving VLM judge frameworks
Develop principled approaches for safely generating and learning from examples of harmful content within self-improving vision-language model (VLM) judge training frameworks, which construct synthetic preference data and iteratively fine-tune the judge. The goal is to build effective safety evaluation capabilities without amplifying the risks posed by the harmful examples themselves.
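One plausible ingredient of such an approach is a safety gate on the synthetic preference data itself: harmful candidates are scored and filtered before they can become training pairs for the next judge iteration. The sketch below is purely illustrative; `Candidate`, `build_preference_pairs`, the `safety_score` field, and `SAFETY_THRESHOLD` are assumed names, not APIs from the cited paper.

```python
# Hypothetical sketch of one self-improvement round with a safety gate.
# All names here are illustrative assumptions, not the paper's method.

from dataclasses import dataclass

SAFETY_THRESHOLD = 0.5  # assumed cutoff; in practice set by deployment policy


@dataclass
class Candidate:
    text: str
    judge_score: float   # quality score from the current judge
    safety_score: float  # e.g., from a separate moderation model


def build_preference_pairs(candidates):
    """Keep only candidates that clear the safety filter, then form
    (chosen, rejected) pairs from the judge's quality ranking."""
    safe = [c for c in candidates if c.safety_score >= SAFETY_THRESHOLD]
    safe.sort(key=lambda c: c.judge_score, reverse=True)
    # Pair best-vs-worst within the safe pool, so harmful text never
    # enters the synthetic preference data used for fine-tuning.
    return [(safe[i], safe[-1 - i]) for i in range(len(safe) // 2)]


if __name__ == "__main__":
    pool = [
        Candidate("helpful answer", judge_score=0.9, safety_score=0.95),
        Candidate("mediocre answer", judge_score=0.4, safety_score=0.9),
        Candidate("harmful answer", judge_score=0.8, safety_score=0.1),
    ]
    pairs = build_preference_pairs(pool)
    # The high-quality but unsafe candidate is excluded from every pair.
    for chosen, rejected in pairs:
        assert chosen.safety_score >= SAFETY_THRESHOLD
        assert rejected.safety_score >= SAFETY_THRESHOLD
```

The open challenge noted in the limitation is precisely that such filtering is a blunt instrument: a judge that never sees harmful content may fail to evaluate it, so the filter must be balanced against controlled exposure.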
References
"Developing effective safety evaluation capabilities within self-improving frameworks remains an important open challenge, one that will require principled approaches for safely generating and learning from examples of harmful content without amplifying risks."
— Self-Improving VLM Judges Without Human Annotations
(2512.05145 - Lin et al., 2 Dec 2025) in Section: Limitations — Safety Evaluation Limitations