Cause of performance degradation with large prompt generation sizes

Determine the underlying mechanisms that cause performance to plateau or degrade when BiomedXPro’s evolutionary prompt optimization generates a large number of prompt pairs per mutation step (e.g., K_t = 50) rather than a moderate or small number (e.g., K_t = 10 or K_t = 5). Characterize the conditions under which larger generation sizes can be used without loss of performance in the BiomedCLIP-based evaluation pipeline.

Background

In the ablation analysis of BiomedXPro, the authors varied the number of prompt pairs produced per mutation iteration (K_t) and observed distinct convergence behaviors: generating only 5 pairs slowed convergence, whereas generating 50 pairs led to an early strong start followed by performance plateauing. A moderate setting (K_t = 10) provided the most stable improvements.

The paper notes that the precise reason for this degradation with larger generation sizes is not understood. Since BiomedXPro relies on an LLM to produce candidate prompts and on a VLM (BiomedCLIP) to score them, several interacting factors (e.g., LLM recency bias, prompt quality variance, selection pressure, or evaluation noise) could contribute to the observed plateau. Clarifying these mechanisms would guide principled choices of K_t and potentially enable scaling without sacrificing performance.
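One of the candidate factors above, evaluation noise interacting with selection, can be illustrated with a toy simulation. The sketch below is not BiomedXPro's pipeline and makes no claim about the actual cause of the plateau: it assumes each candidate has a hypothetical "true quality" drawn from a Gaussian, that the evaluator returns this quality plus independent Gaussian noise, and that the top-scoring candidate is kept. The function name `selection_gap` and all parameters are illustrative assumptions. Under these assumptions, the gap between the winner's observed score and its true quality grows with the number of candidates, which is one way a larger K_t could make apparent gains less reliable.

```python
import random
import statistics


def selection_gap(k, noise_sd=1.0, quality_sd=1.0, trials=2000, seed=0):
    """Mean (observed score - true quality) of the top-scoring candidate among k.

    Toy model only: 'quality' and 'score' are hypothetical stand-ins,
    not quantities computed by BiomedXPro or BiomedCLIP.
    """
    rng = random.Random(seed)
    gaps = []
    for _ in range(trials):
        best_score, best_quality = float("-inf"), 0.0
        for _ in range(k):
            quality = rng.gauss(0.0, quality_sd)        # assumed true usefulness
            score = quality + rng.gauss(0.0, noise_sd)  # assumed noisy evaluation
            if score > best_score:
                best_score, best_quality = score, quality
        # How much the winner's score overstates its true quality
        gaps.append(best_score - best_quality)
    return statistics.mean(gaps)


if __name__ == "__main__":
    for k in (5, 10, 50):
        print(f"K_t = {k:2d}: mean overestimate = {selection_gap(k):.3f}")
```

In this toy setting the overestimate shrinks if `noise_sd` is reduced relative to `quality_sd`, which suggests one testable condition: larger generation sizes may be safer when candidates are scored on more data or with lower-variance metrics. Whether this applies to the BiomedCLIP-based pipeline remains an open question.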

References

As shown in the generation-size ablation figure, generating only 5 pairs leads to slow convergence, while a large set of 50 causes performance to plateau quickly after a strong start. While the precise reason for this degradation with a large generation size is unclear, this finding is consistent with prior work by Yang et al., who also observed that a moderate number of generated instructions was optimal.

— BiomedXPro: Prompt Optimization for Explainable Diagnosis with Biomedical Vision Language Models (2510.15866 - Silva et al., 17 Oct 2025) in Ablation studies: Impact of generation size per iteration
