Cause of performance degradation with large prompt generation sizes
Determine the underlying mechanisms that cause performance to plateau or degrade when BiomedXPro’s evolutionary prompt optimization generates a large number of prompt pairs per mutation step (e.g., $K_t = 50$), compared to moderate or small generation sizes (e.g., $K_t = 10$ or $K_t = 5$), and characterize conditions under which larger generation sizes can be employed without loss of performance in the BiomedCLIP-based evaluation pipeline.
References
As shown in \cref{fig:generation_size_ablation}, generating only 5 pairs leads to slow convergence, whereas generating 50 pairs causes performance to plateau quickly after a strong start. Although the precise reason for this degradation with a large generation size is unclear, the finding is consistent with prior work by Yang \etal., who also observed that a moderate number of generated instructions was optimal.