Empirical validation of PACIFIC’s contamination resistance

Establish, through systematic empirical evaluation across diverse benchmark configurations and usage scenarios, whether the contamination-resistance property of the PACIFIC benchmark-generation framework holds, and to what extent its mechanisms (e.g., seed-based resampling and representation diversity) effectively mitigate training-data contamination.

Background

PACIFIC is designed to generate scalable, contamination-resilient benchmarks for evaluating instruction-following and code dry-running capabilities in LLMs. The framework claims contamination resistance via mechanisms such as seed-based resampling and representation diversity (e.g., code vs. natural language, prompt vs. chat), which allow it to generate previously unseen benchmark variants while preserving difficulty.
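
To make the resampling mechanism concrete, the following Python sketch instantiates a single dry-running task template from a seed. This is a minimal illustration under stated assumptions: the template, parameter ranges, and answer format are hypothetical stand-ins, not PACIFIC's actual generators.

    import random

    # Illustrative task template (an assumption, not PACIFIC's actual
    # generator): a tiny code dry-running question whose surface form
    # varies with the seed while difficulty stays roughly constant.
    TEMPLATE = (
        "Dry-run the following code and report the final value of {var}:\n"
        "{var} = {a}\n"
        "for _ in range({n}):\n"
        "    {var} += {step}\n"
    )

    def resample(seed: int) -> dict:
        """Deterministically generate one benchmark variant from a seed."""
        rng = random.Random(seed)
        a, n, step = rng.randint(0, 9), rng.randint(2, 6), rng.randint(1, 5)
        var = rng.choice(["x", "total", "acc"])
        return {
            "seed": seed,
            "prompt": TEMPLATE.format(var=var, a=a, n=n, step=step),
            "answer": a + n * step,  # ground truth is computed, so it stays checkable
        }

    if __name__ == "__main__":
        # Distinct seeds yield distinct, previously unseen variants.
        for s in (0, 1, 2):
            v = resample(s)
            print(f"seed={v['seed']} answer={v['answer']}")

Because the ground-truth answer is computed at generation time, each resampled variant remains automatically checkable, which is what keeps large-scale validation feasible.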

However, the paper explicitly notes that contamination resistance has not been empirically validated across all scenarios. Confirming this property requires a systematic evaluation to determine whether PACIFIC’s design indeed prevents memorization or training-data leakage effects across different models, configurations, and benchmark variants.
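
One possible protocol for such an evaluation is sketched below under assumed interfaces: a model is any callable mapping a prompt to an answer, and resample_fn is a variant generator like the one above. It compares accuracy on potentially-seen canonical items against freshly resampled variants of the same templates; a gap near zero is consistent with contamination resistance, while a large positive gap would suggest memorization rather than genuine capability.

    from statistics import mean
    from typing import Callable

    def accuracy(model: Callable[[str], object], items: list[dict]) -> float:
        """Exact-match accuracy of a prompt -> answer callable (assumed interface)."""
        return mean(1.0 if model(it["prompt"]) == it["answer"] else 0.0 for it in items)

    def contamination_gap(model, canonical_items, resample_fn, fresh_seeds) -> float:
        """Accuracy on potentially-seen canonical items minus accuracy on
        unseen seed-resampled variants of the same templates."""
        fresh_items = [resample_fn(s) for s in fresh_seeds]
        return accuracy(model, canonical_items) - accuracy(model, fresh_items)

In a full study, this gap would be computed per model and per representation (code vs. natural language, prompt vs. chat) and tested for statistical significance, so that a memorization effect can be distinguished from ordinary variance across benchmark variants.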

References

Contamination resistance, a core design objective of PACIFIC, has not yet been empirically validated across all scenarios. While the framework’s construction is intended to minimize contamination risk, confirming this property through systematic evaluation remains an important direction for future work.

PACIFIC: a framework for generating benchmarks to check Precise Automatically Checked Instruction Following In Code (2512.10713 - Dreyfuss et al., 11 Dec 2025) in Section 7, Threats to validity