Generalization of the relationship between instruction count and performance to complex instruction types
Instruction-following performance degrades as the number of simultaneous instructions increases, as measured on the ManyIFEval (text generation with up to 10 instructions) and StyleMBPP (code generation with up to 6 style instructions) benchmarks. Determine whether this degradation persists for more complex instruction types that these benchmarks do not cover, including semantic instructions, conditional logic (if/then rules), and multi-step procedural requirements.
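One way to frame the comparison is to tabulate success rate per instruction type and instruction count, then check whether the downward trend seen on the existing benchmarks reappears for the new types. The following is a minimal sketch, not the benchmarks' actual evaluation code. It assumes a response counts as a success only if every instruction is followed. The function name `success_rate_by_count`, the record layout, and the toy records are all hypothetical and are not real results.

```python
from collections import defaultdict


def success_rate_by_count(records):
    """Aggregate success rates per (instruction_type, n_instructions).

    `records` is an iterable of (instruction_type, n_instructions, all_followed)
    tuples. This layout is an assumption made for illustration only.
    """
    totals = defaultdict(lambda: [0, 0])  # key -> [successes, trials]
    for instruction_type, n_instructions, all_followed in records:
        cell = totals[(instruction_type, n_instructions)]
        cell[0] += int(all_followed)
        cell[1] += 1
    return {key: successes / trials for key, (successes, trials) in totals.items()}


# Hypothetical toy records, NOT real benchmark results.
records = [
    ("conditional", 1, True),
    ("conditional", 1, True),
    ("conditional", 3, True),
    ("conditional", 3, False),
    ("procedural", 1, True),
    ("procedural", 3, False),
]

rates = success_rate_by_count(records)
for (instruction_type, n_instructions), rate in sorted(rates.items()):
    print(f"{instruction_type:12s} n={n_instructions} success_rate={rate:.2f}")
```

Comparing the per-type curves against the ManyIFEval and StyleMBPP curves would indicate whether the relationship generalizes. A real study would need enough samples per cell for the comparison to be meaningful.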
References
Although our study has systematically analyzed multiple-instructions-following ability of LLMs, several important questions remain for future work. First, whether similar relationships between instruction count and performance hold for more complex instruction types not covered in our benchmarks, such as semantic instructions, conditional logic, or multi-step procedures.