
Revisiting Compositional Generalization Capability of Large Language Models Considering Instruction Following Ability (2506.15629v1)

Published 18 Jun 2025 in cs.CL and cs.AI

Abstract: In generative commonsense reasoning tasks such as CommonGen, generative LLMs compose sentences that include all given concepts. However, when focusing on instruction-following capabilities, if a prompt specifies a concept order, LLMs must generate sentences that adhere to the specified order. To address this, we propose Ordered CommonGen, a benchmark designed to evaluate the compositional generalization and instruction-following abilities of LLMs. This benchmark measures ordered coverage to assess whether concepts are generated in the specified order, enabling a simultaneous evaluation of both abilities. We conducted a comprehensive analysis using 36 LLMs and found that, while LLMs generally understand the intent of instructions, biases toward specific concept order patterns often lead to low-diversity outputs or identical results even when the concept order is altered. Moreover, even the most instruction-compliant LLM achieved only about 75% ordered coverage, highlighting the need for improvements in both instruction-following and compositional generalization capabilities.

Summary

Evaluation of Compositional Generalization and Instruction-Following in LLMs

LLMs have achieved significant advances in generative commonsense reasoning tasks, demonstrating competence in composing coherent sentences that incorporate all provided concepts. However, when tasked with adhering to a concept order specified in the prompt, LLMs exhibit notable limitations. Addressing this gap, the paper "Revisiting Compositional Generalization Capability of Large Language Models Considering Instruction Following Ability" by Sakai et al. introduces Ordered CommonGen, a benchmark specifically designed to probe and evaluate the dual capabilities of compositional generalization and instruction following in LLMs.

Ordered CommonGen: A New Benchmark

Ordered CommonGen builds on the traditional CommonGen framework, which evaluates the ability of LLMs to generate sentences inclusive of provided concepts. However, Ordered CommonGen distinguishes itself by introducing a requirement for LLMs to generate sentences that maintain a specified order of concepts. This adjustment allows for concurrent evaluation of both compositional generalization and instruction-following. The benchmark tests the ordered coverage of LLM outputs—essentially measuring each model's ability to strictly adhere to the instructed concept sequence.
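As a rough illustration of the ordered-coverage idea, the sketch below checks whether the given concepts appear in a generated sentence in the instructed order. This is a minimal sketch under simplifying assumptions (whole-word matching on surface forms, with a hypothetical `ordered_coverage` helper); the paper's actual metric may differ, for example by matching lemmas so that inflected forms count as coverage.

```python
import re
from typing import List

def ordered_coverage(sentence: str, concepts: List[str]) -> float:
    """Fraction of concepts that occur in the sentence in the instructed order.

    Hypothetical simplification: concepts are matched with whole-word regex
    search on the surface form; a real implementation might match lemmas so
    that inflected forms (e.g. "throws", "threw") also count.
    """
    lowered = sentence.lower()
    cursor = 0
    covered = 0
    for concept in concepts:
        match = re.search(rf"\b{re.escape(concept.lower())}\b", lowered[cursor:])
        if match is None:
            continue  # concept missing, or it only appears before the current cursor
        covered += 1
        cursor += match.end()  # later concepts must appear after this position
    return covered / len(concepts)

# "dog" precedes "ball" as instructed, but "throw" never appears: coverage 2/3.
print(ordered_coverage("The dog chased the ball across the yard.", ["dog", "throw", "ball"]))
```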

Key Findings and Results

Using this framework, the authors analyzed 36 LLMs. The findings reveal a crucial insight: while LLMs largely comprehend the intent behind order-specific instructions, they often default to outputs biased toward common concept-order patterns, which reduces diversity and can even yield identical sentences when the instructed order is changed. Moreover, even the most instruction-compliant model achieved only about 75% ordered coverage, indicating significant room for improvement.

Quantitatively, the Ordered CommonGen task forces LLMs to diverge from the natural orderings habituated through pretraining, which elevates sentence diversity while exposing gaps in instruction compliance. The paper also confirms that LLMs, although capable of composing sentences that include all concepts, struggle to maintain the prescribed order, and this deficiency is exacerbated for more intricate, verb-only structures, a pattern consistent with known challenges in semantic processing.
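One way to probe the diversity effect described above is to prompt a model with every permutation of a concept set and count how many distinct sentences come back; a count of 1 means the instructed order is being ignored entirely. The sketch below is illustrative only: `generate_fn` is a hypothetical callable standing in for any prompting backend, and the paper's actual diversity measures likely differ.

```python
from itertools import permutations
from typing import Callable, Dict, Sequence, Tuple

def order_sensitivity(
    concepts: Sequence[str],
    generate_fn: Callable[[Tuple[str, ...]], str],
) -> Tuple[int, Dict[Tuple[str, ...], str]]:
    """Prompt a model once per concept ordering and count distinct outputs.

    `generate_fn` is a hypothetical stand-in for the actual prompting code:
    it receives one ordered concept tuple and returns the generated sentence.
    """
    outputs = {order: generate_fn(order) for order in permutations(concepts)}
    distinct = len(set(outputs.values()))  # 1 => order-invariant, low diversity
    return distinct, outputs

# Usage sketch:
# distinct, outputs = order_sensitivity(("dog", "throw", "ball"), my_generate_fn)
# print(f"{distinct} distinct sentences across {len(outputs)} orderings")
```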

Implications and Future Directions

The implications of these findings extend across many applications of LLMs. In practical contexts that demand chronological or semantically precise ordering (e.g., event narration, recipe formulation, legal documentation), strong instruction-following capabilities are critical. On the theoretical side, this research underscores a gap in current LLM architectures and training, highlighting areas ripe for improvement through targeted model refinements and training paradigms.

Future research might develop training regimes that incorporate structured or sequential tasks, which could bolster LLMs' adherence to explicit instructions. Further exploration of architecture-specific biases toward particular concept orderings could also clarify the underlying computational factors that shape instruction compliance.

In conclusion, this paper contributes valuable insight into how LLMs perform on non-trivial, instruction-focused tasks. By systematically analyzing compositional generalization and instruction following within a controlled benchmark, it opens pathways to more robust real-world applications of LLMs and sets a precedent for continued exploration of systematic concept ordering in NLP models.