KOR-Bench: Measuring Rule Generalization
- KOR-Bench is a benchmark suite that isolates language models' intrinsic reasoning by using tasks crafted to minimize pretraining data influence.
- It categorizes tasks into operation, logic, cipher, puzzle, and counterfactual to rigorously evaluate rule induction and dynamic generalization.
- The evaluation employs stepwise prompting and ablation studies to identify performance gaps and guide future multimodal extensions.
KOR-Bench is a benchmark suite designed to evaluate LLMs' reasoning abilities in contexts deliberately constructed to be independent of domain-specific pretraining knowledge. Its primary goal is to minimize reliance on memorized or training data–derived knowledge so that reasoning capacity can be measured in out-of-distribution settings. KOR-Bench introduces a collection of tasks that require applying newly described rules to solve problems, emphasizing dynamic rule generalization and adaptation rather than recall. Its task design, evaluation methodology, and adoption of knowledge orthogonality mark a notable methodological shift in the assessment of advanced LLMs.
1. Conceptual Basis: Knowledge-Orthogonal Reasoning
Knowledge-Orthogonal Reasoning (KOR) is an evaluation paradigm designed to isolate and scrutinize the intrinsic reasoning faculties of LLMs. It addresses the confounding effect wherein large pretrained models perform well on traditional reasoning benchmarks by leveraging prior exposure to similar tasks or facts during pretraining. In KOR, tasks are constructed such that success cannot be attributed to memorized patterns; instead, inference is possible only through the correct, on-the-fly interpretation and synthesis of the explicitly described problem rules. This paradigm is considered essential for out-of-distribution testing, providing a more accurate representation of a model's true reasoning generalization.
2. Task Categories and Structure
KOR-Bench comprises five primary task categories, each crafted to evaluate a distinct facet of reasoning while controlling for knowledge exposure:
Task Category | Evaluated Aspect | Example Challenge
---|---|---
Operation | Mathematical rule execution | Apply newly defined symbolic operations
Logic | Logical abstraction | Formulate expressions using newly introduced logical symbols
Cipher | Encoding/decoding procedures | Encrypt or decrypt text with dynamically specified ciphers
Puzzle | Constraint-based problem solving | Solve puzzles under an explicitly stated rule framework (e.g., Sudoku-like grids)
Counterfactual | Hypothetical/fictive contexts | Infer outcomes in alternate or fictionalized scenarios
Each task presents LLMs with novel rule descriptions and problem statements, explicitly requiring them to infer, execute, and generalize previously unseen operations or contexts.
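As a concrete illustration, the sketch below shows how an Operation-style item could be constructed and checked programmatically. The rule, numbers, and helper names are invented for illustration and are not drawn from the KOR-Bench dataset; the point is that the ground truth follows only from the newly stated rule, not from pretrained knowledge.

```python
# Minimal sketch of a knowledge-orthogonal "Operation" item.
# The rule below is illustrative, not taken from the KOR-Bench dataset.

def novel_op(a: int, b: int) -> int:
    """Hypothetical newly defined operation: a ◇ b = a * b + a - b."""
    return a * b + a - b

rule_text = (
    "Define a new operation ◇ such that a ◇ b = a * b + a - b. "
    "Compute 4 ◇ 7."
)
ground_truth = novel_op(4, 7)  # 4*7 + 4 - 7 = 25

def score(model_answer: str) -> bool:
    """Return True if the model's final answer matches the rule-derived result."""
    try:
        return int(model_answer.strip()) == ground_truth
    except ValueError:
        return False

print(rule_text)
print("Expected:", ground_truth)   # 25
print("Correct?", score("25"))     # True
```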
3. Evaluation Methodology and Metrics
KOR-Bench assesses models under rigorous conditions designed to prevent information leakage from pretraining. The primary metric is task accuracy, reflecting the proportion of correctly solved instances under the newly specified rules.
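A minimal sketch of this scoring scheme is shown below, assuming simple exact-match comparison between a model's final answer and the rule-derived reference; KOR-Bench's actual answer-extraction pipeline may be more elaborate.

```python
def exact_match_accuracy(predictions: list[str], references: list[str]) -> float:
    """Fraction of instances whose model answer exactly matches the rule-derived reference."""
    assert len(predictions) == len(references)
    correct = sum(
        pred.strip().lower() == ref.strip().lower()
        for pred, ref in zip(predictions, references)
    )
    return correct / len(references)
```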
Experiments demonstrate that reasoning-specialized models, O1-Preview (72.88% accuracy) and O1-Mini (70.16%), substantially outperform strong general-purpose LLMs such as Claude-3.5-Sonnet (58.96%) and GPT-4o (58.00%). This gap highlights KOR-Bench's effectiveness in differentiating genuine reasoning adaptability from pattern-matching proficiency.
Further analyses include:
- Bottleneck identification: Cipher tasks, especially those requiring spatial encoding or indexing, show the greatest performance gaps.
- Sample complexity: Zero-shot settings yield pronounced error rates, whereas providing a three-shot context consisting only of example questions markedly improves reasoning accuracy (see the prompt-construction sketch after this list).
- Ablation studies: Downsampling the dataset or modifying prompt complexity helps trace the limits of rule induction capacities.
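The contrast between these settings can be made concrete with a small prompt-construction helper. The template below is an illustrative assumption, not KOR-Bench's official prompt format; it contrasts a zero-shot prompt with a three-shot prompt that supplies example questions only.

```python
# Sketch of zero-shot vs. question-only few-shot prompt construction.
# Prompt wording and the example rule are illustrative assumptions.

def build_prompt(rule: str, question: str, example_questions: list[str] | None = None) -> str:
    """Assemble a prompt from a rule description, optional example questions, and the target question."""
    parts = [f"Rule:\n{rule}\n"]
    if example_questions:  # few-shot context: example questions only, no answers
        parts.append("Example questions:")
        parts.extend(f"- {q}" for q in example_questions)
        parts.append("")
    parts.append(f"Question:\n{question}\nAnswer:")
    return "\n".join(parts)

rule = "Define a ◇ b = a * b + a - b."
zero_shot = build_prompt(rule, "Compute 4 ◇ 7.")
three_shot = build_prompt(
    rule,
    "Compute 4 ◇ 7.",
    example_questions=["Compute 2 ◇ 3.", "Compute 5 ◇ 1.", "Compute 6 ◇ 6."],
)
```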
4. Analysis of Reasoning Strategies
KOR-Bench integrates advanced experimental methodologies for dissecting model reasoning:
- Stepwise Prompting: Tasks are decomposed into sub-steps, enabling granular observation of intermediate inference and error correction. In Cipher tasks, iterative “self-correction” via two rounds of stepwise prompting yields the best performance, demonstrating the utility of explicit scaffolding in complex rule applications (a minimal two-round loop is sketched after this list).
- Trick Injection in Puzzles: Employing strategic guidance ("Tricks") at critical junctures in puzzle tasks can simplify solution trajectories and improve end-to-end task accuracy.
- Rule-Focused Attention Visualization: Attention mapping across input rules vs. other components reveals which task elements attract the model’s primary focus during reasoning, elucidating the sources of correct and erroneous solutions.
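The stepwise self-correction procedure for Cipher tasks can be sketched as a simple loop. The function `query_model` below is a hypothetical stand-in for any chat-completion call, and the prompt wording is an assumption rather than the paper's exact template.

```python
# Sketch of two-round stepwise prompting with self-correction for a Cipher task.
# `query_model` is a hypothetical stand-in for any chat-completion call.

from typing import Callable

def stepwise_self_correct(
    query_model: Callable[[str], str],
    rule: str,
    ciphertext: str,
    rounds: int = 2,
) -> str:
    """Decompose decryption into sub-steps, then ask the model to re-check its own work."""
    prompt = (
        f"Cipher rule:\n{rule}\n\n"
        f"Decrypt step by step, showing each intermediate transformation:\n{ciphertext}"
    )
    answer = query_model(prompt)
    for _ in range(rounds - 1):  # additional self-correction round(s)
        critique_prompt = (
            f"Cipher rule:\n{rule}\n\n"
            f"Ciphertext: {ciphertext}\n"
            f"Proposed step-by-step solution:\n{answer}\n\n"
            "Re-apply the rule to each step, fix any errors, and give the corrected plaintext."
        )
        answer = query_model(critique_prompt)
    return answer
```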
5. Correlation, Dataset Composition, and Scaling
Correlation analyses show that variation in dataset size and question complexity directly impacts benchmark outcomes. Tasks with more intricate rule definitions or larger input spaces present steeper learning curves, emphasizing the limits of sample efficiency and adaptation in LLMs. Zero-shot vs. few-shot experiments illustrate that models benefit disproportionately from even sparse contextual exposures, suggesting strong reliance on example-based abstraction rather than true zero-shot reasoning.
Dataset ablations and cross-benchmark correlation analyses further indicate that performance on KOR-Bench aligns only minimally with benchmarks that share overlapping knowledge components, supporting its orthogonality with respect to pretraining content.
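Such a cross-benchmark check reduces to a simple correlation computation over per-model scores. The scores below are placeholder values for illustration only, not reported results.

```python
# Sketch of a cross-benchmark correlation check using placeholder scores.
# The numbers below are illustrative only, not reported results.

from statistics import correlation  # Python 3.10+

models = ["model_a", "model_b", "model_c", "model_d"]
kor_bench_accuracy = [0.58, 0.59, 0.70, 0.73]   # hypothetical KOR-Bench scores
knowledge_heavy    = [0.85, 0.83, 0.81, 0.79]   # hypothetical knowledge-heavy benchmark scores

r = correlation(kor_bench_accuracy, knowledge_heavy)
print(f"Pearson r across {len(models)} models: {r:.2f}")
# A weak (or negative) r indicates that KOR-Bench ranks models differently
# from knowledge-dependent benchmarks, consistent with knowledge orthogonality.
```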
6. Prospects for Multimodal and Advanced Extensions
The architectural framework of KOR-Bench allows for continuous dataset augmentation, parameterized variations of rule structures, and the integration of visual domains. The paper explicitly advocates for future development of multimodal versions—enabling simultaneous evaluation of rule induction and reasoning across modalities—in order to holistically assess next-generation LLM reasoning capabilities.
KOR-Bench thereby serves as both a robust instrument for dissecting reasoning skill independent of knowledge, and a methodological foundation for probing the boundaries of dynamic rule generalization in artificial intelligence research.