
KOR-Bench: Measuring Rule Generalization

Updated 28 July 2025
  • KOR-Bench is a benchmark suite that isolates language models' intrinsic reasoning by using tasks crafted to minimize pretraining data influence.
  • It categorizes tasks into operation, logic, cipher, puzzle, and counterfactual to rigorously evaluate rule induction and dynamic generalization.
  • The evaluation employs stepwise prompting and ablation studies to identify performance gaps and guide future multimodal extensions.

KOR-Bench is a benchmark suite explicitly designed to evaluate LLMs' reasoning abilities in contexts deliberately constructed to be independent of domain-specific pretraining knowledge. The primary goal is to systematically minimize models' reliance on memorized or training data–derived knowledge in order to measure reasoning capacity in out-of-distribution settings. KOR-Bench introduces a collection of tasks that require applying newly described rules to solve problems, emphasizing dynamic rule generalization and adaptation rather than recall. Its task design, evaluation protocol, and emphasis on knowledge orthogonality mark a notable shift in how advanced LLMs are assessed.

1. Conceptual Basis: Knowledge-Orthogonal Reasoning

Knowledge-Orthogonal Reasoning (KOR) is an evaluation paradigm designed to isolate and scrutinize the intrinsic reasoning faculties of LLMs. It addresses the confounding effect wherein large pretrained models perform well on traditional reasoning benchmarks by leveraging prior exposure to similar tasks or facts during pretraining. In KOR, tasks are constructed such that success cannot be attributed to memorized patterns; instead, inference is possible only through the correct, on-the-fly interpretation and synthesis of the explicitly described problem rules. This paradigm is considered essential for out-of-distribution testing, providing a more accurate representation of a model's true reasoning generalization.

2. Task Categories and Structure

KOR-Bench comprises five primary task categories, each crafted to evaluate a distinct facet of reasoning while controlling for knowledge exposure:

Task Category  | Evaluated Aspect                 | Example Challenge
Operation      | Mathematical rule execution      | Apply user-defined operations such as a △ b = a^b
Logic          | Logical abstraction              | Formulate expressions using new logical symbols
Cipher         | Encoding/decoding procedure      | Encrypt/decrypt with dynamically specified ciphers
Puzzle         | Constraint-based problem solving | Solve puzzles under an explicit framework (e.g., Sudoku-like)
Counterfactual | Hypothetical/fictive contexts    | Infer outcomes in alternate/fictionalized scenarios

Each task presents LLMs with novel rule descriptions and problem statements, explicitly requiring them to infer, execute, and generalize previously unseen operations or contexts.
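
To make the task format concrete, the following is a minimal sketch of an operation-style item in the spirit of KOR-Bench; the rule text, prompt wording, and reference solver are illustrative assumptions rather than actual benchmark data.

```python
# Illustrative sketch of a KOR-Bench-style operation task (not actual benchmark data).
# A newly defined rule is stated in the prompt; solving it requires applying that rule
# on the fly rather than recalling anything from pretraining.

RULE = "Define a new operation: a △ b = a ** b (a raised to the power b)."
QUESTION = "Compute 2 △ (3 △ 2)."

def reference_solver() -> int:
    """Ground-truth answer obtained by executing the stated rule directly."""
    tri = lambda a, b: a ** b
    return tri(2, tri(3, 2))  # 3 △ 2 = 9, then 2 △ 9 = 512

prompt = f"{RULE}\n\nQuestion: {QUESTION}\nAnswer with a single number."
print(prompt)
print("Expected answer:", reference_solver())  # 512
```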

3. Evaluation Methodology and Metrics

KOR-Bench assesses models under rigorous conditions designed to prevent information leakage from pretraining. The primary metric is task accuracy, reflecting the proportion of correctly solved instances under the newly specified rules.
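
Because the headline metric is plain task accuracy, scoring amounts to exact-match counting over instances. The snippet below is a generic sketch; the record fields `prediction` and `answer` are assumed names, not the benchmark's released schema.

```python
from typing import Iterable, Mapping

def task_accuracy(records: Iterable[Mapping[str, str]]) -> float:
    """Fraction of instances whose normalized prediction exactly matches the reference answer."""
    norm = lambda s: s.strip().lower()
    records = list(records)
    if not records:
        return 0.0
    correct = sum(norm(r["prediction"]) == norm(r["answer"]) for r in records)
    return correct / len(records)

# Per-category accuracies can then be averaged into an overall benchmark score.
demo = [{"prediction": "512", "answer": "512"}, {"prediction": "9", "answer": "512"}]
print(task_accuracy(demo))  # 0.5
```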

Experiments demonstrate that specialized reasoning models such as O1-Preview (72.88% accuracy) and O1-Mini (70.16%) substantially outperform strong general LLMs such as Claude-3.5-Sonnet (58.96%) and GPT-4o (58.00%). This gap highlights KOR-Bench's effectiveness in distinguishing genuine reasoning adaptability from pattern-matching proficiency.

Further analyses include:

  • Bottleneck identification: Cipher tasks, especially those requiring spatial encoding or indexing, show the greatest performance gaps.
  • Sample complexity: Zero-shot settings produce pronounced error rates, whereas supplying a three-shot context consisting only of example questions markedly improves accuracy (see the prompt-assembly sketch after this list).
  • Ablation studies: Downsampling the dataset or modifying prompt complexity helps trace the limits of rule induction capacities.
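
The zero-shot vs. three-shot contrast above comes down to how much context precedes the target question. The sketch below assumes the few-shot variant prepends example questions without their answers; the benchmark's exact prompt templates may differ.

```python
def build_prompt(rule: str, question: str, example_questions: list[str] | None = None) -> str:
    """Assemble a zero-shot prompt, or a few-shot variant that shows example questions only."""
    parts = [rule]
    for i, ex in enumerate(example_questions or [], start=1):
        parts.append(f"Example question {i}: {ex}")  # no answers are revealed
    parts.append(f"Question: {question}")
    return "\n\n".join(parts)

rule = "Define a new operation: a △ b = a ** b."
zero_shot = build_prompt(rule, "Compute 2 △ 3.")
three_shot = build_prompt(rule, "Compute 2 △ 3.",
                          ["Compute 3 △ 2.", "Compute 4 △ 1.", "Compute 2 △ 5."])
print(zero_shot)
print("---")
print(three_shot)
```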

4. Analysis of Reasoning Strategies

KOR-Bench integrates advanced experimental methodologies for dissecting model reasoning:

  • Stepwise Prompting: Tasks are decomposed into sub-steps, enabling granular observation of intermediate inference and error correction. In Cipher tasks, iterative “self-correction” via two rounds of stepwise prompting yields the best performance, demonstrating the value of explicit scaffolding for complex rule application (see the loop sketch after this list).
  • Trick Injection in Puzzles: Employing strategic guidance ("Tricks") at critical junctures in puzzle tasks can simplify solution trajectories and improve end-to-end task accuracy.
  • Rule-Focused Attention Visualization: Attention mapping across input rules vs. other components reveals which task elements attract the model’s primary focus during reasoning, elucidating the sources of correct and erroneous solutions.
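
As a rough illustration of the two-round stepwise self-correction used for Cipher tasks, the loop below asks a model to decode and then re-check its own output against the stated rule. The `llm` callable and prompt wording are assumptions, not the paper's exact scaffolding.

```python
from typing import Callable

def stepwise_self_correct(llm: Callable[[str], str], rule: str, ciphertext: str, rounds: int = 2) -> str:
    """Decode step by step, then ask the model to review and revise its own output."""
    answer = llm(f"{rule}\n\nDecrypt step by step, then state the plaintext:\n{ciphertext}")
    for _ in range(rounds - 1):
        critique_prompt = (
            f"{rule}\n\nCiphertext: {ciphertext}\n"
            f"Proposed decryption: {answer}\n"
            "Re-check each step against the rule and output a corrected plaintext."
        )
        answer = llm(critique_prompt)
    return answer
```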

5. Correlation, Dataset Composition, and Scaling

Correlation analyses show that variation in dataset size and question complexity directly impacts benchmark outcomes. Tasks with more intricate rule definitions or larger input spaces present steeper learning curves, emphasizing the limits of sample efficiency and adaptation in LLMs. Zero-shot vs. few-shot experiments illustrate that models benefit disproportionately from even sparse contextual exposures, suggesting strong reliance on example-based abstraction rather than true zero-shot reasoning.

Ablations across datasets, together with cross-benchmark correlation analyses, further indicate that performance on KOR-Bench aligns only weakly with benchmarks that have overlapping knowledge components, supporting its orthogonality with respect to pretraining content.
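
Such orthogonality claims are typically supported by correlating per-model scores on KOR-Bench with scores on knowledge-heavy benchmarks; a near-zero coefficient indicates the two reward different capabilities. The sketch below uses placeholder numbers, not reported results.

```python
from statistics import correlation  # Pearson correlation coefficient, Python 3.10+

# Hypothetical per-model scores (placeholders, not reported results).
kor_bench_scores = [72.0, 65.0, 58.0, 50.0]
knowledge_heavy_scores = [80.0, 84.0, 82.0, 86.0]

# A coefficient near zero would suggest KOR-Bench measures something other than
# the knowledge those benchmarks capture.
print(round(correlation(kor_bench_scores, knowledge_heavy_scores), 3))
```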

6. Prospects for Multimodal and Advanced Extensions

The architectural framework of KOR-Bench allows for continuous dataset augmentation, parameterized variations of rule structures, and the integration of visual domains. The paper explicitly advocates for future development of multimodal versions—enabling simultaneous evaluation of rule induction and reasoning across modalities—in order to holistically assess next-generation LLM reasoning capabilities.

KOR-Bench thereby serves as both a robust instrument for dissecting reasoning skill independent of knowledge, and a methodological foundation for probing the boundaries of dynamic rule generalization in artificial intelligence research.