
Open Taxonomy for LRM Reasoning

Updated 1 October 2025
  • LOT is an inductive, language-model-guided framework that dynamically generates human-readable taxonomies characterizing LRM reasoning patterns.
  • It employs iterative feature induction and encoding with logistic regression to distinguish reasoning traces with high classification accuracy.
  • The framework enhances model interpretability and guides performance improvements by revealing nuanced, actionable reasoning differences.

The LLM-proposed Open Taxonomy (LOT) is an inductive, language-model-guided framework for generating data-driven, human-interpretable taxonomies that characterize the reasoning patterns of large reasoning models (LRMs). Distinct from static, predefined taxonomies, LOT leverages a generative LLM to iteratively extract, refine, and formalize discriminative features from the natural-language reasoning traces produced by different LRMs on common tasks. The resulting taxonomy is thus “open”: its feature vocabulary is discovered dynamically and is tailored to reveal how various models “think” differently. LOT simultaneously yields a statistical classifier that can distinguish the origin of a reasoning trace with high accuracy and a natural-language explanation that offers granular insights into model-specific reasoning behaviors (Chen et al., 29 Sep 2025).

1. Rationale and Definition

LOT arises from the need to move beyond macro-level accuracy metrics when comparing LRMs and instead characterize the qualitative nature of model reasoning. Rather than imposing a fixed set of behavioral categories, LOT constructs an empirical taxonomy induced from the reasoning outputs themselves via comparison and annotation by an LLM. The process is inductive and iterative: features are repeatedly proposed based on pairwise model trace comparisons, validated through encoding and classification, and expanded until no further discriminative value is gained. The open taxonomy is therefore both a set of natural-language feature definitions and an empirical distribution describing their prevalence across different LRMs.

The underlying motivation is to determine whether and how LRMs differ in reasoning style, to what extent these differences account for performance discrepancies, and to produce actionable, human-readable taxonomic characterizations.

2. Core Methodology

The LOT pipeline comprises three tightly integrated stages:

  1. Feature Induction via LLM Annotation:
    • Given paired reasoning traces (a, b) from two LRMs (A, B) on the same input (e.g., a math or science problem), an LLM is prompted to compare the traces and propose distinguishing features, each described in natural language (e.g., “verifies solution by re-evaluating constraints” or “repeats the same information without progress”).
    • Features are worded as Boolean or countable properties that can be automatically detected in new traces.
  2. Feature Encoding and Iterative Taxonomy Refinement:
    • Each reasoning trace is encoded as a vector according to the presence (binary; PoR, “Presence of Reasoning”) or frequency (count; BoR, “Bag of Reasoning”) of each feature in the current feature set; a minimal encoding-and-classification sketch follows this list.
    • A logistic regression classifier φ: x → y predicts the source model for each trace based on these features.
    • Whenever a trace is misclassified, the LLM is prompted with that example to induce novel features that better separate the models. Features are incrementally added and the classifier is retrained until the taxonomy converges.
  3. Converged Taxonomy and Model Attribution:
    • The final taxonomy is a set of linguistic features, formalized as an empirical mapping from features to model origins. The classifier achieves high discrimination accuracy (80–100% in the paper’s experiments) when features are sufficiently expressive.
    • The taxonomy is both a diagnostic (classifying traces by origin) and a descriptive (explaining model differences in human terms) tool.
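
As a concrete illustration of the encoding and attribution steps, below is a minimal sketch in which keyword matching stands in for the paper’s LLM-based feature detection; the feature names, detectors, and traces are invented for illustration:

# Minimal sketch of the PoR/BoR encoding and the classifier φ.
# Keyword matching stands in for LOT's LLM-based feature detection.
from sklearn.linear_model import LogisticRegression

FEATURES = {  # hypothetical feature names with toy detectors (trace -> count)
    "verifies_solution":  lambda t: t.count("let me verify"),
    "simulates_code":     lambda t: t.count("running this code"),
    "circular_reasoning": lambda t: t.count("as stated before"),
}

def encode(trace, bag_of_reasoning=False):
    """BoR: per-feature counts; PoR: binary presence indicators."""
    counts = [detect(trace.lower()) for detect in FEATURES.values()]
    return counts if bag_of_reasoning else [int(c > 0) for c in counts]

# Toy traces from two models, A and B (invented).
traces = ["Let me verify the boundary condition once more.",
          "As stated before, as stated before, the answer is 4."]
labels = ["A", "B"]  # source model of each trace

phi = LogisticRegression().fit([encode(t) for t in traces], labels)
print(phi.predict([encode("Let me verify this step.")]))  # expected: ['A']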

A simplified pseudocode, following Algorithm 1 of (Chen et al., 29 Sep 2025), is:

for (a, b) in pairs of traces:
    features = LLM_annotator.compare(a, b)
    assign features to traces
    train classifier φ(features) → {A, B}
    if misclassification:
        update features using misclassified examples
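
For readers who want to trace the control flow, here is a runnable rendering of the loop above; llm_propose_features, encode, and the patience parameter are hypothetical stand-ins for the paper’s LLM annotator, PoR/BoR encoder, and convergence criterion, not its actual interfaces:

# Runnable rendering of Algorithm 1; `llm_propose_features` and `encode`
# are hypothetical stand-ins, not the paper's actual interfaces.
from sklearn.linear_model import LogisticRegression

def lot_induction(paired_traces, labels, llm_propose_features, encode, patience=3):
    """Grow the feature set until `patience` consecutive rounds yield no update."""
    # `labels` lists the source model of each trace in flattened (a, b, a, b, ...) order.
    features, stale, clf = [], 0, None
    for (a, b) in paired_traces:
        features += llm_propose_features(a, b, features)        # propose candidate features
        X = [encode(t, features) for pair in paired_traces for t in pair]
        clf = LogisticRegression(max_iter=1000).fit(X, labels)  # φ: x -> {A, B}
        wrong = [x for x, y, p in zip(X, labels, clf.predict(X)) if y != p]
        if wrong:
            features += llm_propose_features(a, b, features, errors=wrong)
            stale = 0                                           # taxonomy updated this round
        else:
            stale += 1                                          # no update: counts toward convergence
        if stale >= patience:
            break
    return features, clf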

3. Taxonomy Structure and Representation

LOT’s taxonomy is not a static tree or fixed set of dimensions but, rather, a fluid set of linguistic feature definitions reflecting the empirical discriminability of model reasoning. Key elements include:

  • Feature Set: Each feature is a concise natural-language description (e.g., “verifies methodology’s applicability,” “relies on code simulation,” “engages in circular reasoning”).
  • Encoding: Features are encoded per trace as binary indicators (PoR) or as integer counts (BoR), forming a high-dimensional representation suitable for classification.
  • Distributional Modeling: By analyzing the empirical frequency of features across model outputs, LOT links distinctive traits (e.g., frequent verification steps, tendency to write code, repeated evaluations) with particular LRMs or training paradigms; a toy example follows below.
  • Human Interpretation: The taxonomy provides direct, interpretable rationales for classification decisions, mapping statistical separability onto linguistically meaningful differences.

The taxonomy construction halts after a fixed number of non-updating iterations or upon reaching a maximum sample size, yielding a robust, converged taxonomy.
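
To make the distributional-modeling element concrete, per-model feature prevalence can be read directly off the PoR matrices; the feature names and matrices below are invented toy data, not the paper’s measurements:

# Toy illustration of distributional modeling over PoR matrices.
import numpy as np

feature_names = ["verifies constraints", "simulates code", "circular reasoning"]
por_A = np.array([[1, 0, 0], [1, 1, 0], [1, 0, 0]])  # traces from model A
por_B = np.array([[0, 0, 1], [0, 1, 1], [0, 0, 1]])  # traces from model B

for name, pa, pb in zip(feature_names, por_A.mean(axis=0), por_B.mean(axis=0)):
    print(f"{name:20s} A={pa:.2f}  B={pb:.2f}  gap={pa - pb:+.2f}")
# Features with large |gap| are the ones that separate the two models.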

4. Empirical Application and Results

LOT was systematically evaluated across math, science, and code domains using reasoning traces from 12 open-source LRMs, including models of varying scale, architecture, and pretraining/fine-tuning recipes:

  • Task Domains: MATH-500, AIME-24/25 (mathematics), GPQA-Diamond (graduate-level science), CRUXEval and LiveCodeBench (coding/execution).
  • Comparisons: Pairwise model taxonomy construction allowed binary classification of reasoning traces by source model with 80–100% accuracy for model pairs differing by scale, base family, or domain specialization.
  • Qualitative Features: For example, Qwen3-32B was found to “verify solution constraints,” while smaller Qwen3 variants fell into “circular reasoning patterns”; Qwen3-14B focused on “simulating code execution,” while others emphasized different facets; Seed-Coder-8B-Reasoning produced “coding-style solutions” even for non-code problems, reflecting fine-tuning inertia.
  • Case Study (Intervention): By aligning the reasoning styles of smaller Qwen3 models toward those of Qwen3-32B (through editing summaries and re-expanding traces), accuracy on GPQA improved by 3.3–5.7%. Odds ratios of feature occurrence against task outcome further quantified the contribution of the identified features.
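
The odds ratio mentioned above is the standard 2×2 quantity; a minimal sketch with invented counts (the paper does not spell out this computation in the text summarized here):

# Odds ratio of feature occurrence vs. answer correctness (invented counts).
#                  correct  incorrect
# feature present     a        b
# feature absent      c        d
a, b, c, d = 30, 10, 15, 25
odds_ratio = (a / b) / (c / d)   # (30/10) / (15/25) = 5.0
print(f"odds ratio = {odds_ratio:.1f}")  # OR > 1: feature co-occurs with correct answers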

5. Interpretive and Practical Significance

LOT demonstrates:

  • Human-Readable Model Comparisons: It reveals that LRMs can be robustly distinguished, not merely by output accuracy, but by the structure and nature of their reasoning as captured by a learned, open feature set.
  • Model Improvement Pathways: By identifying which reasoning behaviors correlate with high performance, practitioners can intervene at the level of reasoning processes (e.g., emphasizing “verification steps” or discouraging “uncommitted repetition”) in instruction tuning or post-processing pipelines.
  • Complement to Quantitative Metrics: Whereas aggregate scores conceal model idiosyncrasies, LOT’s taxonomy uncovers fine-grained, actionable reasoning differences.
  • Inductive and Extensible Design: The LLM-driven, data-guided feature discovery process accommodates new models, domains, and evolving behaviors, sidestepping the limitations of manual, expert-imposed taxonomies.

6. Limitations and Future Research

The paper identifies limitations and promising directions:

  • Causal Attribution: While feature occurrence correlates with performance, further work is needed to causally connect reasoning traits with model success or failure.
  • Taxonomy Stability: Despite cross-seed robustness, variations in induced taxonomies suggest that model training conditions and sampling strategies may influence discovered features.
  • Instruction Tuning with Taxonomic Guidance: Integrating LOT-derived features into training objectives may directly shape how models are prompted or fine-tuned for desired reasoning styles.
  • Data Selection and Benchmark Design: LOT features could inform new benchmark curation, evaluating models not just by solutions but by process.

A plausible implication is that widespread adoption of LOT-style analysis could standardize the reporting of LRM reasoning characteristics, enriching both model interpretability and the design of next-generation instruction-following systems.

7. Summary Table: Key Features of LOT

Aspect             | Description                                                         | Example from Data
-------------------|---------------------------------------------------------------------|-------------------------------------------------------
Inductive Process  | Features induced iteratively via LLM annotation                     | “verifies applicability” vs. “circular reasoning”
Encoding           | Binary/count feature vectors (PoR/BoR) per trace                    | Trace encoded as [1,0,0,1,...] or [2,1,0,...]
Model Assignment   | Logistic regression discriminates origin from feature distribution | 80–100% accuracy between open-source LRMs
Intervention       | Reasoning-editing pipeline improves model performance               | GPQA accuracy gain of 3.3–5.7% on Qwen3 test case
Human Readability  | Features and taxonomy interpretable and linguistically explicit    | “simulates code execution,” “recalls task constraints”
Extensibility      | Taxonomy expands with more data or new model pairs                  | Features updated until taxonomy convergence

The LOT framework constitutes a principled, empirically validated methodology for extracting, modeling, and interpreting reasoning differences across LRMs, providing both a technical diagnostic tool and an interpretive lens on machine reasoning at scale (Chen et al., 29 Sep 2025).
