Cattell-Horn-Carroll (CHC) Theory Overview
- CHC theory is a hierarchical model that defines intelligence with general, broad, and approximately 70 narrow cognitive abilities.
- It underpins psychometric assessments by mapping broad abilities like fluid reasoning and crystallized knowledge to specific test items.
- Empirical studies adapting CHC protocols for LLMs reveal paradoxes in construct validity, challenging traditional intelligence measurement.
The Cattell-Horn-Carroll (CHC) theory provides a hierarchical model of cognitive abilities, with general intelligence (“g”) at its apex, a layer of broad cognitive abilities beneath, and approximately seventy narrow abilities at the base. This structure has become foundational within psychometrics for the design, analysis, and interpretation of human intelligence assessments. Recent inquiries into the empirical compatibility of CHC frameworks with the evaluation of LLMs reveal paradoxes that destabilize traditional cross-domain measurement. Systematic analyses of state-of-the-art LLMs using CHC-based psychometric protocols demonstrate critical mismatches in construct validity, challenging anthropomorphic assumptions about algorithmic and biological cognition (Reddy, 23 Nov 2025).
1. The Hierarchical Structure of the CHC Theory
The CHC theory organizes cognitive abilities into three strata:
- Stratum III (“g”): General intelligence factor, historically inferred as the source of shared variance among cognitive tests.
- Stratum II (Broad Abilities): Nine broad cognitive abilities, including:
  - Gf (Fluid Reasoning)
  - Gc (Crystallized Knowledge)
  - Gq (Quantitative Reasoning)
  - Grw (Reading/Writing Ability)
  - Additionally: Processing Speed, Visual-Spatial Ability, Short-Term Memory, Long-Term Storage and Retrieval, Auditory Processing
- Stratum I (Narrow Abilities): Approximately 70 distinct, more specialized abilities (e.g., inductive reasoning, vocabulary depth).
CHC-based test development decomposes cognitive constructs along these lines, with broad abilities assessed using diverse item types and narrow abilities sampled for domain coverage.
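The three strata map naturally onto a small nested data structure. The sketch below is illustrative only: the class names `BroadAbility` and `CHCModel` are hypothetical, the codes `Gs`, `Gv`, `Gsm`, `Glr`, and `Ga` are the conventional CHC abbreviations rather than labels taken from this text, and attaching the two narrow-ability examples to Gf and Gc is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class BroadAbility:
    """A Stratum II ability with a sample of its Stratum I (narrow) abilities."""
    code: str
    name: str
    narrow: list = field(default_factory=list)


@dataclass
class CHCModel:
    """Stratum III ("g") sitting above the broad abilities."""
    general_factor: str
    broad: list

    def find(self, code: str) -> BroadAbility:
        """Return the broad ability with the given code, or raise KeyError."""
        for ability in self.broad:
            if ability.code == code:
                return ability
        raise KeyError(code)


# Nine broad abilities as listed in the text; narrow examples are assumed placements.
CHC = CHCModel(
    general_factor="g",
    broad=[
        BroadAbility("Gf", "Fluid Reasoning", ["inductive reasoning"]),
        BroadAbility("Gc", "Crystallized Knowledge", ["vocabulary depth"]),
        BroadAbility("Gq", "Quantitative Reasoning"),
        BroadAbility("Grw", "Reading/Writing Ability"),
        BroadAbility("Gs", "Processing Speed"),
        BroadAbility("Gv", "Visual-Spatial Ability"),
        BroadAbility("Gsm", "Short-Term Memory"),
        BroadAbility("Glr", "Long-Term Storage and Retrieval"),
        BroadAbility("Ga", "Auditory Processing"),
    ],
)
```

A test battery built on this structure would attach items to each `BroadAbility` and sample across its `narrow` list for domain coverage.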
2. Application of CHC to LLM Evaluation
Evaluations of LLMs under the CHC paradigm operationalize each broad ability as a “suite of psychometric-style items.” For example, Gf is mapped to matrix reasoning tasks; Gc to trivia, fact recall, and definitional knowledge; Gq to word problems or symbolic manipulations; Grw to passage comprehension and text production. All items are presented as text prompts and are adapted from standard human test banks.
The evaluation workflow encompasses:
- Prompt Formatting: Human-designed items reformulated for LLM compatibility.
- Coverage Sampling: Ensuring that subdomains (Stratum I abilities) are proportionately represented for each broad factor.
- Response Scoring: Employing two complementary scoring regimes:
  - Binary Accuracy: Exact-match criterion consistent with classical psychometrics: responses must match predefined keys.
  - LLM-judge Scoring: An independent LLM acts as an adjudicator, scoring on conceptual correctness (full, partial, or incorrect), independent of response formatting.
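The two scoring regimes can diverge on the same response, which a minimal sketch makes concrete. Everything here is an assumption made for illustration: the whitespace and case normalization in `binary_score`, the 0.5 credit for a "partial" verdict, and `toy_judge`, a hypothetical substring-based stand-in for a real LLM adjudicator.

```python
from typing import Callable


def binary_score(response: str, key: str) -> int:
    """Exact-match scoring: 1 only if the normalized response equals the key."""
    return int(response.strip().lower() == key.strip().lower())


# Credit per judge verdict; the 0.5 for "partial" is an assumed convention.
JUDGE_CREDIT = {"full": 1.0, "partial": 0.5, "incorrect": 0.0}


def judge_score(response: str, key: str, judge: Callable[[str, str], str]) -> float:
    """Map the judge's verdict (full / partial / incorrect) to a numeric credit."""
    return JUDGE_CREDIT[judge(response, key)]


def toy_judge(response: str, key: str) -> str:
    """Hypothetical stand-in for an LLM judge, based on substring containment."""
    r, k = response.lower(), key.lower()
    if k in r:
        return "full"
    if any(token in r for token in k.split()):
        return "partial"
    return "incorrect"


answer = "The capital is Paris."
exact = binary_score(answer, "Paris")             # fails exact match
judged = judge_score(answer, "Paris", toy_judge)  # judged conceptually correct
```

Here the verbose but correct answer scores 0 under exact match and 1.0 under the judge, the same kind of format-driven disagreement that the two regimes are meant to expose.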
3. Statistical Methods and Key Metrics
Rigorous statistical frameworks underpin the evaluation of LLM-CHC alignment:
- Judge–Binary Correlation: Across all domains, the correlation between LLM-judge and binary accuracy scores is weak ($|r| \approx 0.18$), yielding $r^2 \approx 0.031$. Thus, binary accuracy accounts for only ~3.1% of the variance in LLM-judge evaluations.
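- Item Response Theory (IRT): The 2PL model below characterizes item difficulty and discrimination, with the 3PL form also discussed:
$$P(X_i = 1 \mid \theta) = \frac{1}{1 + e^{-a_i(\theta - b_i)}}$$
where $a_i$ is item discrimination, $b_i$ is item difficulty, and $\theta$ is the latent ability of the respondent. The 3PL extension introduces a guessing parameter $c_i$, giving $P(X_i = 1 \mid \theta) = c_i + (1 - c_i)\,\frac{1}{1 + e^{-a_i(\theta - b_i)}}$.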
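- Paradox Severity Index (PSI): To quantify the disconnect between judge-based and binary scores, PSI is defined as:
$$\mathrm{PSI} = \left|\bar{A}_{\text{judge}} - \bar{A}_{\text{binary}}\right| \times \frac{\mathrm{IQ}}{100}$$
where $\bar{A}_{\text{judge}}$ and $\bar{A}_{\text{binary}}$ are mean domain accuracies, and $\mathrm{IQ}$ is model IQ from Classical Test Theory.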
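The three metrics in this section reduce to a few lines of arithmetic. The sketch below assumes the standard logistic form of the 2PL/3PL model and assumes that PSI is the accuracy gap scaled by IQ/100, a form that matches the Gemini 2.5 Flash figures quoted in Section 4 up to rounding (gap 0.493, IQ 121.4, PSI ≈ 0.598). The function names are hypothetical.

```python
import math


def variance_explained(r: float) -> float:
    """Share of variance one score explains in the other, given Pearson r."""
    return r * r


def irt_probability(theta: float, a: float, b: float, c: float = 0.0) -> float:
    """Probability of a correct response under the 3PL model.

    a: item discrimination, b: item difficulty, c: guessing parameter.
    With c = 0 this reduces to the 2PL model.
    """
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


def paradox_severity_index(gap: float, iq: float) -> float:
    """Assumed form: absolute judge/binary accuracy gap scaled by IQ / 100."""
    return abs(gap) * iq / 100.0


share = variance_explained(0.176)          # roughly 3.1% of variance
p_2pl = irt_probability(theta=0.0, a=1.5, b=0.0)
p_3pl = irt_probability(theta=0.0, a=1.5, b=0.0, c=0.25)
psi = paradox_severity_index(gap=0.493, iq=121.4)
```

When ability equals difficulty, the 2PL probability is exactly 0.5, and the 3PL guessing floor lifts it to `c + (1 - c) / 2`.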
4. Empirical Paradoxes in CHC-Based LLM Evaluation
Empirical results reveal domain-specific artifacts inconsistent with valid cognitive measurement:
- Broad Ability Results:
  - Gf (Fluid Reasoning): Judge ≈ 0.59, Binary ≈ 0.42 (gap ≈ 0.17)
  - Gc (Crystallized Knowledge): Judge ≈ 0.37, Binary = 1.00 (gap ≈ -0.63)
  - Gq (Quantitative): Judge ≈ 0.71, Binary ≈ 0.26 (gap ≈ 0.45)
  - Grw (Reading/Writing): Judge ≈ 0.50, Binary ≈ 0.10 (gap ≈ 0.40)
- Crystallized Knowledge Paradox: All evaluated LLMs achieve 100% binary accuracy on Gc, yet their judge scores for conceptual correctness range from 25% (GPT-4 Turbo) to 62% (Claude 3 Opus), violating convergent measurement assumptions. The probability of eight models achieving 100% on all 50 Gc items by chance is vanishingly small. Such results constitute a category-level error, not a grading artifact.
- Paradox Severity Index: Leading models (e.g., Gemini 2.5 Flash: IQ = 121.4, Gap = 0.493, PSI ≈ 0.598) exhibit growing misalignment between standardized IQ and the binary/judge accuracy gap as model capability increases.
| Broad Ability | Judge Accuracy | Binary Accuracy | Accuracy Gap (Judge - Binary) |
|---|---|---|---|
| Fluid Reasoning (Gf) | 0.59 | 0.42 | 0.17 |
| Crystallized (Gc) | 0.37 | 1.00 | -0.63 |
| Quantitative (Gq) | 0.71 | 0.26 | 0.45 |
| Reading/Writing (Grw) | 0.50 | 0.10 | 0.40 |
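The gap column can be checked directly from the judge and binary accuracies in the table. The short sketch below recomputes each gap as judge minus binary and averages the absolute gaps; the summary statistic `mean_abs_gap` is added here for illustration and is not a figure reported in the source.

```python
# (judge accuracy, binary accuracy) per broad ability, copied from the table.
SCORES = {
    "Gf": (0.59, 0.42),
    "Gc": (0.37, 1.00),
    "Gq": (0.71, 0.26),
    "Grw": (0.50, 0.10),
}


def accuracy_gaps(scores: dict) -> dict:
    """Judge minus binary accuracy per domain, rounded to two decimals."""
    return {code: round(judge - binary, 2) for code, (judge, binary) in scores.items()}


gaps = accuracy_gaps(SCORES)

# Mean absolute gap across the four domains (illustrative summary only).
mean_abs_gap = sum(abs(g) for g in gaps.values()) / len(gaps)

# Domain where the two scoring regimes disagree most.
largest = max(gaps, key=lambda code: abs(gaps[code]))
```

Gc is the only domain with a negative gap and also the one with the largest absolute disagreement, which is why it is singled out as the Crystallized Knowledge Paradox.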
5. Theoretical and Methodological Implications
Emergent paradoxes in LLM assessment by CHC protocols expose fundamental challenges:
- Ontological Category Error: CHC theory is anchored in human-type cognition (serial, limited-capacity processing, embodied learning, and forgetting), none of which is present in transformer-based models.
  - Analogous misapplied tests include using Snellen charts to assess non-ocular sensors or evaluating database memory with organic recall protocols.
- Illusory General Factor (“g”): In LLMs, latent “g” is a byproduct of dataset correlations ($D$), architecture constraints ($A$), prompt sensitivity ($P$), and tokenization artifacts ($T$), schematically:
$$g_{\text{LLM}} = f(D, A, P, T)$$
  - This diverges ontologically and functionally from “g” in human populations, which reflects biological and neurodevelopmental substrates.
- Measurement Theatre: The simultaneity of high binary accuracy and low judge correctness is not a scoring artifact, but reflects the inadequacy of substrate-independent measurement.
- Anthropomorphic Bias: Conventional frameworks obscure the assessment of non-human abilities, incentivize surface-level accuracy, and introduce errors in capability characterization relevant to alignment and safety considerations.
6. Toward Native Machine Cognition Assessment
To address the fundamental incommensurabilities between CHC-based frameworks and model architectures, a set of principles is proposed for machine-native cognitive evaluation:
- Capability-Based Assessment: Define and probe what systems can reliably do, independent of human analogs.
- Architecture-Aware Testing: Employ probes that reflect parallel attention, tokenization, and unique context window behaviors inherent to LLMs.
- Emergent Property Detection: Prioritize detection of genuinely new, compositional, or adversarial capabilities not attributable to rote pattern completion.
- Compositional Evaluation: Systematically combine primitives to test generalization and recombination potential.
- Information-Theoretic Metrics: Emphasize mutual information (e.g., bits transferred per output) and transformation complexity; for example, Machine Cognitive Throughput (MCT):
$$\mathrm{MCT} = \frac{\hat{I}(X; Y)}{\text{Latency}}$$
where $\hat{I}(X; Y)$ is estimated mutual information and Latency the mean response time.
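As a rough illustration of the throughput idea, the sketch below estimates mutual information with a simple plug-in (empirical frequency) estimator and divides it by mean latency. Several things are assumptions rather than details from the source: that MCT is mutual information over latency, that X and Y are discrete input and output labels, that latency is measured in seconds, and the function names themselves.

```python
import math
from collections import Counter


def mutual_information_bits(pairs: list) -> float:
    """Plug-in estimate of I(X; Y) in bits from observed (x, y) pairs."""
    n = len(pairs)
    joint = Counter(pairs)
    count_x = Counter(x for x, _ in pairs)
    count_y = Counter(y for _, y in pairs)
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x, y) * log2( p(x, y) / (p(x) * p(y)) ), written with raw counts
        mi += (c / n) * math.log2(c * n / (count_x[x] * count_y[y]))
    return mi


def machine_cognitive_throughput(pairs: list, latencies_s: list) -> float:
    """Assumed form of MCT: estimated mutual information over mean latency."""
    mean_latency = sum(latencies_s) / len(latencies_s)
    return mutual_information_bits(pairs) / mean_latency


# Output fully determined by input, two equally likely cases: 1 bit per item.
dependent = [("a", "A"), ("b", "B")] * 2
# Output unrelated to input: 0 bits per item.
independent = [("a", "A"), ("a", "B"), ("b", "A"), ("b", "B")]

mct_dependent = machine_cognitive_throughput(dependent, [0.5, 0.5])
mct_independent = machine_cognitive_throughput(independent, [0.5, 0.5])
```

The plug-in estimator is biased upward on small samples, so a practical evaluation would need many items per condition or a bias-corrected estimator.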
Preliminary studies report a correlation between information-theoretic indices and downstream robustness in reasoning tasks, with item-level IRT parameters aligning more closely with architectural depth and attention head counts, a pattern distinct from human-normed scaling. This strategy re-contextualizes intelligence measurement for fundamentally alien, non-biological architectures (Reddy, 23 Nov 2025).