Cross-Substrate Cognitive Evaluation
- A multi-criteria framework decomposes cognitive-theory evaluation into fidelity, lawfulness, usability, beauty, and comprehensiveness.
- The field uses universal test constructions, statistical cross-modal tests, and transfer-learning approaches to compare cognitive performance across diverse substrates.
- Empirical benchmarks show that hybrid and substrate-aware methods help narrow the performance gaps between symbolic and neural systems in realistic evaluations.
Cross-substrate cognitive evaluation systematically investigates and compares cognitive competencies, representations, and reasoning mechanisms across fundamentally different computational, biological, and hybrid substrates. The field addresses the methodological, theoretical, and empirical challenges that arise when cognitive architectures based on symbolic reasoning (e.g., Soar, ACT-R) are evaluated alongside connectionist or generative neural architectures (e.g., transformers, LLMs), and when artificial and biological systems are assessed within common frameworks. The practice includes transferring cognitive diagnosis models across domains and disciplines, comparing activation patterns across neural modalities, and rigorously examining measurement theory when human-centered psychometric constructs are applied to non-human substrates.
1. Conceptual Foundations and Evaluation Criteria
Cross-substrate cognitive evaluation requires meta-theoretical scaffolding that accommodates radical substrate heterogeneity—ranging from modular, hand-coded symbolic theories to learned neural architectures, and from biological brains to artificial agents and collectives. Rosenbloom and Lord (Rosenbloom, 3 Oct 2025) introduced a directed acyclic evaluation-criteria graph to simultaneously ground and differentiate symbolic and generative theories. Rather than reducing comparison to a single “accuracy” axis, the framework decomposes theory evaluation into the following intertwined criteria:
- Fidelity (Accuracy, Groundedness): The degree of fit between theory/system predictions and observed phenomena, distinguishing overall behavioral matches from support for specific model components.
- Lawfulness (Coherence; Principledness → Propositionality, Generality): Structural self-consistency and the presence of general mathematical or propositional bases.
- Usability (Accessibility; Completeness; Tractability; Clarity → Unambiguity, Minimality, Simplicity): Practical attributes, including clarity of theoretical constructs and operational tractability.
- Beauty (Clarity; Exuberance → Minimality, Power): The balance between simplicity/minimalism and representational power.
- Comprehensiveness (Breadth of phenomena): The ratio of a system’s cognitive coverage to the maximal domain of interest.
Systems are evaluated both at the architectural level (the fixed, context-independent theory) and at the system level (the theory instantiated with full content: knowledge, parameters, skills). Because the framework is meta-theoretical, it introduces no new cost functions or statistical criteria; rather, it subsumes existing empirical findings and theoretical constructs (Rosenbloom, 3 Oct 2025).
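The criteria graph itself can be represented compactly as a directed acyclic structure. Below is a minimal Python sketch, assuming a parent-to-subcriteria mapping drawn from the criterion names above; the leaf scores and the mean-based roll-up are illustrative assumptions, not part of the published framework.

```python
# Minimal sketch of the evaluation-criteria DAG from Section 1.
# Edge structure follows the criterion names listed above; the leaf
# scores and mean-based aggregation are illustrative assumptions.

CRITERIA_DAG = {
    "Fidelity":          ["Accuracy", "Groundedness"],
    "Lawfulness":        ["Coherence", "Principledness"],
    "Principledness":    ["Propositionality", "Generality"],
    "Usability":         ["Accessibility", "Completeness", "Tractability", "Clarity"],
    "Clarity":           ["Unambiguity", "Minimality", "Simplicity"],
    "Beauty":            ["Clarity", "Exuberance"],
    "Exuberance":        ["Minimality", "Power"],
    "Comprehensiveness": [],  # scored directly as breadth of phenomena covered
}

def roll_up(criterion: str, leaf_scores: dict[str, float]) -> float:
    """Score a criterion as the mean of its children (or its own leaf score)."""
    children = CRITERIA_DAG.get(criterion, [])
    if not children:
        return leaf_scores.get(criterion, 0.0)
    return sum(roll_up(c, leaf_scores) for c in children) / len(children)

# Hypothetical leaf-level judgments for one system (values are placeholders).
leaf_scores = {"Accuracy": 0.8, "Groundedness": 0.6, "Coherence": 0.9,
               "Propositionality": 0.7, "Generality": 0.5, "Accessibility": 0.6,
               "Completeness": 0.4, "Tractability": 0.7, "Unambiguity": 0.8,
               "Minimality": 0.6, "Simplicity": 0.7, "Power": 0.9,
               "Comprehensiveness": 0.5}

for top in ("Fidelity", "Lawfulness", "Usability", "Beauty", "Comprehensiveness"):
    print(f"{top}: {roll_up(top, leaf_scores):.2f}")
```

Because the graph is a DAG rather than a tree, shared subcriteria such as Clarity and Minimality contribute to more than one top-level criterion, which is exactly the complementarity the framework is designed to expose.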
2. Methodological Paradigms and Representative Frameworks
Approaches to cross-substrate evaluation include qualitative meta-analyses of theory/system traits (Rosenbloom, 3 Oct 2025), construction of universal cognitive tests (Dowe et al., 2013), cross-modal statistical hypothesis testing (Kalinina et al., 2019), and data-driven transfer-learning frameworks for cross-domain diagnosis (Liu et al., 18 Jan 2025, Wang et al., 27 Oct 2025). Key paradigms include:
- Directed multi-criteria mapping: Symbolic and neural architectures are assessed against a structured multi-dimensional criteria graph, exposing trade-offs and complementarity (Rosenbloom, 3 Oct 2025).
- Universal test construction: Formalizes cross-substrate generalization by maximizing agent performance over task interfaces, time–resolution trade-offs, reward structures, and sensorimotor mappings. A universal test score can be written schematically as
$$\Upsilon(\pi, M) \;=\; \sup_{c \in C}\, \frac{1}{|M|} \sum_{\mu \in M} R(\pi, \mu, c),$$
where $\pi$ is the agent, $M$ the task family, $C$ the set of configurations (interface, reward, etc.), and $R(\pi, \mu, c)$ the agent’s score on task $\mu$ under configuration $c$ (Dowe et al., 2013). A minimal numerical sketch of this maximization follows this list.
- Statistical cross-modal pattern tests: Compare neural patterns across brain modalities or artificial sensors using permutation statistics (e.g., the cross-modal permutation test, CMPT), which controls Type I error under exchangeability and robustly detects shared representations (Kalinina et al., 2019); a generic permutation-test sketch also follows this list.
- Cross-domain transfer in cognitive diagnosis: Generalize student proficiency models by learning language-space mappers or shared cognitive feature extractors, validated via zero-shot transfer of model weights and empirical AUC/DOA performance (Liu et al., 18 Jan 2025, Wang et al., 27 Oct 2025).
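In practice, the supremum in the universal-test score can only be approached by evaluating the agent over a finite grid of configurations. Below is a minimal Python sketch under that assumption; the toy agent, tasks, and configuration grid are hypothetical stand-ins rather than the construction of Dowe et al. (2013).

```python
import itertools
import statistics

def universal_score(agent, tasks, configs) -> float:
    """Finite approximation of the universal test score: the best mean task
    score found over a grid of configurations. The true quantity takes a
    supremum over all configurations, so this is only a lower bound."""
    return max(
        statistics.mean(agent(task, config) for task in tasks)
        for config in configs
    )

# Toy usage: `agent(task, config)` returns a score in [0, 1]; here it is a
# hypothetical stand-in rewarding configurations matched to the task's modality.
toy_tasks = [{"modality": "text"}, {"modality": "vision"}]
toy_configs = list(itertools.product(["text", "vision"], [1, 10]))  # (interface, time budget)
toy_agent = lambda task, cfg: (0.9 if cfg[0] == task["modality"] else 0.3) * min(1.0, cfg[1] / 10)

print(round(universal_score(toy_agent, toy_tasks, toy_configs), 3))  # 0.6 for the best matched configuration
```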
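The cross-modal permutation idea can be sketched in the same spirit: compute a statistic measuring shared representational structure between two modalities, then rebuild its null distribution by permuting condition labels in one modality. The sketch below uses a representational-similarity statistic purely for illustration; it is not the exact CMPT procedure of Kalinina et al. (2019), and all names, shapes, and toy data are assumptions.

```python
import numpy as np

def rdm(patterns):
    """Condition-by-condition representational dissimilarity matrix (1 - correlation)."""
    return 1.0 - np.corrcoef(patterns)

def rdm_similarity(rdm_a, rdm_b):
    """Correlation of the upper triangles of two dissimilarity matrices."""
    iu = np.triu_indices_from(rdm_a, k=1)
    return np.corrcoef(rdm_a[iu], rdm_b[iu])[0, 1]

def cross_modal_permutation_test(patterns_a, patterns_b, n_perm=1000, seed=0):
    """Permutation p-value for shared representational structure across two
    modalities: condition labels of modality B are permuted, which is valid
    under exchangeability of conditions under the null hypothesis."""
    rng = np.random.default_rng(seed)
    rdm_a, rdm_b = rdm(patterns_a), rdm(patterns_b)
    observed = rdm_similarity(rdm_a, rdm_b)
    null = np.empty(n_perm)
    for i in range(n_perm):
        perm = rng.permutation(len(rdm_b))
        null[i] = rdm_similarity(rdm_a, rdm_b[np.ix_(perm, perm)])
    p_value = (np.sum(null >= observed) + 1) / (n_perm + 1)
    return observed, p_value

# Toy usage: 8 conditions, modality A has 30 features, modality B has 50,
# both generated from a shared latent condition structure.
rng = np.random.default_rng(1)
shared = rng.normal(size=(8, 5))
a = shared @ rng.normal(size=(5, 30)) + 0.1 * rng.normal(size=(8, 30))
b = shared @ rng.normal(size=(5, 50)) + 0.1 * rng.normal(size=(8, 50))
print(cross_modal_permutation_test(a, b))  # small p-value expected for shared structure
```

Working at the level of dissimilarity structure rather than raw activations is what allows substrates with different feature dimensionalities to be compared at all.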
3. Empirical Benchmarks and Domain-Transfer Experiments
Benchmarks for cross-substrate cognitive evaluation emphasize substrate-invariant protocols, wherein models, humans, or hybrid agents are tasked with the same set of stimuli and scoring rules, or are adapted by universal interface design. Examples include:
| Benchmark | Substrates Compared | Main Evaluation Feature |
|---|---|---|
| MME-CC (Zhang et al., 5 Nov 2025) | Multimodal LLMs (text+vision) | 11 tasks in spatial, geometric, and visual-knowledge reasoning; models must extract, abstract, and reason purely from visual information |
| LRCD (Liu et al., 18 Jan 2025) | Neural CDM across subject/platform | Language-description–based zero-shot transfer of student, concept, and exercise profiles enables cross-domain cognitive model transfer |
| TLCD (Wang et al., 27 Oct 2025) | Neural CDM across disciplines | Shared feature extractors and fine-tuned heads enable transfer between conceptually aligned source–target subject pairs |
| CogBench (Song et al., 28 Feb 2024) | Human–LVLM | Identical image, question, and scoring setup; quantifies cognition and reasoning gaps per category |
The empirical results demonstrate that cross-substrate transfer is feasible: e.g., LRCD achieves 95–99% of a fully retrained CDM’s accuracy on entirely unseen subjects and platforms, far exceeding random or indirect baselines (Liu et al., 18 Jan 2025), and TLCD achieves AUC gains of up to 4 points and RMSE/MAE reductions of up to 10% via pre-trained cognitive feature extractors (Wang et al., 27 Oct 2025). However, even state-of-the-art vision–LLMs exhibit large gaps in high-level causal and event reasoning relative to human performance under identical evaluation conditions (Song et al., 28 Feb 2024).
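The TLCD-style recipe cited above — a cognitive feature extractor pre-trained on a source discipline plus a head fine-tuned on the target discipline — can be sketched as follows. This is a minimal PyTorch sketch under assumed input shapes and layer sizes; the module names, dimensions, and the decision to freeze the extractor are illustrative assumptions, not the published architecture.

```python
import torch
import torch.nn as nn

class CognitiveFeatureExtractor(nn.Module):
    """Shared encoder over interaction features; pre-trained on the source discipline (hypothetical)."""
    def __init__(self, in_dim=64, hid_dim=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, hid_dim), nn.ReLU(),
                                 nn.Linear(hid_dim, hid_dim), nn.ReLU())
    def forward(self, x):
        return self.net(x)

class DiagnosisHead(nn.Module):
    """Discipline-specific head predicting the probability of a correct response."""
    def __init__(self, hid_dim=128):
        super().__init__()
        self.out = nn.Linear(hid_dim, 1)
    def forward(self, h):
        return torch.sigmoid(self.out(h))

# Transfer step: reuse the source-trained extractor (frozen here) and
# fine-tune only a new head on the target discipline.
extractor = CognitiveFeatureExtractor()
# extractor.load_state_dict(torch.load("source_extractor.pt"))  # hypothetical checkpoint
for p in extractor.parameters():
    p.requires_grad = False

head = DiagnosisHead()
optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)
loss_fn = nn.BCELoss()

# Toy target-domain batch: 32 interaction feature vectors with correctness labels.
x = torch.randn(32, 64)
y = torch.randint(0, 2, (32, 1)).float()
loss = loss_fn(head(extractor(x)), y)
loss.backward()
optimizer.step()
```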
4. Statistical and Theoretical Issues in Substrate Comparison
Application of cognitive evaluation frameworks across substrates exposes foundational issues regarding the transferability and validity of measurement constructs. Notably, the imposition of human psychometric frameworks (e.g., Cattell-Horn-Carroll theory) on LLMs produces severe paradoxes. Binary accuracy (exact-match) and judge-scored accuracy can be statistically independent or even mutually contradictory, as reflected by scenarios where models attain perfect binary accuracy but display wide variance in judge scores. This reveals a category error, as constructs such as “g-factor” or “working-memory span” reflect neural organizational principles absent in artificial transformers (Reddy, 23 Nov 2025).
Formally:
- For crystallized knowledge items, all evaluated models achieve 100% binary accuracy (no variance), but judge scores are dispersed (0.25–0.62), leading to undefined or zero correlation, which is theoretically impossible if both measures track the same latent trait.
- A Pearson correlation analysis between binary accuracy and judge scores likewise supports the empirical disconnect (Reddy, 23 Nov 2025).
The implication is that anthropomorphic cognitive scales, when applied to non-human substrates, may reduce to measurement theatre. Instead, substrate-aware metrics—layer-wise probing, capability-based tasks, information-theoretic quantities (KL divergence, mutual information)—are necessary for machine-native cognitive assessment (Reddy, 23 Nov 2025).
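Two of the substrate-aware quantities named above, KL divergence and mutual information, are straightforward to compute over discretized model outputs, and the zero-variance correlation pathology described in this section is equally easy to reproduce. The sketch below is illustrative only; the toy distributions and probe variables are assumptions, with the judge-score range taken from the 0.25–0.62 dispersion reported above.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same support."""
    p, q = np.asarray(p, float) + eps, np.asarray(q, float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def mutual_information(x_bins, y_bins):
    """Plug-in mutual information (in nats) between two discrete variables."""
    joint = np.histogram2d(x_bins, y_bins, bins=(np.ptp(x_bins) + 1, np.ptp(y_bins) + 1))[0]
    joint /= joint.sum()
    px, py = joint.sum(1, keepdims=True), joint.sum(0, keepdims=True)
    nz = joint > 0
    return float(np.sum(joint[nz] * np.log(joint[nz] / (px @ py)[nz])))

# Zero-variance pathology: binary accuracy is constant while judge scores vary,
# so the Pearson correlation is undefined (numpy warns and returns nan).
binary_acc = np.ones(6)
judge_scores = np.array([0.25, 0.31, 0.40, 0.47, 0.55, 0.62])
print(np.corrcoef(binary_acc, judge_scores)[0, 1])  # nan

# Substrate-aware alternatives on toy next-token distributions / probe variables.
print(kl_divergence([0.7, 0.2, 0.1], [0.5, 0.3, 0.2]))
print(mutual_information(np.array([0, 0, 1, 1, 2, 2]), np.array([0, 0, 1, 1, 1, 0])))
```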
5. Structural Insights and Taxonomic Analyses
Recent research formalizes taxonomies of cognitive operations, meta-cognitive controls, and representational strategies, making them observable and comparable across human, artificial, and hybrid systems. For example, a 28-element taxonomy encompassing logical, representational, and meta-cognitive behaviors reveals that humans typically engage more deeply in hierarchical organization, meta-cognitive monitoring, and decompositional strategies than do current LLMs, which predominantly rely on shallow forward chaining (Kargupta et al., 20 Nov 2025). Structural analyses of over 170K reasoning traces, spanning text, vision, and audio, show that models under-utilize behavior categories most closely correlated with success on ill-structured problems.
Moreover, test-time scaffolding—providing explicit behavioral sequence templates reflecting successful human reasoning structures—substantially improves LLM performance (up to 60% on complex problems), demonstrating that cross-substrate behaviors can be elicited by external interventions (Kargupta et al., 20 Nov 2025).
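Test-time scaffolding of this kind amounts to prepending an explicit behavior-sequence template to the problem prompt. The sketch below is a hypothetical illustration of that idea; the behavior labels are examples in the spirit of the taxonomy, not the paper's 28-element inventory or its exact prompt wording.

```python
# Hypothetical behavior-sequence scaffold; labels and wording are illustrative,
# not taken from the cited paper.
SCAFFOLD_STEPS = [
    "Decompose the problem into sub-goals before solving anything.",
    "Choose an explicit representation (diagram, table, or equations) for the givens.",
    "Reason forward from the givens, noting each intermediate conclusion.",
    "Monitor progress: check whether the current path still serves the sub-goals.",
    "Verify the final answer against the original constraints before committing.",
]

def scaffolded_prompt(problem: str) -> str:
    """Prepend an explicit behavioral-sequence template to a problem statement."""
    steps = "\n".join(f"{i}. {s}" for i, s in enumerate(SCAFFOLD_STEPS, 1))
    return ("Solve the problem by following these reasoning behaviors in order:\n"
            f"{steps}\n\nProblem:\n{problem}\n")

print(scaffolded_prompt("A train leaves at 9:00 ..."))
```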
6. Theoretical Limits and Future Prospects
Cross-substrate universality is bounded by interface, reward, and adaptation limitations. Universal tests, as proposed by Dowe & Hernández-Orallo, are only upper semi-computable because the maximization over possible test configurations—and indefinitely long time horizons—cannot be completed in finite time (Dowe et al., 2013). Bias toward underestimation of ability is inherent, especially for agents with atypical timing or sensorimotor profiles. Adaptivity and exploration of configuration space are therefore necessary, but risk conflating innate competence with test-induced learning if repeated exposures are allowed (Dowe et al., 2013).
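Because the optimization over configurations and time horizons cannot be exhausted in finite time, practical universal testing is necessarily an anytime approximation: each round widens the configuration sample and lengthens the horizon, and the running estimate improves but never provably converges. The generator below is a minimal sketch of that idea, with `evaluate` and `sample_config` as hypothetical stand-ins for running the agent under one configuration and horizon.

```python
import itertools
import random

def anytime_universal_score(agent, tasks, sample_config, evaluate, seed=0):
    """Yield successively refined estimates of a universal test score.
    Round k samples k configurations and uses a horizon of 2**k steps; the
    running maximum never decreases, but the search never terminates on its own."""
    rng = random.Random(seed)
    best = float("-inf")
    for k in itertools.count(1):
        horizon = 2 ** k
        for _ in range(k):
            config = sample_config(rng)
            mean_score = sum(evaluate(agent, t, config, horizon) for t in tasks) / len(tasks)
            best = max(best, mean_score)
        yield k, horizon, best

# Toy usage with hypothetical stand-ins for the evaluator and configuration sampler.
toy_eval = lambda agent, task, cfg, horizon: min(1.0, cfg * (1 - 1 / horizon))
gen = anytime_universal_score(agent=None, tasks=[1, 2, 3],
                              sample_config=lambda rng: rng.uniform(0.0, 1.0),
                              evaluate=toy_eval)
for _ in range(5):
    print(next(gen))  # estimates improve monotonically round by round
```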
Looking ahead, properly substrate-aware evaluation frameworks must integrate:
- Meta-theoretical, multi-criteria scaffolds for theory-system comparison (Rosenbloom, 3 Oct 2025).
- Adaptive benchmarking maximizing performance across configuration parameters (Dowe et al., 2013).
- Information-theoretic and mechanistic metrics exposing true computation-level abilities (Reddy, 23 Nov 2025).
- Cross-modal, behavioral, and structural analyses diagnosing behavioral gaps and pointing to model improvement directions (Kargupta et al., 20 Nov 2025, Zhang et al., 5 Nov 2025).
7. Synthesis and Implications
Cross-substrate cognitive evaluation has demonstrated that no single substrate—symbolic or neural, biological or artificial—dominates on all dimensions of cognitive merit. Symbolic approaches currently lead in lawfulness, interpretability, and system-2–style metacontrol, while neural architectures excel at breadth, compressive expressivity, and pattern induction (Rosenbloom, 3 Oct 2025, Zhang et al., 5 Nov 2025). Hybrid models and composite evaluation strategies are required to overcome the current trade-offs. Critically, future progress hinges on abandoning anthropomorphic measurement assumptions in favor of efficient, substrate-native, and mechanistically informative evaluation principles. In sum, the maturation of cross-substrate evaluation is central to understanding whole-mind intelligence, advancing cognition-grounded AI research, and calibrating empirical assessments across the continuum of cognitive substrates.