Camlang-CSQA-v0 Evaluation
- Camlang-CSQA-v0 is a benchmark designed to assess the metalinguistic reasoning of LLMs through a novel constructed language with explicit grammatical resources.
- It requires models to perform precise morphosyntactic parsing, lexical mapping via a bilingual dictionary, and semantic integration of commonsense inference.
- Experimental results highlight a significant gap, with LLMs underperforming sharply against human scores, emphasizing the need for enhanced rule-based reasoning architectures.
Camlang-CSQA-v0 is a diagnostic evaluation task designed to test the metalinguistic deductive reasoning capabilities of LLMs. It leverages Camlang, a novel constructed language with explicit resources (a grammar book and a bilingual dictionary), and adapts the CommonsenseQA paradigm to require mastery of morphosyntactic rules and lexical mapping. The task distinguishes itself by exposing fundamental discrepancies between the proficient pattern matching LLMs exhibit on familiar data and their limited ability to internalize and systematically apply explicit grammatical systems—a hallmark of human language acquisition.
1. Design Principles of the Camlang Language
Camlang is intentionally engineered to maximize the challenge for LLMs in the domain of metalinguistic reasoning. Its typological design draws on attested but rarely co-occurring linguistic features and combines them into an explicit, unfamiliar arrangement. Phonologically, Camlang incorporates mechanisms such as Turkic-style vowel harmony and Celtic-influenced consonant mutations. Morphologically, the language is both highly agglutinative and polysynthetic, employing a broad array of morphological processes including prefixation, suffixation, circumfixation, and cliticization.
Functional markers in Camlang span tense, aspect, evidentiality, case marking, and agreement, realized through diverse affixal and clitic placements. Interrogative clauses require both verb fronting and the attachment of a specialized proclitic (e.g., “nAs=”) whose vowel harmonizes with the host. Every morphophonological transformation is governed by the explicit rulebook and illustrated with interlinear glosses:
    lichéwcymyÅür
    Segmented:  li-   chew  {}      -cy    -my    -Åür  {}
    Morphemic:  lI=   x=    cew     -RED   …      -s    =jUr
    Gloss:      2SG=  EZ=   answer  -PROG  -NMLS  -GEN  =at
    ‘when you are answering’
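The uppercase vowels in the morphemic line (e.g., “lI=”, “=jUr”) appear to follow the common convention of archiphonemes whose surface form is resolved by vowel harmony, as with the “A” in the interrogative proclitic “nAs=”. The sketch below shows one way such a rule could be applied mechanically; the vowel classes, helper names, and example stems are illustrative assumptions, not the Camlang rulebook itself.

```python
# Illustrative sketch of archiphoneme resolution under vowel harmony.
# The vowel classes, morphemes, and stems are hypothetical examples,
# not the actual rules from the Camlang grammar book.

FRONT_VOWELS = set("ieöü")
BACK_VOWELS = set("aouı")

def last_vowel(stem: str) -> str | None:
    """Return the rightmost vowel of the stem, if any."""
    for ch in reversed(stem):
        if ch in FRONT_VOWELS or ch in BACK_VOWELS:
            return ch
    return None

def harmonize(morpheme: str, stem: str) -> str:
    """Resolve the archiphoneme 'A' in an affix or clitic to a front or back
    vowel, depending on the last vowel of the host stem."""
    v = last_vowel(stem)
    realized = "e" if v in FRONT_VOWELS else "a"
    return morpheme.replace("A", realized)

# Attaching the interrogative proclitic "nAs=" to two hypothetical stems:
print(harmonize("nAs=", "chew") + "chew")  # nes=chew (front-harmonic stem)
print(harmonize("nAs=", "kol") + "kol")    # nas=kol  (back-harmonic stem)
```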
A critical property is that Camlang resource materials (grammar book and dictionary) are fully explicit, permitting deductive rule application while precluding recourse to prior training data patterns.
2. Task Construction: Camlang-CSQA-v0
To probe LLMs’ capacity for explicit rule-based reasoning, Camlang-CSQA-v0 repurposes CommonsenseQA by translating 47 multiple-choice instances into Camlang, each providing one commonsense question and six answer options. Questions and options are expressed in the constructed language, with the grammar book and bilingual dictionary as the only resources. A model (or human) must parse morphologically dense strings, segment them according to the rulebook, translate them via the dictionary, and map the resulting semantics onto world knowledge to answer the question.
This design ensures that memorized statistical patterns from massive pretraining corpora are unhelpful. Success requires a chain of reasoning: (i) morphosyntactic parsing, (ii) lexical mapping, (iii) composition of semantic meaning, and (iv) integration with commonsense inference.
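To make this chain concrete, the sketch below lays out a hypothetical item format and a naive solver that walks the four stages; the field names, helper functions, and the trivial segmentation and composition logic are assumptions for illustration, not the released data schema or a working solver.

```python
# Hypothetical sketch of a Camlang-CSQA-v0 item and the four-stage reasoning
# chain described above; names and logic are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class CamlangItem:
    question: str        # commonsense question rendered in Camlang
    options: list[str]   # six answer options, also in Camlang
    answer_idx: int      # index of the gold option

def parse_morphosyntax(text: str) -> list[str]:
    """Placeholder segmenter: a real solver must apply the grammar-book rules;
    here we merely split on whitespace and hyphens."""
    return [m for token in text.split() for m in token.split("-") if m]

def solve(item: CamlangItem, dictionary: dict[str, str]) -> int:
    """(i) morphosyntactic parsing, (ii) lexical mapping, (iii) semantic
    composition, (iv) commonsense inference over the answer options."""
    morphemes = parse_morphosyntax(item.question)               # (i)
    glosses = [dictionary.get(m, m) for m in morphemes]         # (ii)
    meaning = set(" ".join(glosses).split())                    # (iii) naive composition
    # (iv) crude stand-in for commonsense inference: pick the option whose
    # dictionary gloss overlaps most with the composed meaning.
    scores = [len(meaning & set(dictionary.get(o, o).split())) for o in item.options]
    return max(range(len(item.options)), key=scores.__getitem__)
```

A solver like this would likely fail on Camlang precisely because its placeholder segmentation and bag-of-words composition ignore the grammar; the benchmark is designed so that only genuine rule application closes that gap.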
3. Experimental Findings: LLMs vs Human Performance
A series of exact-match (EM) accuracy evaluations highlights the magnitude of the challenge. On the English version of CommonsenseQA, models reach high EM scores (up to 98% for GPT-5). On the Camlang version, performance degrades severely: EM drops to 47% for GPT-5, and other models (including GPT-4o, o3, o4-mini, and DeepSeek-R1) score between 21% and 46%. In contrast, a human participant using only the grammar book and dictionary achieves 87% EM on Camlang-CSQA-v0.
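Exact match here is simply the fraction of items whose predicted option equals the gold option; a minimal scoring sketch, assuming predictions and gold answers are given as lists of option indices:

```python
def exact_match_accuracy(predictions: list[int], gold: list[int]) -> float:
    """Fraction of items whose predicted option index equals the gold index."""
    assert len(predictions) == len(gold) and gold, "need aligned, non-empty lists"
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)

# e.g. getting 22 of the 47 Camlang items right yields ~0.47 EM.
```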
These results strongly suggest that state-of-the-art LLMs, despite high proficiency on standard knowledge benchmarks, lack mechanisms for explicit deductive learning analogous to those used by human linguists and language learners.
4. Metalinguistic Reasoning Verification
To determine the source of observed successes and failures, a human verification protocol classifies model reasoning traces into three increasingly lenient categories (see the sketch following the list):
- Strict Human-Verified Accuracy (SHV ACC): Full and correct parsing, translation, and semantic mapping.
- Moderate Human-Verified Accuracy (MHV ACC): Accepts incomplete semantic explanations if syntactic parsing is correct.
- Lenient Human-Verified Accuracy (LHV ACC): Permits answers based on partial or shallow lexical alignment.
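To make the ordering of these tiers explicit, the sketch below classifies a verified trace by increasingly lenient criteria; the boolean flags are hypothetical stand-ins for the human annotation schema, not the actual protocol.

```python
# Hypothetical encoding of the SHV / MHV / LHV verification tiers; the flag
# names are illustrative assumptions, not the paper's annotation schema.
from dataclasses import dataclass
from enum import Enum

class VerifiedLevel(Enum):
    NONE = 0      # correct answer but no defensible reasoning (e.g., a guess)
    LENIENT = 1   # LHV: partial or shallow lexical alignment only
    MODERATE = 2  # MHV: correct syntactic parsing, incomplete semantics
    STRICT = 3    # SHV: full parsing, translation, and semantic mapping

@dataclass
class TraceAnnotation:
    parsing_correct: bool       # morphosyntactic segmentation follows the rules
    translation_correct: bool   # lexical mapping via the dictionary is right
    semantics_correct: bool     # composed meaning and commonsense step are sound
    lexical_alignment: bool     # at least some shallow word-level alignment

def classify(trace: TraceAnnotation) -> VerifiedLevel:
    """Assign the strictest tier whose criteria the trace satisfies."""
    if trace.parsing_correct and trace.translation_correct and trace.semantics_correct:
        return VerifiedLevel.STRICT
    if trace.parsing_correct:
        return VerifiedLevel.MODERATE
    if trace.lexical_alignment:
        return VerifiedLevel.LENIENT
    return VerifiedLevel.NONE
```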
Empirical findings reveal that even the best model, GPT-5 in the context-only setup, achieves 0% SHV ACC and only ≈30% LHV ACC, whereas the human participant scores markedly higher (55% SHV, 60% MHV, 68% LHV). This indicates that most correct model outputs arise from guesswork or shallow lexical mapping, rarely from systematic grammatical deduction. Reasoning traces typically lack full application of the rules or correct morphosyntactic segmentation.
5. Cognitive and Technical Significance
Camlang-CSQA-v0 represents a cognitively motivated benchmark. While typical NLP benchmarks may permit models to exploit statistical regularities or data leakage, Camlang-CSQA-v0 is immune to such pitfalls due to its novelty and fully explicit formal resources. It exposes a fundamental gap in current LLM architectures: an inability to generalize metalinguistic deductive reasoning from instructions in the grammar to structured operations over unfamiliar linguistic input.
The substantial performance gap between humans and LLMs on this task—and the detailed qualitative evidence from reasoning trace analysis—suggests that high performance on standard benchmarks may overstate true language understanding abilities. Shallow pattern recognition can masquerade as reasoning in familiar languages but quickly fails in an unfamiliar, explicitly defined linguistic environment.
6. Implications for LLM Development and Evaluation
The results from Camlang-CSQA-v0 point directly to architectural and training limitations. Current transformer-based LLMs, while proficient at in-context extrapolation, do not reliably exhibit the human-like rule acquisition and explicit symbolic manipulation needed to operate over a novel constructed grammar. Models show only rudimentary, fragmentary metalinguistic competence and cannot operationalize the complex rule chains required by the Camlang paradigm.
A plausible implication is that future research should prioritize:
- Training procedures or architectures that encourage symbolic and rule-governed linguistic reasoning,
- Diagnostic benchmarks using unfamiliar, explicit linguistic systems to gauge true reasoning (avoiding data leakage),
- Protocols for integrating explicit external resources with model inference in a compositional manner.
Camlang-CSQA-v0 establishes a standard for evaluating metalinguistic learning, posing a cognitively principled challenge not only for NLP but for broader computational models aiming at human-like generalization and deductive learning.