Camlang: A Tool for Metalinguistic Reasoning

Updated 6 September 2025

Camlang is a constructed language featuring a unique integration of phonological, morphological, and syntactic elements to evaluate explicit metalinguistic reasoning in LLMs.
It provides explicit resources, including a detailed grammar book and a bilingual dictionary, to simulate adult second-language acquisition and enforce rule-based learning.
Performance analysis shows humans achieve near-native accuracy while LLMs struggle with systematic rule application and hierarchical morphological processing.

Camlang is a constructed language devised to evaluate and diagnose the metalinguistic reasoning capacities of LLMs under explicit deductive learning conditions. It combines phonological, morphological, and syntactic features that are each attested in natural languages, but in an overall configuration that is typologically unattested. Its structure is therefore plausible at the level of individual features yet novel in its integration, making it entirely unfamiliar to LLMs and human participants alike. The language is accompanied by two explicit resources, a comprehensive grammar book and a bilingual dictionary, designed to simulate adult second-language acquisition and to enforce evaluation regimes requiring explicit rule-following. Camlang is operationalized through the Camlang-CSQA-v0 dataset, an adaptation of CommonsenseQA, and forms the basis of a cognitive evaluation paradigm for systematically probing the limits of LLM grammatical induction and reasoning (Liu et al., 30 Aug 2025).

1. Linguistic Structure and Design Principles

Camlang’s architecture is characterized by a combination of phonological complexity, morphosyntactic integration, and typological novelty.

Phonology: Includes an extensive inventory of aspirated voiceless stops with systematic phonological mutations, a Turkic-style eight-vowel harmonic system, and a strict (C)V(C) syllable structure that mandates epenthesis.
Morphosyntax: Presents a unified system where noun morphosyntax encodes plurality, definiteness, and case, while verbs express tense, aspect, evidentiality, mood, and agreement, frequently through cliticization, affixation, and compounding.
Typological Profile: Every atomic component is attested among natural languages, yet their coordination in Camlang is intentionally novel and unattested in any single natural language, a fact validated using typological similarity metrics such as

$\text{sim}(X,Y) = \frac{\sum_{f \in F_X \cap F_Y} \delta_f(X,Y)}{|F_X \cap F_Y|}$

where $F_X$ and $F_Y$ are the feature sets of languages $X$ and $Y$, the sum runs over the features they share, and $\delta_f(X,Y)$ scores their agreement on feature $f$.
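The metric above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes that $\delta_f(X,Y)$ is 1 when the two languages have the same value for feature $f$ and 0 otherwise, and the feature names and values below are hypothetical.

```python
from typing import Dict


def typological_similarity(x: Dict[str, str], y: Dict[str, str]) -> float:
    """Mean agreement over the features that are coded for both languages."""
    shared = set(x) & set(y)  # F_X ∩ F_Y
    if not shared:
        raise ValueError("no shared features to compare")
    # Assumed delta_f: 1 if both languages have the same value for f, else 0.
    agree = sum(1 for f in shared if x[f] == y[f])
    return agree / len(shared)


# Hypothetical feature codings, not taken from the paper or any typological database.
lang_a = {"vowel_harmony": "yes", "evidentiality": "yes", "word_order": "SOV"}
lang_b = {"vowel_harmony": "yes", "evidentiality": "no", "word_order": "SOV", "tone": "no"}

score = typological_similarity(lang_a, lang_b)
print(round(score, 3))  # 0.667: two of the three shared features agree
```

Under this reading, a constructed language is "unattested" as a whole when its similarity to every natural language in the comparison set stays below 1, even though each individual feature value occurs somewhere.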

The design rationale is to prevent models from leveraging superficial pattern matching and instead force reliance on the explicit rules articulated in the grammar book and dictionary.

2. Experimental Protocol and Resources

The experimental framework pits both LLMs and human participants against tasks requiring full utilization of Camlang’s explicit resources.

Resources: Two primary materials are provided: a detailed grammar book (in English, with Camlang examples) specifying phonology, morphology, and syntax, and a bilingual English–Camlang lexicon.
Zero-Shot Protocol: Participants solve tasks in Camlang without prior exposure or in-context learning, using only the grammar and dictionary. Models can access resources either by context inclusion (full-text prompts) or via explicit search tools. This setup is designed to simulate deductive learning as in adult second-language acquisition.
Task Construction: The CommonsenseQA dataset is adapted to Camlang, resulting in Camlang-CSQA-v0. Both questions and answer options are translated and, where necessary, entity names are rephrased so that all required lexicon is explicitly present in the dictionary.

This protocol isolates the metalinguistic reasoning capacity by making successful completion contingent on correct rule and lexical application, rather than on prior language exposure.
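The two resource-access modes described above (context inclusion and explicit search) can be illustrated with a small sketch. The function names, prompt layout, and toy entries are assumptions made for illustration; the source does not specify the actual prompt format or tool interface.

```python
from typing import Dict, List, Tuple


def build_context_prompt(grammar: str, dictionary: Dict[str, str],
                         question: str, options: List[str]) -> str:
    """Context-inclusion mode: the full grammar and dictionary go into the prompt."""
    entries = "\n".join(f"{head}: {gloss}" for head, gloss in sorted(dictionary.items()))
    labelled = "\n".join(f"{chr(ord('A') + i)}. {opt}" for i, opt in enumerate(options))
    return (
        "GRAMMAR BOOK\n" + grammar + "\n\n"
        "DICTIONARY\n" + entries + "\n\n"
        "QUESTION\n" + question + "\n" + labelled + "\n"
        "Answer with a single option letter."
    )


def search_dictionary(dictionary: Dict[str, str], query: str,
                      limit: int = 5) -> List[Tuple[str, str]]:
    """Search-tool mode: return entries whose headword or gloss contains the query."""
    q = query.lower()
    hits = [(head, gloss) for head, gloss in sorted(dictionary.items())
            if q in head.lower() or q in gloss.lower()]
    return hits[:limit]


# Invented placeholder entries; these are NOT real Camlang words.
toy_dictionary = {"kato": "house", "mira": "water"}

prompt = build_context_prompt("<grammar book text>", toy_dictionary,
                              "<Camlang question>", ["<option 1>", "<option 2>"])
hits = search_dictionary(toy_dictionary, "house")
print(hits)  # [('kato', 'house')]
```

In both modes the solver sees no worked Camlang examples beyond those in the grammar book, which is what makes the setting zero-shot.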

3. Performance Analysis and Error Taxonomy

The empirical results demonstrate a striking divergence between human and LLM capabilities on Camlang.

Performance Metrics:
- LLMs (GPT-5, GPT-o3, GPT-4o, DeepSeek-R1, etc.) achieve high accuracy (85–98% EM) on CommonsenseQA in English.
- When evaluated on Camlang-CSQA-v0, scores drop sharply (e.g., GPT-5: 47% EM; others: 21–40% EM).
- Human participants, using only the explicit resources, achieve near-native performance (~87% EM) on Camlang.
Error Analysis:
- Human verification schemes (SHV: Strict Human Verification; MHV: Moderate; LHV: Loose) indicate that many model “successes” arise from shallow lexical alignment, not systematic rule-following.
- LLMs often fail to segment and analyze morphologically complex forms required by Camlang’s grammar, neglecting phonological alternations, affix sequences, or hierarchical morphological composition.
- Explicit search (tool-based lookup) does not improve model performance due to the non-trivial morphophonological transformations between surface forms and dictionary entries.

This evidence indicates that, even with explicit grammatical instructions, current LLMs do not internalize and generalize rules at a level comparable to humans.
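The difference between answer-level accuracy and trace-level verification can be made concrete with a short sketch. It assumes, purely for illustration, that strict human verification is recorded as one boolean per item; the record layout and the sample items are hypothetical and do not reproduce the paper's data.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Item:
    gold: str             # gold option label
    predicted: str        # option label chosen by the model or participant
    trace_verified: bool  # assumed flag: a human judged the reasoning trace to pass the strict check


def exact_match(items: List[Item]) -> float:
    """Share of items whose predicted label equals the gold label."""
    return sum(i.predicted == i.gold for i in items) / len(items)


def verified_accuracy(items: List[Item]) -> float:
    """Share of items that are both correct and backed by a verified trace."""
    return sum(i.predicted == i.gold and i.trace_verified for i in items) / len(items)


# Hypothetical records, for illustration only.
items = [
    Item(gold="A", predicted="A", trace_verified=True),
    Item(gold="B", predicted="B", trace_verified=False),  # right answer, shallow trace
    Item(gold="C", predicted="D", trace_verified=False),
    Item(gold="E", predicted="E", trace_verified=False),  # right answer, shallow trace
]

print(exact_match(items))        # 0.75
print(verified_accuracy(items))  # 0.25
```

The gap between the two numbers is the portion of "successes" that would be attributed to shallow alignment rather than rule-following.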

4. Implications for Metalinguistic Reasoning in LLMs

Camlang’s findings highlight core limitations in LLM architectures regarding explicit grammatical acquisition and deductive linguistic reasoning.

Pattern Recognition versus Rule Generalization: LLMs excel when inference can exploit superficial English patterns but falter in the face of rule-driven, unfamiliar systems requiring hierarchical parsing and novel feature integration.
Resource Integration Deficit: Despite emergent metalinguistic awareness in advanced models (e.g., occasional accurate application of single rules), there is no evidence of internalization or combinatorial mastery of grammatical systems as observed in human learners.
Shallow Reasoning Traces: Human verification shows that even when LLMs arrive at the correct answer, their intermediate traces seldom meet the SHV criterion for full deductive reasoning.

This suggests that LLMs rely predominantly on statistical matching rather than systematic, on-the-fly construction of grammatical models.

5. Diagnostic Value and Future Extensions

Camlang sets a new standard for evaluating metalinguistic reasoning in artificial systems and offers a foundation for further research.

Task Diversification: The framework is intended to expand beyond QA into translation, parsing, and formal grammaticality judgment, further dissecting specific linguistic challenges encountered by models.
Diagnostic Error Studies: Fine-grained analyses can identify bottlenecks in morphological segmentation, syntactic structure mapping, and resource retrieval, suggesting directions for targeted improvement.
Research Platform: Camlang’s explicit resource format enables controlled studies of learning curves, generalization, and the boundary between statistical induction and deductive reasoning in both machine and human learners.
Broader Cognitive Implications: The paradigm supports cross-disciplinary inquiries into language acquisition, metalinguistic awareness, and the computational modeling of explicit grammar learning.

6. Significance for LLM Development

By revealing a gap between surface-level competence and systematic grammatical mastery, Camlang provides critical evidence concerning the current limitations of LLMs.

Benchmarks and Modeling Advances: The performance collapse on Camlang benchmarks demonstrates the need for architectures and training regimes capable of explicit rule acquisition, robust hierarchical parsing, and integration of externally specified linguistic resources.
Tool Use Strategies: Improved retrieval (e.g., lemmatization and canonical-form detection) and better resource mapping could partially mitigate current deficits, but comprehensive mastery may require fundamentally novel approaches.
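A toy sketch shows why exact-string lookup fails on inflected forms and how a lemmatizing retrieval step could help. Everything language-specific here is hypothetical: the suffixes, the consonant alternation, and the dictionary entry are invented for illustration and are not Camlang's actual rules or vocabulary.

```python
from typing import Dict, List, Optional, Set

SUFFIXES = ["lar", "da"]         # hypothetical suffixes
FINAL_ALTERNATIONS = {"d": "t"}  # hypothetical: surface stem-final consonant -> citation consonant


def candidate_lemmas(surface: str) -> List[str]:
    """Generate candidate citation forms by stripping suffixes and undoing alternations."""
    seen: Set[str] = set()
    ordered: List[str] = []
    queue = [surface]
    while queue:
        form = queue.pop(0)
        if not form or form in seen:
            continue
        seen.add(form)
        ordered.append(form)
        for suffix in SUFFIXES:
            if form.endswith(suffix) and len(form) > len(suffix):
                queue.append(form[:-len(suffix)])
        if form[-1] in FINAL_ALTERNATIONS:
            queue.append(form[:-1] + FINAL_ALTERNATIONS[form[-1]])
    return ordered


def lookup(surface: str, dictionary: Dict[str, str]) -> Optional[str]:
    """Return the gloss of the first candidate lemma found in the dictionary."""
    for candidate in candidate_lemmas(surface):
        if candidate in dictionary:
            return dictionary[candidate]
    return None


toy_dictionary = {"kat": "house"}  # invented entry, not a real Camlang word

print(toy_dictionary.get("kadlarda"))      # None: exact-match search misses the inflected form
print(lookup("kadlarda", toy_dictionary))  # house: found via kadlarda -> kadlar -> kad -> kat
```

Even in this toy case, retrieval only succeeds once the affix order and the alternation are both undone, which is the kind of hierarchical analysis the grammar book requires of the solver.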

Camlang’s cognitively grounded, resource-intensive evaluation paradigm differentiates true linguistic rule learning from pattern matching and is positioned to guide the next generation of metalinguistically competent LLMs.

Table 1: Performance Comparison on Camlang-CSQA-v0

Model	EM Accuracy	Reasoning Trace (SHV)
GPT-5	~47%	Rarely satisfied
Humans	~87%	Consistently satisfied
Others	21–40%	Rarely satisfied

This table underscores the quantitative performance gap and qualitative difference in reasoning between current LLMs and human learners on explicit rule-based Camlang tasks.

In summary, Camlang constitutes a rigorous, typologically plausible, yet novel linguistic challenge specifically calibrated to assess and diagnose explicit, deductive metalinguistic reasoning in LLMs. The substantial deficit observed across all current models, in contrast with human performance, marks a crucial direction for advancing computational models of language understanding and acquisition.


References

The Gold Medals in an Empty Room: Diagnosing Metalinguistic Reasoning in LLMs with Camlang (2025)
