Mentalese-Style Compressed Reasoning
- Mentalese-style compressed reasoning is a framework that transforms detailed, multi-step reasoning into minimal, abstract tokens mirroring a 'language of thought.'
- It employs symbolic encoding, latent vector compression, and reinforcement learning to achieve up to 16x token reduction while retaining 90–98% of baseline accuracy.
- These methods reduce computational latency and memory footprints, offering robust implications for advanced reasoning models and cognitive architectures.
Mentalese-style compressed reasoning refers to computational and architectural approaches that realize internal, abstract, and highly compressed representations of multi-step reasoning, analogous to the hypothesized “language of thought” (mentalese) in cognitive science. These approaches seek to replace verbose, discrete chain-of-thought reasoning with ultra-compact, symbolic or latent “thought tokens,” substantially reducing computational burden and decoding latency in high-capacity reasoning models while preserving core logical structure and accuracy.
1. Foundations: Mentalese and Compression Principles
The “mentalese” hypothesis posits that human cognition operates over an internal, language-like substrate whose elements—abstract operators, predicates, and arguments—encode reasoning at a higher density and with greater cognitive economy than overt natural language. In formal reasoning models, this concept is operationalized via methods that compress explicit chain-of-thought traces into:
- symbolic Mentalese tokens providing compositional operator/argument pairs (Tanmay et al., 28 Nov 2025),
- continuous latent vectors capturing semantic and logical content (Cheng et al., 17 Dec 2024, Kuzina et al., 2 Oct 2025, Tan et al., 22 May 2025),
- minimal conditional schemas exploiting supervenience for concise representation (Sileno, 2019).
Compression in this context entails both length reduction (brevity) and retention of critical inference steps (sufficiency) (Cheng et al., 17 Jun 2025). These aims align with the principle of cognitive economy: a compressed representation is just as expressive and correct as its verbose counterpart, but encoded in a minimal, lossless (or tolerably lossy) format.
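These twin aims can be captured in a toy scoring rule (a minimal sketch for illustration, not a metric from the cited works): brevity measures the fraction of tokens saved, while sufficiency gates the score on whether the compressed trace still supports the answer.

```python
def compression_score(orig_len, comp_len, keeps_answer):
    """Brevity-with-sufficiency score (illustrative): fraction of tokens
    saved, zeroed out when compression loses the inference steps needed
    to reach the answer."""
    if not keeps_answer:
        return 0.0  # a shorter but insufficient trace earns nothing
    return 1.0 - comp_len / orig_len

print(compression_score(160, 10, True))   # 16x compression -> 0.9375
print(compression_score(160, 10, False))  # lossy beyond tolerance -> 0.0
```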
2. Symbolic Mentalese Encoding and Discrete Compression
Symbolic approaches formalize Mentalese as a set of operator tokens (e.g., SET, EQ, CASE, SOLVE, ANS) and a family of compositional expressions over variables, constants, and functions. Reasoning traces are ultra-compressed into strings of such operator/argument expressions (e.g., EQ:$2x+5=17$;) (Tanmay et al., 28 Nov 2025).
The mapping $f:\ \text{natural-language CoT} \to \text{Mentalese}$ collapses multi-line natural language into a concise Mentalese chain, typically compressing token count by factors of 4x–16x on mathematically intensive benchmarks. These transformations retain semantic equivalence without verbose connectors and facilitate efficient, modular separation of reasoning phases.
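As a concrete illustration, a rule-based compressor in this spirit might map verbose chain-of-thought lines onto operator/argument tokens. The regex patterns and the SET/EQ/SOLVE/ANS emission rules below are hypothetical stand-ins for the learned mapping, not the paper's actual schema.

```python
import re

# Illustrative operator vocabulary inspired by the Mentalese tokens above;
# the patterns and token names are hypothetical, not the paper's schema.
RULES = [
    (re.compile(r"let (\w+) (?:be|=) (.+?)\.?$", re.I),
     lambda m: f"SET:{m.group(1)}={m.group(2)};"),
    (re.compile(r"(?:so|then|we get) (.+?=.+?)\.?$", re.I),
     lambda m: f"EQ:{m.group(1)};"),
    (re.compile(r"(?:solving|solve) for (\w+)", re.I),
     lambda m: f"SOLVE:{m.group(1)};"),
    (re.compile(r"the answer is (.+?)\.?$", re.I),
     lambda m: f"ANS:{m.group(1)};"),
]

def compress_to_mentalese(cot_lines):
    """Map verbose chain-of-thought lines to compact operator/argument tokens."""
    out = []
    for line in cot_lines:
        for pattern, emit in RULES:
            m = pattern.search(line)
            if m:
                out.append(emit(m))
                break  # one token per reasoning step; connectors are dropped
    return "".join(out)

trace = [
    "Let x be the unknown number.",
    "Then 2x+5=17.",
    "Solving for x gives x=6.",
    "The answer is 6.",
]
print(compress_to_mentalese(trace))
```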
Preference for shorter correct traces is enforced via Shorter Length Preference Optimization (SLPO): a reinforcement learning scheme that rewards minimal Mentalese token length when semantic correctness is maintained, trading compression against the risk of under-reasoning. This mechanism achieves dramatic gains in latency, inference cost, and reasoning clarity, while sustaining up to 98% of baseline accuracy (Tanmay et al., 28 Nov 2025).
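The reward shape SLPO relies on can be sketched as a length-penalized correctness reward. This is an illustrative form, assuming a fixed token budget `max_tokens` and brevity weight `alpha`; the paper's exact objective may differ.

```python
def slpo_reward(is_correct, n_tokens, max_tokens=64, alpha=0.5):
    """Length-penalized reward in the spirit of SLPO (illustrative form):
    correctness gates the reward, and shorter correct traces earn a
    larger brevity bonus, trading compression against under-reasoning."""
    if not is_correct:
        return 0.0  # no credit for compressed-but-wrong reasoning
    brevity = max(0.0, 1.0 - n_tokens / max_tokens)
    return 1.0 + alpha * brevity

# A shorter correct trace outranks a longer correct one; wrong traces get nothing.
print(slpo_reward(True, 8), slpo_reward(True, 48), slpo_reward(False, 8))
```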
3. Latent Compressed Reasoning: Continuous Approaches
Latent methods compress reasoning into continuous high-dimensional vectors (“thought tokens”) carrying the abstract content of multi-step inference in dense form. Compressed Chain-of-Thought (CCoT) replaces explicit token-level CoT sequences with $k$ autoregressively generated contemplation vectors $z_1,\dots,z_k$, where $k \ll m$ (the chain-of-thought length) (Cheng et al., 17 Dec 2024).
The process jointly optimizes answer accuracy and a compression loss for best semantic approximation, $\mathcal{L} = \mathcal{L}_{\text{answer}} + \lambda\,\mathcal{L}_{\text{comp}}$, with $\mathcal{L}_{\text{comp}}$ enforcing that CCoT vectors match full CoT state representations.
Integration with off-the-shelf decoder LMs is achieved by injecting compressed vectors into each Transformer layer’s key/value cache. Reasoning capacity (and the trade-off between latency and accuracy) is directly adjustable via the compression ratio $r$ and vector count $k$, enabling dynamic efficiency control.
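A minimal numerical sketch of the compression loss follows, assuming each contemplation vector should match the mean of the CoT hidden-state segment it summarizes. The real CCoT module is a trained autoregressive head over the LM's hidden states; the names and the mean-matching target here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 16, 40          # hidden size; full chain-of-thought length in tokens
r = 0.2                # compression ratio; k = ceil(r * m) contemplation vectors
k = int(np.ceil(r * m))

# Stand-ins for learned components (hypothetical): teacher CoT hidden states
# and the student's compressed "thought" vectors.
full_cot_states = rng.standard_normal((m, d))
contemplation = rng.standard_normal((k, d))

def compression_loss(z, h):
    """MSE between each contemplation vector and the mean of the CoT-state
    segment it summarizes -- one way to 'match full CoT state representations'."""
    segments = np.array_split(h, len(z))        # k contiguous segments of states
    targets = np.stack([s.mean(axis=0) for s in segments])
    return float(((z - targets) ** 2).mean())

loss = compression_loss(contemplation, full_cot_states)
print(f"k={k} vectors for m={m} CoT tokens, L_comp={loss:.3f}")
```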
Variants such as KaVa distill the compressed teacher model’s KV-cache directly into continuous latents for the student, aligning every step, head, and layer across reasoning (Kuzina et al., 2 Oct 2025). This yields near-CoT fidelity with sharply reduced computational resource footprints.
Frameworks such as CoLaR perform dynamic latent compression at inference by merging runs of consecutive token embeddings according to a user-specified compression factor (Tan et al., 22 May 2025). Probabilistic latent heads predict the evolution of the compressed chain, optimizing for both next-token and next-embedding objectives under reinforcement learning.
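The embedding-merging step can be sketched as follows, with uniform averaging standing in for CoLaR's learned merge operation.

```python
def colar_merge(embeddings, factor):
    """Merge each run of `factor` consecutive token embeddings by averaging --
    a minimal sketch of CoLaR-style dynamic latent compression (the actual
    merge is learned; uniform averaging is an illustrative stand-in)."""
    merged = []
    for i in range(0, len(embeddings), factor):
        chunk = embeddings[i:i + factor]           # last chunk may be shorter
        merged.append([sum(col) / len(chunk) for col in zip(*chunk)])
    return merged

emb = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0], [6.0, 7.0], [8.0, 9.0], [10.0, 11.0]]
latents = colar_merge(emb, factor=3)   # 6 token embeddings -> 2 latent embeddings
print(latents)  # [[2.0, 3.0], [8.0, 9.0]]
```

Raising the factor at inference time trades reasoning granularity for speed, which is the dynamic efficiency control the framework exposes.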
4. Step-Level and Conditional Compression Strategies
Fine-grained compression mechanisms estimate the importance or redundancy of each reasoning step, enabling policies that allocate length sparingly to critical steps and prune uninformative or repetitive segments.
SmartThinker implements online importance estimation: steps are ablated one-by-one, measuring their marginal effect on answer probability, and then length is adjusted accordingly (He et al., 6 Jul 2025). This produces compressed reasoning chains that preserve key inference acts (“mentalese skeletons”), leading to about 43–45% reduction in tokens with improvements in correctness and stability on multiple benchmarks.
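The leave-one-out ablation loop can be sketched as follows; `toy_prob` is a hypothetical stand-in for an actual answer-probability query against the model.

```python
def step_importance(steps, answer_prob):
    """Leave-one-out importance: drop each step and measure the fall in
    the model's answer probability. `answer_prob` stands in for a scoring
    call into the LM; here it just illustrates the ablation loop."""
    base = answer_prob(steps)
    return [base - answer_prob(steps[:i] + steps[i + 1:])
            for i in range(len(steps))]

# Toy scorer (hypothetical): the answer is likely only if both key steps survive.
def toy_prob(steps):
    return 0.9 if {"EQ", "SOLVE"} <= set(steps) else 0.2

labels = ["SET", "EQ", "restate", "SOLVE"]
scores = step_importance(labels, toy_prob)
# Pruning policy: keep steps whose removal hurts; drop the redundant ones.
kept = [s for s, imp in zip(labels, scores) if imp > 0]
print(kept)
```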
LC-R1 targets invalid thinking by detecting and removing post-answer verification steps, enforcing termination at the first correct answer (Cheng et al., 17 Jun 2025). The Valid Thinking Rate (VT) metric quantifies compression efficacy, with VT rates exceeding 97% under LC-R1 policy optimization, reducing redundant tokens by 46–52% with only 2% accuracy loss.
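A simplified version of the Valid Thinking Rate can be computed by locating the first correct answer in a tagged trace; the tagging scheme below is an illustrative simplification of the paper's metric.

```python
def valid_thinking_rate(token_tags):
    """Valid Thinking Rate (simplified): fraction of reasoning tokens emitted
    up to and including the first correct answer; everything after it counts
    as 'invalid thinking' (redundant post-answer verification)."""
    if "ANSWER" not in token_tags:
        return 0.0
    cut = token_tags.index("ANSWER") + 1
    return cut / len(token_tags)

# 8 of 12 tokens precede or include the first answer; the rest are redundant checks.
trace = ["step"] * 7 + ["ANSWER"] + ["verify"] * 4
print(f"VT = {valid_thinking_rate(trace):.2f}")
```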
Entropy-Guided Compression frameworks identify an intrinsic conflict between compression (which drives entropy down for shorter chains) and accuracy-based training (which drives entropy up, expanding chain length with exploratory connector tokens) (Zhu et al., 18 Nov 2025). By alternating compression (entropy-descending) and exploration (entropy-ascending) phases, these models bundle subroutines into atomic compressed operations, sculpting a symbolic skeleton that mirrors mentalese primitives.
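The entropy signal these phases push against can be measured directly on token traces. The example below uses Shannon entropy of the empirical token distribution, which is lower for a compact operator skeleton than for a connector-heavy verbose trace; the traces themselves are toy examples.

```python
import math
from collections import Counter

def token_entropy(tokens):
    """Shannon entropy (bits) of the empirical token distribution. The
    compression phase of entropy-guided training drives this down; the
    exploration phase drives it back up (illustrative measurement only)."""
    counts = Counter(tokens)
    n = len(tokens)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

verbose = ["so", "then", "we", "have", "x", "=", "6", "therefore", "answer", "6"]
compressed = ["EQ", "x=6", "ANS", "6"]
print(token_entropy(verbose) > token_entropy(compressed))  # True
```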
5. Semantic Compression and Hidden Thought Architectures
Hidden Chain-of-Thought (HCoT) architectures implement semantic compression by mapping entire reasoning steps into singular token-embeddings ([CoT] tokens) via auxiliary models optimized with contrastive objectives (Liu et al., 13 Sep 2024). Each [CoT] embedding encodes the full content of a reasoning segment, permitting fast, modular decoding conditioned on these latent representations.
HCoT models achieve competitive or superior accuracy compared to full CoT methods while improving wall-clock inference speed (up to 2.8x) and drastically reducing token outputs. The approach is fully compatible with hierarchical or multi-token mentalese representations, and supports modular reasoning phases for planning, code synthesis, and agent invocation.
6. Theoretical Foundations: Compression, Supervenience, and Cognitive Economy
Compression is formally tied to supervenience: a reasoning schema $B$ compresses a schema $A$ if $B$ supervenes on $A$ and $B$'s descriptor length is strictly less than $A$'s (Sileno, 2019). For logical conditionals, this necessitates closure operations on the antecedent or consequent to restore supervenience, thus enabling compression. Cognitive economy dictates that concepts should only be introduced if they reduce the representational burden—compressed chains meet this principle, operationalizing the ‘language of thought’ in practice.
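The supervenience-plus-brevity condition can be sketched as a predicate. Both the subset-based entailment check and the condition-set encoding below are toy stand-ins for the formal apparatus in the cited work.

```python
def compresses(schema_a, schema_b, entails):
    """Schema B compresses schema A if B supervenes on A (approximated here
    by A entailing B) and B's descriptor is strictly shorter. `entails` is
    a stand-in for a real logical entailment check."""
    return entails(schema_a, schema_b) and len(schema_b) < len(schema_a)

# Toy entailment over comma-separated condition sets: A entails B when B's
# conditions are a subset of A's (illustrative only).
def subset_entails(a, b):
    return set(b.split(",")) <= set(a.split(","))

verbose = "bird,has_wings,can_fly,lays_eggs"
concise = "bird,can_fly"
print(compresses(verbose, concise, subset_entails))  # True
```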
Recursive data compression, as proposed in grounded AGI frameworks, realizes adaptive inductive bias and simulation-based reasoning (Franz, 2015). By iteratively compressing model representations and simulating grounded inference in a latent “mental stage,” these architectures avoid hand-coded symbolic knowledge bases in favor of dynamically learned, compressed hypotheses.
7. Empirical Performance and Practical Impact
Mentalese-style compressed reasoning delivers tangible computational and operational improvements:
| Framework | Compression Rate | Latency Reduction | Accuracy Retention |
|---|---|---|---|
| ORION (Tanmay et al., 28 Nov 2025) | 4–16x tokens | 5x | 90–98% of baseline (DeepSeekR1) |
| CCoT (Cheng et al., 17 Dec 2024) | up to 20x tokens | 16x | Significant, plateauing above r=0.2 |
| KaVa (Kuzina et al., 2 Oct 2025) | 60–90% KV-cache | 76% cost reduction | Δ ≈ 4–8 pp vs CoT |
| LC-R1 (Cheng et al., 17 Jun 2025) | 46–52% tokens | — | 2% acc drop |
| HCoT (Liu et al., 13 Sep 2024) | Up to 4.2x tokens | 1.6–2.8x speedup | Matches or exceeds full CoT |
These methodologies demonstrate dramatic reductions in token count, decoding time, and memory overhead, while safeguarding or enhancing robustness and correctness across mathematical, logical, agentic, and question-answering benchmarks, including GSM8K, OlympiadBench, AMC, and Minerva-MATH.
8. Limitations, Challenges, and Future Directions
Limitations include verifier oracle inaccuracies (risking over-compression and correctness loss), the inability of hard token length caps to scale to highly complex proofs, potential for lossy semantic compression, and sensitivity to hyperparameter selection for brevity–accuracy trade-off. Ongoing research focuses on multi-domain generalization (e.g., code, planning), adaptive operator schemas, joint training of correctness verifiers, integration of symbolic and latent mentalese, meta-learning of compression parameters per problem instance, and deep alignment with human cognitive sketching.
Mentalese-style compressed reasoning, as articulated in these frameworks, stands as a unifying principle for the efficient, abstract, and robust implementation of high-level cognition in large reasoning models—transforming verbose surface reasoning into minimal, latent, and interpretable chains while maintaining expert-level performance.