
On the Thinking-Language Modeling Gap in Large Language Models (2505.12896v1)

Published 19 May 2025 in cs.CL, cs.LG, and stat.ML

Abstract: System 2 reasoning is one of the defining characteristics of intelligence, which requires slow and logical thinking. Humans conduct System 2 reasoning via the language of thoughts that organizes the reasoning process as a causal sequence of mental language, or thoughts. Recently, it has been observed that System 2 reasoning can be elicited from LLMs pre-trained on large-scale natural language. However, in this work, we show that there is a significant gap between the modeling of languages and thoughts. Because language is primarily a tool for humans to share knowledge and thinking, modeling human language can easily cause LLMs to absorb language biases that deviate from the chain of thoughts in the mind. Furthermore, we show that these biases mislead the eliciting of "thoughts" in LLMs to focus only on a biased part of the premise. To this end, we propose a new prompting technique termed Language-of-Thoughts (LoT) to demonstrate and alleviate this gap. Instead of directly eliciting the chain of thoughts from partial information, LoT instructs LLMs to adjust the order and tokens used to express all the relevant information. We show that this simple strategy significantly reduces language-modeling biases in LLMs and improves their performance across a variety of reasoning tasks.

Summary

  • The paper identifies a language-thought gap, demonstrating that LLMs can inherit biases from natural language that impair logical reasoning.
  • It proposes the LoT prompting method, which guides models to observe, expand, and echo relevant information to reduce reasoning errors.
  • Experimental results on benchmarks like WinoBias and BBQ show that LoT prompting improves accuracy and reduces bias compared to standard chain-of-thought techniques.

This paper, "On the Thinking-LLMing Gap in LLMs" (2505.12896), investigates how the expression of language influences the reasoning capabilities of LLMs, particularly in tasks requiring System 2 (slow, logical) thinking. It posits that LLMs, trained on natural language, can inherit biases from human language expressions that deviate from the underlying "chain of thoughts," leading to flawed reasoning. The authors propose a novel prompting technique, termed LoT (Language of Thought), to mitigate this gap.

Core Problem: The Language-Thought Gap

The research identifies a "thinking-language modeling gap": while humans use a "language of thoughts" for reasoning, LLMs learn from written language, which is primarily a tool for communication. This can lead to LLMs absorbing biases present in how humans express information, rather than modeling the underlying thought processes.

The paper formalizes this using Structural Causal Models (SCMs). It assumes observed tokens (language) are generated from latent variables representing human thoughts. Two key issues arise:

  1. Language-Modeling Bias (Proposition 3.1): If training data presents information in an anti-topological order relative to the causal structure of thoughts (e.g., a conclusion appearing before all of its premises), the LLM, due to its next-token prediction objective, learns to make inferences from incomplete information. The paper illustrates this with a two-premise QA example ($C_1 \rightarrow A \leftarrow C_2$). If the language order is $(L_{C_1}, L_A, L_{C_2})$, the LLM might predict $L_A$ based only on $L_{C_1}$, marginalizing out $C_2$: $\Pr(L_A \mid L_1) = \sum_{C_1, C_2, A} \Pr(C_1 \mid L_1) \cdot \underbrace{\Pr(C_2)}_{\text{Bias}} \cdot \Pr(A \mid C_1, C_2) \cdot \Pr(L_A \mid A, L_1)$. A toy numeric sketch of this marginalization follows this list.
  2. Language-Thought Gap at Inference (Theorem 3.2): Even if an LLM learns the correct relationships between latent thoughts, it can still exhibit biased reasoning when the premises are expressed implicitly. The paper characterizes implicitness in two ways:
    • L-explicitness: measures local understanding; a premise is L-implicit when the LLM struggles to recover the meaning of $C_i$ from its expression $L_i$ alone (i.e., $\Psi(C_i = c^*_i \mid L_i)$ is small).
    • q-explicitness: measures global, contextual understanding; a premise is q-implicit when the LLM struggles to recover $C_i$ from its expression $L_i$ together with the preceding context $q_i$ (i.e., $\Psi(C_i = c^*_i \mid q_i, L_i)$ is small). Theorem 3.2 provides a lower bound on the reasoning error (a KL divergence), showing that it increases as the LLM's ability to correctly interpret the premises from their linguistic expressions decreases (i.e., as $1 - \Psi(c^*_1, c^*_2 \mid L_1, L_2)$ increases).
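
To make the marginalization in Proposition 3.1 concrete, below is a minimal numeric sketch (a toy construction, not taken from the paper's experiments): when the answer token must be emitted right after the first premise, the model effectively averages over $C_2$ under its prior rather than conditioning on its stated value.

```python
# Toy illustration of the language-modeling bias in Proposition 3.1:
# with premises C1 -> A <- C2 and language order (L_C1, L_A, L_C2),
# next-token prediction of L_A conditions only on L_C1, so C2 is
# marginalized under its prior instead of its observed value.
p_c2 = 0.3  # assumed prior Pr(C2 = 1) in this toy world

def p_a_given(c1: int, c2: int) -> float:
    """Pr(A = 1 | C1, C2) for the toy structure 'A holds iff both premises hold'."""
    return 1.0 if (c1 == 1 and c2 == 1) else 0.0

c1_obs, c2_obs = 1, 0  # both premises are actually stated in the text

# Full reasoning over both premises: A is ruled out because C2 = 0.
p_full = p_a_given(c1_obs, c2_obs)

# Biased prediction: L_A is generated right after L_C1, so C2 is summed out under its prior.
p_biased = sum(
    p_a_given(c1_obs, c2) * (p_c2 if c2 == 1 else 1.0 - p_c2)
    for c2 in (0, 1)
)

print(f"Pr(A | C1, C2) = {p_full:.2f}  vs  biased Pr(A | C1 only) = {p_biased:.2f}")
# -> 0.00 vs 0.30: the premature answer assigns probability to A that the full premise set excludes.
```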

Proposed Solution: Language-of-Thoughts (LoT) Prompting

To address these issues, the paper introduces a prompt-level intervention called LoT. The core idea is to guide the LLM to "observe, expand, and echo all the relevant information." This is designed to reduce the term $1 - \Psi(c^*_1, \dots, c^*_i \mid L_1, \dots, L_i)$ in Theorem 3.2, improving the LLM's understanding of the premises.

The LoT prompt has specific components targeting different types of implicitness:

  • Echo Intervention (for q-implicitness): Encourages the LLM to identify and reiterate key information relevant to the task.

    ```
    (Think step by step.) Let's **observe** and **echo** all the relevant information.
    ```

  • Expanding Intervention (for L-implicitness): Prompts the LLM to rephrase or generate alternative expressions for the given information, potentially finding more explicit forms.

    ```
    (Think step by step.) Let's **observe** and **expand** all the relevant information.
    ```

The full LoT method combines these interventions. Two variants are evaluated:

  • LoT_1:

    ```
    Please **expand** all the relevant information, and **echo** them based on the question
    ```

  • LoT_2 (main proposed method):

    ```
    Please **observe**, **expand**, and **echo** all the relevant information based on the question
    ```

These prompts are intended to be combined with standard reasoning prompts like "Let's think step by step" (Chain-of-Thought, CoT); a small helper collecting the variants is sketched below.
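
The following is a minimal sketch (not the authors' released code) that gathers the prompt suffixes quoted above so the Echo, Expand, LoT_1, and LoT_2 variants can be swapped against plain CoT; the function name `build_prompt` is illustrative.

```python
# Prompt suffixes quoted from the paper's interventions; the helper itself is illustrative.
LOT_SUFFIXES = {
    "cot": "Let's think step by step.",
    "echo": "(Think step by step.) Let's **observe** and **echo** all the relevant information.",
    "expand": "(Think step by step.) Let's **observe** and **expand** all the relevant information.",
    "lot_1": "Please **expand** all the relevant information, and **echo** them based on the question",
    "lot_2": "Please **observe**, **expand**, and **echo** all the relevant information based on the question",
}

def build_prompt(base_prompt: str, variant: str = "lot_2", add_cot: bool = True) -> str:
    """Append the chosen instruction to a task prompt, optionally followed by CoT."""
    suffix = LOT_SUFFIXES[variant]
    if add_cot and variant in ("lot_1", "lot_2"):
        suffix += ". Then, let's think step by step."
    return f"{base_prompt}\n{suffix}"
```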

Experimental Verification and Implementation Insights

1. WinoControl Dataset:

To test the hypotheses, the authors created the WinoControl dataset, derived from WinoBias. This dataset allows for controlled levels of L-explicitness (by adding clarifying or no sentences about the correct/wrong answer) and q-explicitness (by adding irrelevant distracting sentences).

  • Implementation: For L-explicitness, three levels were designed: (0) add a sentence excluding the wrong answer, (1) add a sentence suggesting the correct answer is possible, (2) no additional sentence. For q-explicitness: (0) no distracting sentences, (1) two distracting sentences, (2) more distracting sentences. A sketch of this construction follows the list below.
  • Results:
    • Standard CoT accuracy decreased as L-implicitness or q-implicitness increased, supporting Theorem 3.2.
    • The Echo intervention improved performance more when q-implicitness was high.
    • The Expand intervention was more effective when L-implicitness was high.
    • These improvements were not strongly correlated with increased token cost; Echo sometimes used fewer tokens than CoT.
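
A rough sketch of how such a controlled item could be assembled is shown below; the clarifying and distracting sentences are illustrative placeholders, not the paper's actual templates.

```python
# Hedged sketch of a WinoControl-style construction from a WinoBias-like sentence.
# The template sentences below are illustrative placeholders, not the paper's data.
def build_wino_control(base_sentence: str, correct: str, wrong: str,
                       l_level: int = 2, q_level: int = 0) -> str:
    """Compose an item with controlled L-implicitness (clarifying sentence)
    and q-implicitness (number of irrelevant distractor sentences)."""
    clarifiers = {
        0: f" It could not be the {wrong}.",   # level 0: exclude the wrong answer
        1: f" It could be the {correct}.",     # level 1: hint that the correct answer is possible
        2: "",                                 # level 2: no additional sentence
    }
    distractor = " The weather that day was mild and the office was busy."
    return base_sentence + clarifiers[l_level] + distractor * (2 * q_level)

# Example: medium L-implicitness, maximum q-implicitness.
item = build_wino_control(
    "The developer argued with the designer because [pronoun] did not like the design.",
    correct="developer", wrong="designer", l_level=1, q_level=2,
)
```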

2. Designed Benchmarks (WinoBias, BBQ, Alice):

LoT was tested against baselines (Direct, CoT, RaR, LtM) on:

  • WinoBias (Social Bias): Measures consistency and accuracy for pronoun resolution.
    • LoT_2 generally achieved the best or second-best performance, particularly in reducing bias (Delta) and improving consistency (Con).
    • The Expand component was generally more effective than Echo, suggesting L-implicitness is a key challenge.
  • BBQ (Bias in Question Answering): Assesses bias in QA across categories like Age, Nationality, Religion.
    • LoT_2 outperformed baselines in 11 out of 12 cases.
    • The Echo component was significantly better than Expand, indicating strong q-implicitness (contextual confusion).
  • Alice Benchmark (Math/Logic Puzzle): Simple math problems prone to heuristic errors.
    • LoT methods significantly improved performance over CoT, especially for GPT-4o-mini (0.5% to 8.5%) and Qwen (9.0% to 52.5%).
    • The Expand component was significantly better, suggesting L-implicitness (difficulty understanding the core problem statement).

A case study (Figure 7) illustrates how Echo might fail due to L-implicitness (misinterpreting a premise) and Expand might fail due to q-implicitness (getting misled by irrelevant context). The combined LoT prompt shows mutual benefits.

3. General Reasoning Benchmarks:

LoT was evaluated on 8 challenging reasoning tasks (GPQA, FOLIO, CSQA, MUSR, MUSIQUE, LSAT, Abductive/Deductive reasoning from ContextHub) where CoT sometimes underperforms direct prompting. Six LLMs were used (GPT-4o-mini, Llama-3.1-70B/8B, Mistral-7B-Instruct-v0.3, Claude-3-Haiku, Qwen2-72B-Instruct).

  • Implementation: For smaller or less instruction-following LLMs, the authors used markdown bolding (**observe**, **expand**, and **echo**) to emphasize the instructions.
  • Results:
    • LoT generally provided consistent and significant improvements over CoT (e.g., up to 20% in GPQA).
    • Larger, more instruction-capable LLMs (Llama-3.1-70B, Qwen2-72B) showed greater improvements. Smaller LLMs (Llama-3.1-8B, Mistral-7B) showed less consistent gains, possibly due to difficulties in following the more complex LoT instructions.
    • On some tasks like LSAT, LoT did not always improve or slightly decreased performance, suggesting that prompting alone may not fully bridge the language-thought gap, and expansion could occasionally exacerbate biases.

Practical Applications and Considerations

  • Ease of Implementation: LoT is a prompt-level intervention, requiring no model retraining. It can be readily integrated into existing LLM application workflows.
    ```python
    # Example of integrating LoT with a base prompt
    user_question = "Alice has N brothers and she also has M sisters. How many sisters does Alice's brother have?"
    base_prompt = f"""
    Context: {user_question}
    Question: Based on the context, provide the answer.
    Options: ...
    """

    # Standard CoT
    cot_prompt = base_prompt + "\nLet's think step by step."

    # LoT_2 with CoT
    lot_prompt = base_prompt + "\nPlease **observe**, **expand**, and **echo** all the relevant information based on the question. Then, let's think step by step."

    # response = call_LLM_api(lot_prompt)
    ```
  • Combines with Existing Methods: LoT is designed to be used alongside other reasoning techniques like CoT.
  • Debiasing and Robustness: The method shows promise in reducing biases (WinoBias, BBQ) by encouraging a more thorough examination of premises and context, rather than relying on superficial cues or learned statistical biases.
  • Improving Complex Reasoning: By forcing the LLM to break down and re-evaluate the input information, LoT can help in tasks where understanding subtle details or overcoming misleading phrasing is crucial.
  • Computational Cost: While LoT generally does not add prohibitive token costs, the "expand" step can lengthen intermediate outputs; the impact on latency and cost should be monitored for specific applications (a small measurement sketch follows this list).
  • Model Capability: The effectiveness of LoT, particularly its more complex instructions, is higher with more capable, instruction-following LLMs. For smaller models, simpler variants or careful tuning of the LoT prompt might be necessary.
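
As a rough way to track that overhead, the sketch below uses tiktoken's `cl100k_base` encoding purely as an approximate counter; `cot_response` and `lot_response` are placeholders for outputs already obtained from a model.

```python
# Approximate token-overhead check for LoT vs. CoT outputs (illustrative, not from the paper).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # approximate tokenizer, not tied to any one model

def token_count(text: str) -> int:
    """Approximate number of tokens in a model response."""
    return len(enc.encode(text))

cot_response = "..."  # placeholder: response to the CoT prompt
lot_response = "..."  # placeholder: response to the LoT_2 prompt

overhead = token_count(lot_response) - token_count(cot_response)
print(f"LoT adds {overhead} output tokens over CoT for this query")
```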

Conclusion

The paper provides a formalization of how language expression can lead to biases in LLM reasoning, identifying a "language-thought gap." It introduces the LoT prompting technique, which encourages LLMs to "observe, expand, and echo" relevant information, thereby improving their understanding of the input and reducing biases. Extensive experiments demonstrate LoT's effectiveness across various reasoning tasks and LLMs, particularly in mitigating issues arising from implicit language. This research highlights the importance of addressing how information is presented to LLMs and offers a practical, prompt-based method to improve their reasoning fidelity.
