Code-Centric LLM Reasoning

Updated 30 June 2025
  • Code-centric LLM reasoning is a methodology that leverages code as data and structure to enhance logical, analogical, scientific, and legal reasoning abilities.
  • Integrating code in pre-training and instruction tuning produces significant performance gains on benchmarks like C-Eval, CosQA, and MBPP without harming general NLP tasks.
  • Dynamic code–text mixing and Chain-of-Thought prompting further refine model reasoning, yielding robust, domain-adapted capabilities for scientific QA and legal decision support.

Code-centric LLM reasoning refers to the paradigm, methodologies, and empirical findings concerning the enhancement and evaluation of large language models' (LLMs') reasoning abilities through the use of code: as data, as structure, and as an explicit reasoning mechanism. Recent research explores how code data influences LLMs' capacity for logical, analogical, scientific, legal, and code-specific reasoning, illuminating when and how the introduction of code confers generalizable and robust reasoning capabilities.

1. Pre-training and Instruction-Tuning with Code Data

Systematic investigation distinguishes the impact of code data when introduced at different model development stages:

  • Pre-training with Code + Text: Models trained from scratch on a mixture of natural language (100GB text) and Python source code (~50GB CodeParrot) show marked gains in broad, domain-agnostic reasoning abilities, not limited to code-related tasks (a minimal data-mixing sketch follows this list). The improvement holds across code, logical, legal, scientific, and analogical reasoning benchmarks, with negligible negative transfer on standard NLP tasks (e.g., NLI, commonsense).
    • Example: CodePanGu2.6B (2.6B params trained on code+text) outperforms PanGu13B (13B params trained on text) on C-Eval (logical), CosQA (code QA), MBPP (code gen) and others.
  • Instruction-Tuning with Code: Models further fine-tuned on code-oriented instructions acquire sharp, task-specific reasoning gains, particularly boosting performance in code QA and code generation (CosQA, MBPP). However, the improvements for general reasoning are less pronounced; in some cases, generalization plateaus or mildly regresses.
  • Dual-stage Code Integration: Incorporating code in both pre-training and instruction-tuning stages yields the strongest results on code-related tasks, suggesting cumulative reinforcement.
  • Model Size: While scaling up model parameters (2.6B to 13B) improves baseline performance, the deliberate inclusion of code in pre-training is a more potent vector for reasoning enhancement than increasing model size alone.
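
As a concrete illustration of the code+text pre-training mixture above, the sketch below interleaves documents from a text corpus and a code corpus at a fixed proportion. It is a minimal sketch under stated assumptions: `text_docs`, `code_docs`, and the document-level `code_fraction` are hypothetical stand-ins for the actual corpus construction and tokenization pipeline documented in the source repository.

```python
import random

def mixed_pretraining_stream(text_docs, code_docs, code_fraction=0.33, seed=0):
    """Yield raw documents from a text corpus and a code corpus at a fixed ratio.

    Hypothetical helper: `text_docs` / `code_docs` are any iterables of strings
    (e.g., ~100GB web text and ~50GB CodeParrot Python files), and
    `code_fraction` approximates a 50GB-code / 150GB-total mix at the
    document level; the real pipeline may instead weight by tokens or bytes.
    """
    rng = random.Random(seed)
    text_iter, code_iter = iter(text_docs), iter(code_docs)
    while True:
        source = code_iter if rng.random() < code_fraction else text_iter
        try:
            yield next(source)
        except StopIteration:
            return  # stop once either corpus is exhausted

# Usage: feed the stream into tokenization and sequence packing for
# next-token-prediction pre-training.
# stream = mixed_pretraining_stream(text_corpus, python_corpus, code_fraction=0.33)
```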

2. Evaluation Across Reasoning Domains and Tasks

Evaluation methodology spans a carefully selected array of classification and generative tasks, distributed across major reasoning domains:

Task                 | Domain               | Dataset
---------------------|----------------------|---------------
Logical Reasoning    | Logical Reasoning    | C-Eval (Logic)
Code QA              | Code Reasoning       | CosQA
Code Generation      | Code Reasoning       | MBPP
Legal Reasoning      | Legal Reasoning      | JEC-QA
Scientific Reasoning | Scientific Reasoning | ScienceQA
Analogical Reasoning | Analogical Reasoning | E-KAR
  • Metrics: Classification accuracy (via perplexity-based answer selection, sketched after this list), BLEU scores for generation, and accuracy gains when enhanced with Chain-of-Thought (CoT) prompting.
  • CoT Prompting: All models benefit from CoT, but those pre-trained on code exhibit larger gains, suggesting that code instills an implicit, stepwise logical structure.
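
The classification metric above relies on perplexity-based answer selection: each candidate answer is appended to the question, and the option to which the model assigns the lowest mean token loss (equivalently, the lowest perplexity) is chosen. Below is a minimal sketch using Hugging Face transformers; the GPT-2 checkpoint and the exact scoring of question tokens are illustrative assumptions rather than the evaluation harness used for PanGu/CodePanGu.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative checkpoint; the article evaluates PanGu/CodePanGu models instead.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def select_answer(question: str, options: list[str]) -> int:
    """Return the index of the option with the lowest mean next-token loss
    (i.e., lowest perplexity) when appended to the question."""
    losses = []
    for option in options:
        ids = tokenizer(question + " " + option, return_tensors="pt").input_ids
        with torch.no_grad():
            # Passing labels makes the model return the mean causal LM loss.
            losses.append(model(ids, labels=ids).loss.item())
    return min(range(len(options)), key=losses.__getitem__)

# Example multiple-choice query (content is illustrative, not drawn from C-Eval):
# select_answer("Water freezes at 0 degrees Celsius. Is ice colder than 10 C water?",
#               ["Yes", "No"])
```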

3. Empirical Findings and Chain-of-Thought

Quantitative results establish code’s value in diverse reasoning contexts:

  • CodePanGu2.6B (code+text) achieves 40.90% on C-Eval (logic) and 50.50% on CosQA, surpassing PanGu2.6B (text) and sometimes even PanGu13B (text).
  • In MBPP (code generation), a BLEU score jump from 0.52 (text) to 5.06 (code+text) underscores the magnitude of arithmetic and programming-specific reasoning improvement.
  • CoT-enhanced evaluation demonstrates substantial jumps: ScienceQA accuracy rises from 45.93% to 68.76% (PanGu2.6B text) and to 70.30% (CodePanGu2.6B).

Notably, negative transfer to unrelated NLP tasks is minimal; slight drops only occur in the most complex reading comprehension scenarios and are attributed to mixing ratios or capacity saturation.

4. Dynamic Mixing Strategy for Code and Text

Mixing strategies during instruction-tuning further refine performance:

  • Uniform Mixing: Fixed proportions of code and text.
  • Stepwise Increase: Ramping up code over the tuning process.
  • Stepwise Decrease: A high early code share (e.g., text:code 5:5), tapering off later (7:3); shown to maintain general reasoning while maximizing code-specific gains.

Interpretation: Early code exposure fosters systematic, stepwise reasoning patterns, which subsequent text-heavy phases help generalize.

Text:code ratio per instruction-tuning phase:

Phase | Uniform | Stepwise ↑ | Stepwise ↓
------|---------|------------|-----------
1     | 5:3     | 7:3        | 5:5
2     | 5:3     | 7:3        | 6:4
3     | 5:3     | 6:4        | 7:3
4     | 5:3     | 5:5        | 7:3

Stepwise decrease is empirically superior on code-intensive evaluation.
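
A minimal sketch of applying the phase-wise schedule above during instruction tuning is shown below. The ratios and their text:code reading follow the table and the preceding bullets; the batching helper, its data pools, and the batch size are hypothetical.

```python
import random

# Phase-wise text:code ratios from the stepwise-decrease column of the table.
STEPWISE_DECREASE = [(5, 5), (6, 4), (7, 3), (7, 3)]

def sample_batch(text_pool, code_pool, phase, batch_size=32, seed=None):
    """Draw an instruction-tuning batch whose text/code composition follows
    the stepwise-decrease schedule for the given phase (0-based index)."""
    rng = random.Random(seed)
    text_parts, code_parts = STEPWISE_DECREASE[phase]
    n_code = round(batch_size * code_parts / (text_parts + code_parts))
    batch = rng.sample(code_pool, n_code) + rng.sample(text_pool, batch_size - n_code)
    rng.shuffle(batch)  # avoid a fixed code-then-text ordering within the batch
    return batch

# Phase 0 draws ~16 code instructions per 32-example batch; phase 3 draws ~10.
```

Under this schedule roughly half of each early batch is code, which the later, text-heavier phases then help generalize, matching the interpretation above.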

5. Applications in Science and Law

Pre-training with code+text data directly benefits multi-step, logic-intensive applications:

  • Scientific QA: Models are demonstrably better at solving stepwise or compositional problems in scientific question answering (e.g., ScienceQA with CoT and code).
  • Legal Decision Support: Legal reasoning (JEC-QA) benefits from domain-matched instruction-tuning; models become suitable for regulatory compliance, legal support, and policy reasoning with high transparency.
  • Robustness and Generalization: Code-centric LLMs retain competitive ability on standard language tasks, essential for multi-domain deployment.

6. Technical Foundation

  • Model Architecture: 32-layer GPT-style Transformer with an additional query layer, trained autoregressively with a BPE vocabulary of roughly 40k tokens (text-only) or 130k+ tokens (code+text), depending on the data mix.
  • Training Objective: Standard next-token prediction:

$$\mathcal{L} = \sum_{i=1}^{n} \log p(x_i \mid x_1, x_2, \dots, x_{i-1}; \Theta)$$

where $\Theta$ denotes the model parameters.
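
For concreteness, the objective above is the standard causal language-modeling loss; a short PyTorch sketch is given below (the released implementations target Torch and MindSpore, but this particular function is an illustrative assumption, not the authors' training code).

```python
import torch
import torch.nn.functional as F

def next_token_loss(logits: torch.Tensor, input_ids: torch.Tensor) -> torch.Tensor:
    """Mean negative log-likelihood of each token given its prefix.

    logits: (batch, seq_len, vocab) from an autoregressive Transformer.
    input_ids: (batch, seq_len) token ids; token i is predicted from tokens < i.
    """
    # Shift so that the logits at position i are scored against token i + 1.
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = input_ids[:, 1:].contiguous()
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
    )
```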

  • Prompting/Evaluation: Perplexity-based answer selection for classification; BLEU for generation; autoregressive sampling for generative tasks.
  • Chain-of-Thought: Prompts are designed to solicit explicit reasoning steps, with performance reported for both standard and CoT-enhanced settings.
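
To illustrate the standard versus CoT-enhanced settings, the snippet below formats a multiple-choice item both ways; the CoT trigger phrase and the example question are assumptions, not the exact templates used in the evaluation.

```python
def build_prompt(question: str, options: list[str], chain_of_thought: bool) -> str:
    """Format a multiple-choice question as a standard or a CoT prompt."""
    option_lines = "\n".join(f"({chr(65 + i)}) {opt}" for i, opt in enumerate(options))
    prompt = f"Question: {question}\n{option_lines}\n"
    if chain_of_thought:
        # CoT variant: elicit explicit intermediate reasoning before the answer.
        prompt += "Let's think step by step, then give the final answer.\nReasoning:"
    else:
        prompt += "Answer:"
    return prompt

# Illustrative ScienceQA-style item:
print(build_prompt(
    "Which property of a mineral is tested by scratching it against a tile?",
    ["streak", "luster", "hardness"],
    chain_of_thought=True,
))
```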

7. Resources and Reproducibility

  • Open Access: Full source code, model checkpoints, and implementation details for both Torch and MindSpore (with support for Ascend 910 hardware), including trained LLMs, are provided at https://github.com/yingweima2022/CodeLLM.
  • Benchmarks and Parameters: Downloadable trained weights and easy-to-reproduce evaluation scripts.

In summary, the evidence supports that code-centric data, especially when introduced at the pre-training stage, substantially amplifies both general and task-specific reasoning capabilities in LLMs without harming their general NLP competence. Dynamic code–text mixing strategies further optimize induction of stepwise, logic-driven reasoning, setting a foundation for robust, domain-adapted LLMs across scientific, legal, and technical spheres.