STaR: Slow-Thinking for Table Reasoning

Updated 21 November 2025

STaR is a framework that enhances LLM table reasoning by decomposing analytic tasks into verifiable chain-of-thought steps.
It employs a two-stage, difficulty-aware reinforcement learning approach with curriculum training to boost accuracy and stability.
STaR integrates trajectory-level uncertainty quantification to select the most reliable answers from diverse reasoning paths.

STaR (Slow-Thinking for Table Reasoning) is a framework designed to endow LLMs with cognitive, step-by-step reasoning abilities over tabular data, addressing key limitations of shallow pattern matching and unstable outputs that characterize standard LLM approaches to table reasoning. By incorporating a curriculum-based, difficulty-aware reinforcement learning (RL) paradigm and a principled uncertainty quantification mechanism, STaR attains superior accuracy, stability, and generalization on both standard and out-of-domain table reasoning benchmarks (Zhang et al., 14 Nov 2025).

1. Motivation and Problem Setting

Table reasoning challenges conventional LLMs with demands for precise cell retrieval, logical inference across rows and columns, and exact numerical computation. Existing LLM methods tend to generate answers without modeling intermediate steps, yielding outputs that are difficult to interpret and susceptible to instability under minor input perturbations.

STaR seeks to emulate human “slow-thinking,” wherein analytic processes are explicitly decomposed into verifiable chains of reasoning. This approach promotes interpretability and iterative self-correction, thus supporting reliable downstream analysis. The framework targets cognitive table reasoning through three principal stages: construction of self-verified chain-of-thought (CoT) datasets, two-stage reinforcement learning with curriculum, and trajectory-level uncertainty quantification.

2. Architectural Components and Training Methodology

Slow-Thinking Dataset Construction

STaR generates “chain-of-thought” demonstrations by prompting a strong model with (table, question, ground-truth answer), then filtering out any reasoning trajectories whose computed final answer diverges from ground truth. A valid demonstration adopts the template $\langle\!\mathtt{think}\rangle$ {detailed steps} $\langle\!\mathtt{answer}\rangle$ {JSON-formatted answer}, thereby ensuring only self-consistent rationales are retained.

Two-Stage Difficulty-Aware Reinforcement Learning

The training set is partitioned according to observed pass@k rates under a supervised fine-tuning (SFT) base model:

$S_{\text{easy}} = \{\text{queries} \mid \text{pass@k}_1(\text{model}_{\text{SFT}}, \text{query}) \geq 0.6\}$
$S_{\text{hard}} = \{\text{queries} \mid \text{pass@k}_1(\text{model}_{\text{SFT}}, \text{query}) < 0.6\}$

Each contains approximately 10,000 queries. A curriculum is imposed during RL: foundational training occurs on “easy” queries (stage 1), followed by progressive training on “hard” queries with dynamic sample filtering based on real-time pass@k tracking (stage 2).

Policy optimization employs an enhanced Group-Relative Proximal Policy Optimization (GRPO) objective, omitting KL penalty and using asymmetric clipping:

$\mathcal{J}_{\text{GRPO}}(\theta) = \mathbb{E}_{q \sim \mathcal{D}; o_1, \dots, o_G \sim \pi_{\theta_\text{old}}} \Biggl[ \frac{1}{\sum_i|o_i|} \sum_{i=1}^G\sum_{t=1}^{|o_i|} \min \left( r_{i,t}(\theta)\hat A_{i,t},\, \mathrm{clip}(r_{i,t}(\theta),1-\epsilon_{\text{low}},1+\epsilon_{\text{high}})\hat A_{i,t} \right) \Biggr]$

Reward for each trajectory $y$ combines format correctness, partial correctness, and exact match:

$R(y)\;=\;0.2\,R_{\rm format}(y)\;+\;0.3\,R_{\rm partial}(y)\;+\;0.5\,R_{\rm complete}(y)$

3. Inference: Trajectory-Level Uncertainty Quantification

At inference, STaR samples multiple reasoning trajectories and computes the following metrics:

Token-level confidence: mean log-probability and entropy across tokens:

$\mathrm{logprob}(y)=\frac1n\sum_{i=1}^n\log p(t_i\mid t_{<i},x),\quad \mathrm{entropy}(y)=-\frac1n\sum_{i=1}^n\sum_{v\in\mathcal V}p(v\mid t_{<i},x)\log p(v\mid t_{<i},x)$

Answer-level consistency: grouping sampled trajectories by identical extracted answer, counting cardinality per answer group $G^a$ .

A fusion score $S(a)$ for each group $a$ is computed via normalized consistency, mean confidence, and maximum confidence:

$S(a) = 0.25\,\widehat{C^{\rm cons}_a} + 0.20\,\widehat{\bar C_a} + 0.55\,\widehat{C^{\max}_a}$

The final answer is selected as the highest-confidence trajectory within the highest-scoring group. Formal pseudocode for the selection is provided via a consistency-confidence fusion algorithm.

4. Empirical Evaluation and Benchmark Performance

STaR is evaluated across in-domain (WikiTableQuestions, HiTab, FinQA) and out-of-domain (TabMWP, TabFact) datasets. Exact match accuracy metrics show substantial improvements over comparable baselines. For example, STaR-8B outperforms Qwen3-8B by +8.98 pp on WTQ and +22.92 pp on HiTab; STaR-0.6B achieves competitive results relative to specialist systems.

Model	WTQ	HiTab	FinQA	TabMWP	TabFact
Qwen3-8B	83.29	70.04	26.63	64.76	90.22
STaR-8B	92.27	92.96	56.06	97.36	92.05

Ablation studies reveal contributions from RL, two-stage curriculum, and uncertainty quantification, as well as distinct performance roles for each. For instance, uncertainty quantification closes a latent +15–20 pp pass@8 vs pass@1 gap, and temperature elevation increases path diversity beneficial for out-of-domain robustness.

STaR aligns with emerging trends in reasoning-augmented LLMs. STaR-SQL adapts analogous principles for text-to-SQL synthesis, emphasizing reasoning-driven clause assembly and outcome-supervised verification (He et al., 19 Feb 2025). Process Reward Models (PRMs) such as TaTToo further advocate step-level, tool-grounded signal shaping for tabular reasoning, employing RL and dense supervision to improve policy LLM inference scaling (Zou et al., 7 Oct 2025). A plausible implication is the convergence of curriculum RL, self-verification, and tool-based supervision as foundational elements in table reasoning system design.

TaTToo presents table-grounded verification and dual-stage reward shaping, yielding superior results over purely text-based PRMs, and demonstrates sustained accuracy improvements across test-time scaling strategies.

6. Interpretations, Limitations, and Prospective Developments

STaR’s slow-thinking protocol aligns LLM reasoning patterns with human cognitive practices, fostering systematic table fact retrieval and multi-step analytic chains. Self-verification prior to answer commitment further stabilizes outputs. Trajectory-level uncertainty quantification integrates token-level and consensus signals, mitigating both overconfident-wrong and consensus-wrong failures.

Limitations include focus on single-table, text-only scenarios; transfer to multi-table joins or tables with visual structure remains unaddressed. The curriculum depends on SFT-based difficulty estimates, and automatic online estimators are an open research direction. Further, alternative reward decompositions and integration of external symbolic tools could enhance numeric computations.

This suggests that progress in cognitive table reasoning will be driven by multi-faceted protocols integrating chain-of-thought, curriculum learning, uncertainty quantification, and tool-grounded reward modeling.

7. Summary and Impact

STaR demonstrates that cognitive, slow-thinking approaches—coupled with rigorous RL training and uncertainty-aware inference—yield state-of-the-art performance and stability in table reasoning tasks, with strong cross-domain generalization. The framework’s modular design and fidelity to human reasoning paradigms have influenced related efforts in structured reasoning, SQL generation, and reward model construction. Continuing research is likely to expand these methodologies into broader classes of structured and multimodal data, establishing new benchmarks for reasoning reliability and interpretability in LLMs (Zhang et al., 14 Nov 2025, He et al., 19 Feb 2025, Zou et al., 7 Oct 2025).