TabCodeRL: Table-Aware Reinforcement Learning
- TabCodeRL is a reinforcement learning framework that enhances table question answering by guiding LLMs through verifiable, code-based reasoning.
- It employs a two-stage training process—supervised fine-tuning followed by reinforcement learning with dense, interpretable reward metrics.
- Empirical results show significant accuracy improvements on ReasonTabQA, demonstrating effective multi-table reasoning and sample efficiency.
TabCodeRL is a reinforcement learning methodology designed to strengthen the table reasoning capacity of LLMs by guiding logical reasoning path generation through table-aware, verifiable reward functions. Originating within the ReasonTabQA benchmark for Table Question Answering (TableQA) in real-world industrial scenarios, TabCodeRL aims to address reasoning over multi-table structures and nested headers at scale, providing dense, interpretable supervision signals for code-generating models in both answer-focused ("no-thinking") and stepwise reasoning ("thinking") QA paradigms (Pan et al., 12 Jan 2026).
1. Framework and Stages of TabCodeRL
TabCodeRL is divided into two primary sequential phases:
- Cold-Start Supervised Fine-Tuning (SFT): The initial policy is fine-tuned on gold-standard <table, question, reasoning-process> tuples sourced from ReasonTabQA. Two model variants are constructed: “no-thinking” (predicts direct answers) and “thinking” (generates multi-step code-based reasoning chains).
- Reinforcement Learning with Verifiable Rewards (RLVR): The DAPO algorithm iteratively improves the fine-tuned policy. TabCodeRL samples groups of candidate outputs per question, executes each extracted candidate Python code snippet, and applies a composite reward metric to steer the underlying LLM toward accurate, table-grounded logical reasoning.
The system essentially transforms code-generation for table QA into an episodic Markov Decision Process, operating on serialized table-question contexts, autoregressive token generation, and reward calculation post output completion.
2. Markov Decision Process Formulation
TabCodeRL models language generation as the following episodic MDP:
- States: Each state $s_t$ comprises the structured context $x$ (the serialized input table and question) plus all tokens generated so far, i.e., $s_0 = x$ and $s_t = (x, y_{<t})$.
- Actions: At each step, the model selects the next token $y_t$ from the vocabulary $\mathcal{V}$.
- Transitions: Deterministic token appending ($s_{t+1} = s_t \circ y_t$).
- Termination: Sequence ends on the end-of-sequence token.
- Reward: All transitions yield zero reward. The terminal (episodic) reward is computed only after full completion and validation of the generated output (see Section 3 for full formula).
This MDP formulation accommodates batch sampling, enabling dense policy optimization and empirical assessment of candidate code outputs relative to the reasoning demands of complex industrial tables.
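The episodic formulation above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the token ids, the `EOS` constant, and the `reward_fn` hook are assumptions.

```python
# Minimal sketch of the episodic MDP: state = prompt + generated tokens,
# deterministic token appending, zero intermediate reward, and a terminal
# reward computed only once the full output is available.
from dataclasses import dataclass, field

EOS = 0  # hypothetical end-of-sequence token id


@dataclass
class TableQAEpisode:
    prompt: list                              # serialized table + question tokens
    generated: list = field(default_factory=list)

    @property
    def state(self):
        # State s_t: structured context plus all tokens emitted so far.
        return self.prompt + self.generated

    def step(self, token, reward_fn):
        # Transition: deterministically append the chosen token.
        self.generated.append(token)
        done = (token == EOS)
        # All intermediate rewards are zero; the episodic reward is
        # assigned only at termination, after validating the output.
        reward = reward_fn(self.generated) if done else 0.0
        return self.state, reward, done
```

The zero-intermediate-reward structure is what makes the composite terminal reward (Section 3) the sole learning signal.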
3. Table-Aware Verifiable Reward Mechanisms
Verifiable reward design is central to TabCodeRL. The total reward for a candidate output is a linear combination of three interpretable components (Table 1):
| Reward Component | Description | Range |
|---|---|---|
| Code correctness | Piecewise reward over extraction, execution, and answer match | {0.0, 0.5, 1.0, 3.0} |
| Table-path selection | F1 score over extracted and gold table paths | [0, 1] |
| Code similarity | CodeBLEU similarity to batch-correct codes | [0, 1] |
- Piecewise Code Correctness: Assesses extraction success, execution validity, and correctness of the predicted answer compared to the gold answer.
- Table-Path Selection: Measures F1 overlap between retrieved and target table-access paths (e.g., sheet navigation, row selection).
- Inner-Group Code Similarity: Penalizes incorrect outputs less when they closely resemble correct implementations, measured as intra-batch CodeBLEU similarity against the batch's correct code snippets.
The weighted sum $R = \lambda_1 R_{\text{code}} + \lambda_2 R_{\text{path}} + \lambda_3 R_{\text{sim}}$, with weights $\lambda_i$ balancing the three components, delivers a dense, multi-faceted supervisory signal tailored to real-world TableQA logic.
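A minimal sketch of this composite reward follows. The mapping of the piecewise values {0.0, 0.5, 1.0, 3.0} to the four extraction/execution/answer cases is one plausible reading of Table 1, the weights are illustrative, and `difflib` is used as a stand-in for CodeBLEU.

```python
# Illustrative composite reward: piecewise correctness + table-path F1
# + inner-group code similarity. Weights and the similarity metric are
# assumptions for this sketch, not the paper's exact configuration.
import difflib


def piecewise_correctness(extracted, executed_ok, answer, gold):
    # Assumed mapping: no code extracted -> 0.0; execution fails -> 0.5;
    # runs but wrong answer -> 1.0; correct answer -> 3.0.
    if not extracted:
        return 0.0
    if not executed_ok:
        return 0.5
    return 3.0 if answer == gold else 1.0


def path_f1(pred_paths, gold_paths):
    # F1 overlap between retrieved and target table-access paths.
    pred, gold = set(pred_paths), set(gold_paths)
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(gold)
    return 2 * precision * recall / (precision + recall)


def code_similarity(code, correct_codes):
    # Stand-in for CodeBLEU: best sequence similarity to any
    # batch-correct implementation.
    if not correct_codes:
        return 0.0
    return max(difflib.SequenceMatcher(None, code, c).ratio()
               for c in correct_codes)


def total_reward(code, extracted, executed_ok, answer, gold,
                 pred_paths, gold_paths, correct_codes,
                 weights=(1.0, 0.5, 0.5)):  # illustrative weights
    w1, w2, w3 = weights
    return (w1 * piecewise_correctness(extracted, executed_ok, answer, gold)
            + w2 * path_f1(pred_paths, gold_paths)
            + w3 * code_similarity(code, correct_codes))
```

Because each component is computed from verifiable artifacts (execution results, access paths, code text), the reward remains fully interpretable per candidate.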
4. Training Loop and Policy Optimization
TabCodeRL’s policy optimization protocol leverages the DAPO variant of PPO, utilizing group sampling and reward normalization across batches. The high-level workflow includes:
- Initialization: The policy is initialized via SFT on thinking-mode ReasonTabQA data.
- Sampling: For each RL epoch, sample a group of candidate completions for a given question and its associated table.
- Code Extraction and Execution: Extract the Python code snippet from each candidate, execute it, and record the resulting answer.
- Reward Computation: Compute the composite reward per candidate, derive the batch mean and standard deviation, and calculate normalized advantage terms.
- Clipped PPO-Style Update: Update the policy by minimizing the clipped-objective loss across sequence tokens and candidates.
The use of a shared transformer backbone, a linear value head for advantage estimation, and the absence of extra table-specific encoders epitomizes the reward-centric approach driving TabCodeRL’s performance.
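The normalization and clipping steps above can be sketched as follows; this is a GRPO/DAPO-style computation under assumed defaults (the epsilon constants are illustrative, not from the paper).

```python
# Sketch of group reward normalization and the clipped surrogate term.
import statistics


def group_advantages(rewards, eps=1e-6):
    """Normalize a group of episodic rewards to zero-mean advantages."""
    mu = statistics.fmean(rewards)
    sigma = statistics.pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def clipped_objective(ratio, advantage, eps_clip=0.2):
    """Token-level clipped PPO surrogate term (to be maximized)."""
    unclipped = ratio * advantage
    clipped = max(min(ratio, 1.0 + eps_clip), 1.0 - eps_clip) * advantage
    return min(unclipped, clipped)
```

Normalizing within the sampled group keeps the advantage scale stable even as the composite reward ranges from 0 to well above 3.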
5. Model Architecture and Input Handling
The primary model instantiation is Qwen3-8B-Instruct—a 32-layer transformer with 8 billion parameters. Architectural features include:
- Extended Context Window: 16,384 tokens allocated for both prompt and output, necessary for parsing large-scale industrial tables.
- Table Serialization: Tabular data is flattened into row-by-row text, with nested headers resolved to linear form.
- Tokenization: Standard byte-pair encoding enables seamless handling of NL and code tokens.
- Advantage Estimation: Value head atop transformer facilitates online baseline estimation within PPO/DAPO updates.
TabCodeRL avoids custom graph encoders or specialized table attention, instead leveraging reward-driven gradients to guide reasoning path synthesis.
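The serialization step can be illustrated as below. The separators and the "parent / child" joining of nested header levels are assumptions for this sketch; the paper's exact flattening format is not specified here.

```python
# Illustrative flattening of a table with nested (multi-row) headers
# into row-by-row text, resolving each column's header path top-down.
def serialize_table(header_rows, data_rows, sep=" | "):
    # Transpose header rows so each entry is one column's header path.
    cols = list(zip(*header_rows))
    flat_headers = [" / ".join(h for h in col if h) for col in cols]
    lines = [sep.join(flat_headers)]
    lines += [sep.join(str(c) for c in row) for row in data_rows]
    return "\n".join(lines)
```

This is exactly the kind of linearization where structural cues can be lost, the limitation noted in Section 7.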
6. Empirical Performance and Analysis
TabCodeRL demonstrates marked empirical gains under the ReasonTabQA experimental regime:
- Accuracy Improvement: On ReasonTabQA, base Qwen3-8B (“thinking mode”) achieves 49.87% accuracy; TabCodeRL elevates this to 61.89% (+12.02 pp), surpassing all other open-source models.
- Ablation Insights: From the no-thinking baseline (40.97%), SFT alone yields 53.40%; adding DAPO gives 55.14%; and full TabCodeRL delivers 58.01%. RL and verifiable rewards each contribute incremental improvement (~5–6 pp and ~3 pp respectively).
- Cross-Benchmark Robustness: TabCodeRL augments generalization performance on WTQ, AIT-QA, MiMoTable, and HiTab by approximately 3–6 pp, demonstrating transferable table reasoning logic.
- Sample Efficiency: Most RL benefits accrue within the first 10K episodes, highlighting efficiency and responsiveness of the reward structure.
A plausible implication is that composite, verifiable rewards provide more informative, dense feedback than sparsely evaluated end-task objectives, particularly in the context of structurally complex, large-scale tables.
7. Limitations and Open Challenges
Under the ReasonTabQA evaluation, several constraints are evident:
- Compute and Memory Requirements: Execution with a 16K-token context and batch group sampling is resource-intensive.
- Hyperparameter Sensitivity: Careful selection of the reward weights and PPO clipping parameters is vital for training stability.
- Information Loss in Serialization: Flattening complex table structures (e.g., multi-level headers) into linearized text risks omitting structural cues.
- Residual Performance Gap: TabCodeRL, while closing much of the gap to large closed-source models, does not match Gemini-3-Pro (67.58% accuracy) on ReasonTabQA, indicating that robust, scalable table reasoning in industrial settings remains unresolved.
This suggests further research should explore integrating richer table representations, advanced graph encoders, or additional structural supervision to augment reward-driven training regimes for industrial TableQA.