Generality of the RT–DP tradeoff observed on PITA

Determine whether the observed reversal in length generalization persists beyond the PITA propositional-logic dataset. In that reversal, models finetuned on reasoning traces outperform direct-prediction models on breadth-dominated PITA splits but underperform them on depth-dominated PITA splits. Also characterize the mechanism governing the tradeoff between long-trace failure modes and generalization strengths across tasks.

Background

The paper reports a consistent reversal in length generalization between reasoning trace (RT) and direct prediction (DP) models across PITA splits: RT models excel on broad and shallow ("boule") splits, while DP models excel on narrow and deep ("baguette") splits. This runs counter to the common intuition that RTs should help most on tasks with salient step-wise structure.

The authors note that failures of long reasoning traces (e.g., long-context processing issues and exposure bias) could intuitively explain RT underperformance on deep tasks, but it remains unresolved how this tradeoff operates overall and whether the phenomenon is specific to PITA or generalizes more broadly.

References

It remains unclear how this tradeoff operates between long-trace failures and generalization strengths, and whether these results are idiosyncratic to PITA or indicative of broader phenomena.

Boule or Baguette? A Study on Task Topology, Length Generalization, and the Benefit of Reasoning Traces (2602.14404 - Tong et al., 16 Feb 2026) in Section 2.4, Experimental results