Generality of the RT–DP tradeoff observed on PITA
Determine whether the reversal in length generalization observed on PITA persists beyond that propositional-logic dataset. On PITA, models finetuned on reasoning traces (RT) outperform direct-prediction (DP) models on breadth-dominated splits but underperform them on depth-dominated splits. Also characterize the mechanism that governs the tradeoff between long-trace failure modes and generalization strengths across tasks.
References
It remains unclear how this tradeoff operates between long-trace failures and generalization strengths, and whether these results are idiosyncratic to PITA or indicative of broader phenomena.
— Boule or Baguette? A Study on Task Topology, Length Generalization, and the Benefit of Reasoning Traces
(2602.14404 - Tong et al., 16 Feb 2026) in Section 2.4, Experimental results