LongCoT: Why Even the Best Language Models Fail at Extended Reasoning

This presentation examines the LongCoT benchmark, a rigorous evaluation framework that exposes a critical limitation in current large language models: their inability to sustain reliable reasoning over long chains of dependent steps. Despite advances in context length and short-chain performance, even the most capable models achieve less than 10% accuracy on problems requiring extended multi-step reasoning, revealing fundamental architectural constraints that prevent deployment in complex autonomous tasks.
Script
The best language model in the world achieves less than 10% accuracy when asked to reason through long chains of dependent problems, even when every single step is something it can solve in isolation. LongCoT is a new benchmark that reveals this hidden failure mode, and it challenges everything we thought we knew about model capability.
The researchers constructed 2,500 expert problems across mathematics, chemistry, computer science, chess, and logic. Each problem has a short input prompt but demands 10,000 to over 100,000 tokens of correct reasoning output, with every step depending on earlier decisions in an explicit or implicit dependency graph.
GPT 5.2, the top performer, scored just 9.83% on the full benchmark despite generating an average of 62,000 tokens per problem. Open-source models scored near zero. Yet when the same subproblems are presented independently, stripped of their dependencies, accuracy jumps to 55%, showing that the failure lies in coordination, not knowledge.
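That gap is what simple probability would predict. As an illustrative sketch (not LongCoT's actual scoring code), if each dependent step succeeds independently with probability p, the whole chain succeeds only with probability p raised to the number of steps, so even strong per-step accuracy collapses over long chains:

```python
def chain_success(p: float, n: int) -> float:
    """Probability that all n dependent steps succeed,
    assuming each step succeeds independently with probability p."""
    return p ** n

# Even 95% per-step accuracy over 50 dependent steps
# leaves under an 8% chance of a fully correct chain.
print(round(chain_success(0.95, 50), 4))  # ~0.0769
```

The independence assumption is a simplification; real reasoning errors can cascade, which only makes long chains harder than this model suggests.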
Trace analysis reveals that incorrect reasoning attempts spend far more time backtracking and getting stuck in dead ends. Models lose track of intermediate state, drift from their initial plan, and fail to detect or propagate errors across long chains, even within their own generated context.
Adding tool use or code execution only helps when the search structure can be offloaded programmatically. Compositional reasoning, where success requires coordinating a graph of dependent subproblems, remains fundamentally unsolved. This is an architectural limitation, not a scaling or scaffolding problem.
LongCoT exposes the gap between what models can do in isolation and what they can coordinate over extended horizons. Until architectures explicitly support long-range credit assignment, state management, and compositional planning, reliable autonomous reasoning will remain out of reach. Explore this benchmark and create your own videos at EmergentMind.com.