Atom of Thoughts for Markov LLM Test-Time Scaling (2502.12018v2)

Published 17 Feb 2025 in cs.CL, cs.AI, and cs.LG

Abstract: LLMs achieve superior performance through training-time scaling, and test-time scaling further enhances their capabilities by conducting effective reasoning during inference. However, as the scale of reasoning increases, existing test-time scaling methods suffer from accumulated historical information, which not only wastes computational resources but also interferes with effective reasoning. To address this issue, we observe that complex reasoning can be achieved by solving a series of independent and self-contained subquestions. These subquestions are essentially atomic questions, exhibiting the memoryless property similar to Markov processes. Based on this observation, we propose Atom of Thoughts (AoT), where each state transition consists of decomposing the current question into a dependency-based directed acyclic graph and contracting its subquestions, forming a simplified question that maintains answer equivalence with the original problem. This answer preservation enables the iterative decomposition-contraction process to naturally form a meaningful Markov reasoning process. Furthermore, these atomic states can be seamlessly integrated into existing test-time scaling methods, enabling AoT to serve as a plug-in enhancement for improving reasoning capabilities. Experiments across six benchmarks demonstrate the effectiveness of AoT both as a standalone framework and a plug-in enhancement. Notably, on HotpotQA, when applied to gpt-4o-mini, AoT achieves an 80.6% F1 score, surpassing o3-mini by 3.4% and DeepSeek-R1 by 10.6%. The code is available at https://github.com/qixucen/atom.

Summary

  • The paper proposes the Atom of Thoughts (AoT) framework, inspired by the Markov process, which improves LLM test-time reasoning by decomposing problems into independent, verifiable subquestions.
  • Experiments across six benchmarks showed significant improvements, including an 80.6% F1 score on HotpotQA, demonstrating enhanced performance and computational efficiency.
  • The AoT framework acts as a plug-in to existing methods, optimizing the reasoning path and managing computational resources efficiently without relying on historical dependencies.

"Atom of Thoughts for Markov LLM Test-Time Scaling" explores a novel method for enhancing the test-time reasoning capabilities of LLMs. The paper identifies issues with existing test-time scaling methods, which inefficiently retain historical information, leading to unnecessary computational overhead and interference with effective reasoning processes. The authors propose the Atom of Thoughts (AoT) framework, inspired by the Markov process, to address these challenges by iteratively decomposing a reasoning problem into independent and verifiable subquestions.

Framework and Methodology:

  • Markov Process for Reasoning: The reasoning process is conceptualized as a Markov process, where each reasoning state transition involves the decomposition of the current question into a dependency-based directed acyclic graph (DAG) and the contraction of its subquestions into new atomic question states.
  • Decomposition and Contraction: The decomposition stage breaks the current question into subquestions arranged in a DAG that captures their structural dependencies. The contraction phase then synthesizes these subquestions into a new question state, progressively simplifying the overall problem (a minimal sketch of this loop follows the list).
  • Integration with Test-Time Scaling: AoT is designed to integrate seamlessly with existing test-time scaling methods, thereby serving as a plug-in enhancement that boosts reasoning efficiency without the burden of historical dependencies.
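
The Python sketch below illustrates one way the decomposition-contraction loop described above could be organized. It is a hypothetical illustration under stated assumptions, not the authors' implementation (see the linked repository for that): the llm_decompose, llm_answer, and llm_contract helpers are placeholder names for prompted LLM calls, and the dependency-edge representation is an assumption.

```python
# Minimal sketch of an AoT-style decomposition-contraction loop (an assumption
# about the structure, not the paper's actual code). The llm_* helpers stand in
# for prompted LLM calls and are deliberately left unimplemented here.

from typing import Dict, List, Tuple


def llm_decompose(question: str) -> Tuple[List[str], List[Tuple[int, int]]]:
    """Ask the model to split `question` into subquestions plus dependency
    edges (i, j), meaning subquestion j depends on subquestion i."""
    raise NotImplementedError("placeholder for a prompted LLM call")


def llm_answer(question: str) -> str:
    """Ask the model to answer a single self-contained (atomic) question."""
    raise NotImplementedError("placeholder for a prompted LLM call")


def llm_contract(question: str, solved: Dict[str, str], remaining: List[str]) -> str:
    """Fold the answers of solved subquestions into the remaining ones,
    producing a simpler question that keeps the original answer (this answer
    equivalence is what makes the process Markov)."""
    raise NotImplementedError("placeholder for a prompted LLM call")


def aot_solve(question: str, max_steps: int = 5) -> str:
    """Markov-style reasoning: each state is a self-contained question, so no
    reasoning history has to be carried between iterations."""
    state = question
    for _ in range(max_steps):
        subqs, edges = llm_decompose(state)

        # Subquestions with no incoming edges are independent and can be
        # answered directly; the rest depend on those answers.
        has_parent = {j for _, j in edges}
        independent = [q for i, q in enumerate(subqs) if i not in has_parent]
        dependent = [q for i, q in enumerate(subqs) if i in has_parent]

        if not dependent:
            # Nothing left to contract: the state is effectively atomic.
            break

        solved = {q: llm_answer(q) for q in independent}

        # Contraction: rewrite the problem as a new, simpler question state.
        state = llm_contract(state, solved, dependent)

    return llm_answer(state)
```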

Experimental Results:

  • Benchmarks: The framework was tested across six benchmarks including HotpotQA, MATH, GSM8K, BBH, MMLU-CF, and LongBench.
  • Performance Improvements: Gains were most pronounced on multi-hop reasoning tasks such as HotpotQA, where AoT applied to gpt-4o-mini achieved an 80.6% F1 score, surpassing o3-mini by 3.4% and DeepSeek-R1 by 10.6%.
  • Computational Efficiency: The AoT framework demonstrated improved computational efficiency by reallocating resources away from processing irrelevant historical data and focusing on the most crucial parts of the problem-solving process.

Contributions:

  1. AoT Framework: Proposed a novel reasoning framework that exploits the Markov (memoryless) property, allowing LLMs to tackle complex problem-solving tasks at reduced computational cost.
  2. Plug-In Capability: Showed how AoT can enhance existing methods by simplifying the question each method reasons over, thereby optimizing the reasoning path and managing computational resources efficiently (a rough sketch of the plug-in pattern follows this list).
  3. Robust Empirical Evaluation: Demonstrated the framework's capability through comprehensive evaluations across multiple benchmarks, indicating substantive gains in both effectiveness and efficiency.
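
As a rough illustration of the plug-in idea, AoT can be viewed as a preprocessing step: the contracted, answer-equivalent question replaces the original question fed to an existing test-time scaling method. The sketch below is an assumption about how such an integration could look; contract_to_atomic and chain_of_thought are hypothetical placeholders, not APIs from the paper's repository.

```python
# Hypothetical sketch of AoT as a plug-in: the contracted (simplified) question
# is handed to an existing test-time scaling method in place of the original.
# Both helper names below are illustrative, not APIs from the paper's repo.

from typing import Callable


def contract_to_atomic(question: str) -> str:
    """Run the decomposition-contraction loop until the question is atomic
    (see the methodology sketch above)."""
    raise NotImplementedError("placeholder for the AoT loop")


def chain_of_thought(question: str) -> str:
    """Stand-in for any existing test-time scaling / prompting method."""
    raise NotImplementedError("placeholder for a baseline reasoning method")


def aot_plugin(question: str, base_method: Callable[[str], str]) -> str:
    # AoT only changes *what* question the base method sees, not how it
    # reasons, so no extra history needs to be threaded through the call.
    return base_method(contract_to_atomic(question))
```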

While the paper presents the framework as an efficient tool for reducing computational waste and enhancing reasoning processes, the authors also acknowledge potential limitations related to the absence of a reflection mechanism in handling flawed decompositions. Future research could focus on enriching AoT with adaptive mechanisms to further bolster robustness and accuracy in varied reasoning contexts.
