Explaining prioritized exploration dynamics in Coconut

Explain the training dynamics that cause models trained with Coconut and Coconut-BFS to converge to similar exploration strategies and to allocate disproportionate attention and representation weight to frontier and optimal nodes during continuous-thought reasoning.

Background

Empirical analyses show that trained models using continuous thoughts allocate elevated attention to reachable, frontier, and especially optimal edges and nodes, suggesting a prioritized search behavior.
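
One simple way to quantify "elevated attention" is to compare the mean attention assigned to each node category with the mean over all nodes. The sketch below is a minimal illustration and rests on several assumptions. Per-node attention weights are assumed to be already extracted (for example, averaged over heads and layers). Category membership (reachable, frontier, optimal) is assumed to be given. The function name `attention_ratio_by_category` and the ratio metric are hypothetical and are not the measurement used in the cited paper.

```python
from statistics import mean


def attention_ratio_by_category(attn, categories):
    """Mean attention per category divided by the mean attention over all nodes.

    attn: dict mapping node -> attention weight (assumed already aggregated).
    categories: dict mapping category name -> set of nodes.
    A ratio above 1 means the category receives more than average attention.
    """
    baseline = mean(attn.values())
    return {
        name: mean(attn[node] for node in nodes) / baseline
        for name, nodes in categories.items()
        if nodes  # skip empty categories; mean() of no data raises StatisticsError
    }


# Synthetic weights for illustration only (not measured values).
attn = {"a": 0.4, "b": 0.2, "c": 0.3, "d": 0.1}
categories = {"optimal": {"a"}, "frontier": {"a", "b"}, "other": {"d"}}
print(attention_ratio_by_category(attn, categories))
```

A ratio is used instead of raw weights so that categories of different sizes, and graphs with different numbers of nodes, can be compared on one scale.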

Both Coconut (trained on optimal paths) and Coconut-BFS (trained on uniformly sampled frontier nodes) achieve near-perfect accuracy and exhibit similar exploration patterns, raising questions about the underlying training dynamics that produce this behavior.
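
The following minimal Python sketch makes the contrast between the two training signals concrete on a toy directed graph. It relies on these assumptions, none of which come from the paper's code:
- The frontier at step c is the set of nodes exactly c hops from the root.
- Optimal nodes are the frontier nodes that lie on a shortest root-to-target path.
- Coconut supervises the c-th continuous thought with the c-th node of one shortest path.
- Coconut-BFS supervises it with a uniformly sampled frontier node.

All function names (`frontier`, `optimal_nodes`, `supervision_labels`) are hypothetical.

```python
import random
from collections import deque


def bfs_distances(adj, source):
    """Hop distance from `source` to every node it can reach in a directed graph."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, []):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def frontier(adj, root, step):
    """Assumed definition: nodes exactly `step` hops from the root."""
    return {v for v, d in bfs_distances(adj, root).items() if d == step}


def optimal_nodes(adj, root, target, step):
    """Frontier nodes at `step` that lie on some shortest root-to-target path."""
    reverse = {}
    for u, vs in adj.items():
        for v in vs:
            reverse.setdefault(v, []).append(u)
    d_root = bfs_distances(adj, root)
    d_target = bfs_distances(reverse, target)  # hops from each node to the target
    total = d_root.get(target)
    if total is None:
        return set()
    return {v for v, d in d_root.items()
            if d == step and d_target.get(v) == total - step}


def shortest_path(adj, root, target):
    """One shortest root-to-target path found by BFS (ties broken by adjacency order)."""
    parent = {root: None}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, []):
            if v not in parent:
                parent[v] = u
                queue.append(v)
    if target not in parent:
        return None
    path, node = [], target
    while node is not None:
        path.append(node)
        node = parent[node]
    return path[::-1]


def supervision_labels(adj, root, target, num_steps, scheme, rng):
    """Hypothetical per-step labels for continuous thoughts 1..num_steps."""
    if scheme == "coconut":
        # Coconut-style: follow one optimal (shortest) path.
        return shortest_path(adj, root, target)[1:num_steps + 1]
    if scheme == "coconut_bfs":
        # Coconut-BFS-style: a uniformly sampled frontier node at each step.
        return [rng.choice(sorted(frontier(adj, root, c)))
                for c in range(1, num_steps + 1)]
    raise ValueError(f"unknown scheme: {scheme}")


adj = {"r": ["a", "b"], "a": ["c"], "b": ["d"], "c": ["t"], "d": []}
print(supervision_labels(adj, "r", "t", 2, "coconut", random.Random(0)))  # ['a', 'c']
print(supervision_labels(adj, "r", "t", 2, "coconut_bfs", random.Random(0)))
```

Under these assumptions, every Coconut label is an optimal node. A Coconut-BFS label is optimal only with probability equal to the fraction of frontier nodes that are optimal. The open question is why the two signals nonetheless yield similar exploration patterns.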

References

We leave explaining this behavior from the perspective of training dynamics as future work.

— Reasoning by Superposition: A Theoretical Perspective on Chain of Continuous Thought (2505.12514 - Zhu et al., 18 May 2025) in Section 5.4 (Exploration Priority)
