
Integrating RPC-theoretic insights into LLM training

Determine how to leverage the error-decomposition framework for sampling-based test-time scaling in large language model reasoning, which separates estimation error from model error, together with the Reasoning-pruning Perplexity Consistency (RPC) principle that combines internal LLM probabilities with self-consistency, to improve the training process of large language models rather than applying these ideas only as a post-hoc inference-time procedure.


Background

The paper introduces a theoretical framework for sampling-based test-time scaling in LLM reasoning that decomposes reasoning error into estimation error and model error. Within this framework, it finds that self-consistency suffers from slow convergence of its estimation error, while perplexity-based selection incurs large model error or degraded convergence.
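The decomposition above can be written schematically as follows; the symbols here are illustrative rather than the paper's exact formalization:

```latex
\underbrace{\mathrm{Err}_{\text{reasoning}}}_{\text{total error}}
\;=\;
\underbrace{\mathrm{Err}_{\text{est}}}_{\text{finite sampling of reasoning paths}}
\;+\;
\underbrace{\mathrm{Err}_{\text{model}}}_{\text{error inherent to the LLM}}
```

Under this view, drawing more samples shrinks only the estimation term, which is why purely sampling-based scaling eventually saturates at the model-error floor.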

To address these issues, the authors propose Reasoning-pruning Perplexity Consistency (RPC), which integrates internal LLM probabilities into the self-consistency framework and prunes low-probability reasoning paths. RPC is a post-hoc method: it operates at inference time without modifying the model or the training process.
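A minimal sketch of the RPC idea described above, assuming each sampled reasoning path comes with an average per-token log-probability from the LLM (the function name, the quantile-based pruning rule, and the probability-weighted vote are illustrative choices, not the paper's exact algorithm):

```python
import math
from collections import defaultdict


def rpc_answer(samples, prune_quantile=0.5):
    """Hypothetical RPC-style aggregation over sampled reasoning paths.

    samples: list of (answer, avg_log_prob) pairs, one per sampled path.
    prune_quantile: fraction of lowest-probability paths to discard.
    """
    # Internal probability of each path (exponentiated average log-prob).
    probs = [math.exp(lp) for _, lp in samples]

    # Reasoning pruning: drop paths below the chosen probability quantile.
    threshold = sorted(probs)[int(len(probs) * prune_quantile)]
    kept = [(ans, p) for (ans, _), p in zip(samples, probs) if p >= threshold]

    # Perplexity consistency: probability-weighted vote over final answers,
    # instead of the unweighted majority vote of plain self-consistency.
    scores = defaultdict(float)
    for ans, p in kept:
        scores[ans] += p
    return max(scores, key=scores.get)
```

For example, two high-confidence paths answering "42" outweigh two low-confidence paths answering "7", and the low-probability paths are pruned before voting.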

The authors note that training-based approaches may yield larger improvements and explicitly state that it remains an open question how to translate their theoretical insights and RPC's design principles into methods that improve the training process of LLMs.

References

Although integrating our method with trained LLMs may provide additional performance gains, it remains an open question how to utilize our insights to improve the training process of LLMs.

A Theoretical Study on Bridging Internal Probability and Self-Consistency for LLM Reasoning (2510.15444 - Zhou et al., 17 Oct 2025) in Appendix, Section "Limitations and Future Work" (Item 2)