EM-INF: Scaling Continuous Reasoning
- The paper demonstrates that applying dropout during hidden-state updates in COCONUT generates multiple latent trajectories, enabling measurement of a theoretical Pass@N upper bound for continuous reasoning.
- It reveals that continuous reasoning paths exhibit homogeneous geometric properties, rendering reward models (PRMs/ORMs) markedly less effective than in the discrete setting.
- The analysis suggests that incorporating task-aligned inductive biases and contrastive objectives is key to bridging the gap between theoretical potential and realized performance.
Inference-time scaling (EM-INF) denotes a family of strategies for improving model performance by allocating additional computation or search at inference time, particularly through the generation and selective reranking of multiple candidate solutions. Within continuous-space reasoning, EM-INF aims to extend the empirical successes of its discrete counterpart, where multi-path sampling and downstream reranking yield pronounced gains in LLMs, into the domain of models whose reasoning operates over continuous latent trajectories. The principal paper on EM-INF for continuous-space reasoning uses COCONUT, an LLM that reasons in a continuous latent space, as a backbone to systematically adapt established EM-INF methods and diagnose their transferability.
1. Theoretical Motivation and Classical Mechanisms
In discrete-space reasoning (e.g., Chain-of-Thought in text LMs), EM-INF involves generating multiple candidate reasoning chains via stochastic sampling and then selecting the most promising candidate according to a process or outcome reward model (PRM/ORM). This multi-sample-and-rerank pipeline empirically yields appreciable accuracy gains on challenging tasks, with best-of-N selection (Pass@N) tightly upper-bounding achievable performance.
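As a concrete illustration, here is a minimal Python sketch of the multi-sample-and-rerank pipeline; `sample_chain` and `score` are hypothetical stand-ins for a stochastic decoder and a PRM/ORM scorer, not interfaces from the paper:

```python
from typing import Callable, List, Tuple

def best_of_n(
    prompt: str,
    sample_chain: Callable[[str], str],  # stochastic CoT sampler (hypothetical)
    score: Callable[[str, str], float], # reward model: (prompt, chain) -> score
    n: int = 16,
) -> str:
    """Sample n candidate reasoning chains, then return the highest-scoring one."""
    candidates: List[Tuple[float, str]] = [
        (score(prompt, chain), chain)
        for chain in (sample_chain(prompt) for _ in range(n))
    ]
    return max(candidates, key=lambda pair: pair[0])[1]
```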
The rationale for applying this paradigm to continuous-space LMs, such as COCONUT, is twofold:
- Continuous representations offer efficiency in storing, manipulating, and reasoning over latent trajectories.
- Matching the successes of discrete EM-INF could, in principle, yield performance boosts with fewer resources and enhanced flexibility.
However, essential mechanisms that underpin EM-INF in discrete space—trajectory-level diversity from stochastic decoding and re-ranking via discriminative reward models—must be carefully re-expressed for the continuous domain.
2. Adapting EM-INF to Continuous Reasoning: Dropout-Based Sample Generation
By construction, COCONUT is deterministic: its latent trajectory is fixed for a given input, barring stochasticity in the final output step. In contrast, discrete LMs exhibit diversity at every token decode. To introduce trajectory-level diversity in continuous latent space, dropout-based sampling is applied during the hidden state update stages of the continuous reasoning pipeline, while keeping answer generation deterministic. This injects controlled stochasticity directly into the process of generating latent paths.
Mathematically, for the latent state sequence $h_1, \dots, h_T$, each update takes the schematic form $h_{t+1} = f_\theta(\mathrm{Dropout}(h_t))$, where dropout is actively applied (in training mode) during computation of the hidden-state update.
This allows systematic sampling of N distinct trajectories per input, enabling empirical measurement of Pass@N (the fraction of problems solved by at least one of the N trajectories).
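A minimal PyTorch sketch of this scheme, assuming a hypothetical COCONUT-like interface (`model.encode`, `model.latent_step`, `model.decode_answer`); only the latent updates see training-mode dropout, while answer decoding stays deterministic:

```python
import torch

@torch.no_grad()
def sample_latent_trajectories(model, inputs, n_samples=8, n_thoughts=4, p=0.1):
    """Sample n_samples latent trajectories by applying training-mode dropout
    to each hidden-state update; answer decoding stays deterministic.

    `model.encode`, `model.latent_step`, and `model.decode_answer` are
    hypothetical stand-ins for COCONUT's loop that feeds the last hidden
    state back as the next input embedding.
    """
    trajectories = []
    for _ in range(n_samples):
        h = model.encode(inputs)                 # initial latent state
        path = [h]
        for _ in range(n_thoughts):
            # Inject stochasticity into the latent update only.
            h = torch.nn.functional.dropout(h, p=p, training=True)
            h = model.latent_step(h)             # next continuous thought
            path.append(h)
        answer = model.decode_answer(h)          # deterministic final decode
        trajectories.append((path, answer))
    return trajectories
```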
3. Evaluation: Potential and Ceiling of EM-INF in Continuous Space
Pass@N analysis reveals that the likelihood of solving a given instance grows with the number of sampled trajectories, and for sufficiently large N, COCONUT's theoretical upper bound surpasses baseline text-based CoT performance. This suggests that, given a perfect method for selecting the correct trajectory, EM-INF could confer substantial improvements in continuous space.
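One standard way to estimate Pass@k from a fixed sampling budget is the unbiased estimator of Chen et al. (2021); the sketch below assumes n sampled trajectories of which c are correct (the paper's exact protocol may differ):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimate given n sampled trajectories of which c are
    correct: 1 - C(n-c, k) / C(n, k) (Chen et al., 2021)."""
    if n - c < k:
        return 1.0  # a correct sample appears in every size-k subset
    return 1.0 - comb(n - c, k) / comb(n, k)
```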
However, actual reranking performance using trained PRM/ORM falls short:
- Deterministic COCONUT achieves 31.08% accuracy (N=1).
- With dropout-based sampling and PRM reranking, accuracy reaches only 33.36%, a gain of under 2.3 points, despite a Pass@N upper bound of over 42.6%.
- Comparable methods in discrete LMs yield >10-point improvements on equivalent tasks.
The substantial gap between the theoretical Pass@N upper bound and the realized gain stems from limitations in the reward-reranking step, prompting a deeper investigation of the properties of continuous reasoning paths.
4. Geometric and Discriminative Obstacles in Continuous-Space EM-INF
Comprehensive geometric analysis of reasoning paths reveals minimal separability between correct and incorrect trajectories:
- Isotropy, sparsity, and dimensionality metrics show no statistically significant distinction between latent paths leading to correct versus incorrect solutions (illustrative proxies for such metrics are sketched after this list).
- Visualizations (e.g., t-SNE) and measures of compactness or path smoothness show no correlation with solution correctness.
- PRM/ORM F1 scores for discriminating correct from incorrect paths hover near chance (54% and 51%), with high false-positive rates.
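The following sketch computes illustrative proxies for these descriptors on a single latent path; the exact metric definitions used in the paper are assumptions here:

```python
import torch

def trajectory_geometry(path: torch.Tensor) -> dict:
    """Illustrative geometric descriptors for a latent path of shape (T, d).
    These are proxies for isotropy / sparsity / dimensionality / smoothness,
    not the paper's exact metric definitions."""
    centered = path - path.mean(dim=0, keepdim=True)
    s = torch.linalg.svdvals(centered)              # spread of the trajectory
    var = s.pow(2) / s.pow(2).sum().clamp_min(1e-12)
    return {
        # Participation ratio of the variance spectrum.
        "effective_dim": float(1.0 / var.pow(2).sum().clamp_min(1e-12)),
        # Ratio of smallest to largest singular value.
        "isotropy": float(s.min() / s.max().clamp_min(1e-12)),
        # Fraction of near-zero activations.
        "sparsity": float((path.abs() < 1e-3).float().mean()),
        # Mean cosine similarity of consecutive steps.
        "smoothness": float(torch.nn.functional.cosine_similarity(
            path[1:], path[:-1], dim=-1).mean()),
    }
```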
Perturbation experiments further show that COCONUT's latent space is highly robust: adding considerable noise, or even fully corrupting the latent thoughts, often still yields correct answers. This suggests that the trajectory geometry itself is largely non-informative and that reward models are deprived of usable features for effective reranking.
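A hypothetical version of such a probe, reusing the interface assumed in the sampling sketch above: corrupt the final latent thought with Gaussian noise and check whether the decoded answer survives:

```python
import torch

@torch.no_grad()
def perturbation_survival(model, inputs, gold_answer, n_thoughts=4, sigma=1.0):
    """Corrupt the final latent thought with Gaussian noise and check whether
    the decoded answer still matches the gold answer. Uses the same
    hypothetical `encode` / `latent_step` / `decode_answer` interface as above.
    """
    h = model.encode(inputs)
    for _ in range(n_thoughts):
        h = model.latent_step(h)
    h_corrupted = h + sigma * torch.randn_like(h)   # heavy corruption
    return model.decode_answer(h_corrupted) == gold_answer
```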
5. Failure of Reward Models and the Role of Inductive Bias
Analysis attributes this failure to fundamental mismatches between continuous thought space modeling and the assumptions underlying reward model verification:
- Discrete LMs implicitly leverage strong inductive biases—syntactic structure, compositionality, transparent text alignment with task sequence—that make correct and incorrect reasoning easily distinguishable by reward models.
- COCONUT and similar continuous-space LMs are optimized for final answer accuracy, with no explicit constraints or regularization on the geometric or semantic structure of latent thought trajectories.
This absence of structural or task-aligned inductive biases leads to the “homogeneous cluster” effect, whereby correct and incorrect reasoning paths are embedded in similarly-shaped regions with no reliable boundary, undermining the discriminative capability of PRMs/ORMs.
6. Mathematical and Experimental Formulations
Key training signals are derived from Monte Carlo step completions: the PRM objective $\mathcal{L}_{\mathrm{PRM}}$ combines a cross-entropy term on stepwise correctness with a mean-squared-error term on Monte Carlo value estimates; the ORM objective $\mathcal{L}_{\mathrm{ORM}}$ is a cross-entropy on final outcome correctness.
Despite formally correct learning frameworks, these signals are rendered ineffective by the absence of meaningful differentiators in latent space.
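Assuming per-step correctness logits and scalar value heads, the two objectives might be sketched as follows (the paper's exact formulation may differ):

```python
import torch.nn.functional as F

def prm_loss(step_logits, step_values, mc_labels, mc_values):
    """Sketch of L_PRM: binary cross-entropy on per-step correctness logits
    plus mean-squared error against Monte Carlo value estimates. The tensor
    shapes and heads are assumptions, not the paper's exact formulation."""
    ce = F.binary_cross_entropy_with_logits(step_logits, mc_labels.float())
    mse = F.mse_loss(step_values, mc_values)
    return ce + mse

def orm_loss(outcome_logits, outcome_labels):
    """Sketch of L_ORM: cross-entropy on final outcome correctness."""
    return F.binary_cross_entropy_with_logits(outcome_logits,
                                              outcome_labels.float())
```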
7. Strategic Implications and Path Forward
To unlock the utility of EM-INF in continuous reasoning, it is necessary to go beyond naive adaptation of discrete-space strategies. Explicitly incorporating task-aligned inductive biases into continuous latent path modeling is essential. Potential directions include:
- Incorporating contrastive objective functions or metric-learning criteria that explicitly maximize separability between correct and incorrect latent paths during training (a minimal sketch follows this list).
- Regularization strategies promoting geometric diversity, isotropy, or explicit stepwise alignments with task structure.
- Multi-task or intermediate-step supervision to ensure that each step of the continuous reasoning process encodes semantically meaningful and task-relevant information.
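As one possible instantiation of the contrastive direction above (not the paper's method), here is a margin-based sketch that pushes pooled embeddings of correct and incorrect latent paths apart:

```python
import torch.nn.functional as F

def contrastive_path_loss(correct_emb, incorrect_emb, margin=0.5):
    """Margin-based separation of pooled trajectory embeddings: penalize
    correct/incorrect pairs whose cosine similarity exceeds `margin`.

    correct_emb:   (B, d) pooled embeddings of correct latent paths
    incorrect_emb: (B, d) pooled embeddings of incorrect latent paths
    """
    c = F.normalize(correct_emb, dim=-1)
    i = F.normalize(incorrect_emb, dim=-1)
    sim = (c * i).sum(dim=-1)            # cosine similarity per pair
    return F.relu(sim - margin).mean()
```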
Modifications along these lines would endow PRMs/ORMs with the anchor points required for effective discrimination, closing the gap between Pass@N and realized performance.
Table: Discrete vs. Continuous Inference-Time Scaling
| Aspect | Discrete (Text) | Continuous (COCONUT) |
|---|---|---|
| Sample Generation | Token-level stochastic sampling | Dropout-based hidden state sampling |
| Reranking (PRM/ORM) | Substantial, effective gains | Marginal, weak discrimination |
| Representation Geometry | Interpretable, compositional | Homogeneous, non-informative |
| Upper Bound vs. Real Gain | Narrow gap (real ≈ Pass@N) | Wide gap (real ≪ Pass@N) |
| Core Bottleneck | None dominant; reranking already effective | Lacks inductive bias in latent space |
Conclusion
Inference-time scaling for continuous-space reasoning demonstrates theoretical headroom via diverse trajectory sampling, as evidenced by strong Pass@N upper bounds. However, these potential gains are effectively unrealizable with current PRM/ORM mechanisms due to the lack of discriminative geometric and semantic structure in continuous latent reasoning paths. Advancing EM-INF utility in this setting necessitates a design shift: training regimes must impose structural inductive biases and discriminative objectives to empower reward models, enabling verifiable, interpretable, and scalable continuous-space reasoning.