Spectral-Progressive Thought Flow for Lightweight Multimodal Reasoning

Published 1 Jun 2026 in cs.LG | (2606.02842v1)

Abstract: Multimodal spatial reasoning often relies on long chains of intermediate textual and visual thoughts, where accumulating visual tokens and dense cross-modal attention incur substantial computation and memory overhead. To address this challenge, we propose Spectral-Progressive Thought Flow (SpecFlow), a novel lightweight multimodal spatial reasoning framework that represents intermediate visual thoughts in a fixed-size discrete cosine space. By exploiting strong energy compaction, SpecFlow preserves global layout and relational structure while introducing high-frequency details only when increased spatial precision is required. To align visual state evolution with linguistic intent, classifier-free guidance enables autoregressive textual thoughts to steer flow-based updates of the visual workspace/state without expanding the context. As a result, SpecFlow maintains a bounded visual workspace whose updates depend only on the current visual state and accumulated textual trace, enabling long-horizon inference with stable latency and memory usage independent of reasoning depth. Empirical results show that SpecFlow achieves competitive or superior reasoning performance while reducing computation and KV cache costs by up to 2.1 times.


Authors (9)

Summary

The paper introduces SpecFlow, a novel paradigm that decouples inference cost from reasoning depth by evolving visual states in a fixed-size spectral domain.
It employs blockwise spectral compression and progressive frequency unmasking alongside cosine-space flow matching with classifier-free guidance for efficient, interpretable multimodal reasoning.
Experimental results show significant reductions in KV-cache usage and memory, while maintaining or improving accuracy on challenging spatial reasoning benchmarks.


Introduction and Motivation

Multimodal spatial reasoning requires a model to process and coordinate both language and visual inputs, especially across multi-hop inferences where intermediate spatial layouts and relational states must be tracked and updated. The prevailing paradigm, interleaved chain-of-thought (CoT) reasoning, generates a sequence of textual and visual thoughts that are appended to the context at each reasoning step. While this approach offers strong spatial grounding and interpretability, it causes the computational and KV-cache (key-value cache) costs to grow with every additional hop, leading to prohibitive memory and latency, especially when visual thoughts dominate the context with $\mathcal{O}(10^3)$ tokens per image.

Attempts to reduce this overhead include on-the-fly token pruning and compression of visual tokens, as well as projecting intermediate states into continuous or latent spaces. However, explicit pruning often discards necessary structure, whereas implicit latent representations lack interpretability and alignment with annotated spatial reasoning cues.

Spectral-Progressive Thought Flow (SpecFlow) addresses these challenges by evolving intermediate visual thoughts in a fixed-size frequency (cosine) domain, which decouples inference cost from the number of reasoning steps. Instead of accumulating dense pixel-space visual tokens, SpecFlow maintains a bounded spectral visual workspace that is overwritten with a compact update at each hop.
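The difference between accumulating and bounded visual context can be made concrete with simple token accounting. The sketch below is illustrative only: the per-hop text length (`TEXT_PER_HOP`) and the workspace size (`WORKSPACE_TOKENS`) are hypothetical values chosen for the example, and only the order of magnitude of `VISUAL_PER_HOP` comes from the text above. Both helper functions are assumptions, not part of SpecFlow.

```python
def context_tokens_accumulating(hops, text_per_hop, visual_per_hop):
    """Context length after each hop when every visual thought is appended."""
    return [h * (text_per_hop + visual_per_hop) for h in range(1, hops + 1)]


def context_tokens_bounded(hops, text_per_hop, workspace_tokens):
    """Context length after each hop when one fixed-size workspace is overwritten.

    Text still accumulates; the visual contribution stays constant.
    """
    return [h * text_per_hop + workspace_tokens for h in range(1, hops + 1)]


HOPS = 8
TEXT_PER_HOP = 50        # hypothetical text-thought length
VISUAL_PER_HOP = 1000    # order of magnitude quoted in the text
WORKSPACE_TOKENS = 256   # hypothetical fixed workspace size

accumulating = context_tokens_accumulating(HOPS, TEXT_PER_HOP, VISUAL_PER_HOP)
bounded = context_tokens_bounded(HOPS, TEXT_PER_HOP, WORKSPACE_TOKENS)

for hop, (a, b) in enumerate(zip(accumulating, bounded), start=1):
    print(f"hop {hop}: accumulating={a} tokens, bounded={b} tokens")
```

Under these assumed numbers, the visual share of the context grows linearly with hops in the first scheme and stays fixed in the second; the text trace grows in both.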

Figure 1: Comparison of multimodal spatial reasoning paradigms. (a) Text-only CoT lacks spatial grounding; (b) Image–Text Co-Thought improves grounding but causes context and KV-cache growth; (c) SpecFlow updates a compact spectral visual state, enabling efficient multi-hop reasoning with stable memory.

Method: Spectral-Progressive Flow-Based Visual Workspace

Model Formulation

SpecFlow iterates between (1) autoregressive generation of textual thoughts, conditioned on factual queries and the accumulated text state, and (2) flow-based, deterministic evolution of a compact visual state in a discrete cosine domain. Each new visual thought is produced not by autoregressively generating tokens, but by integrating a learned, spectrally-masked velocity field (flow) over coefficients encoding the visual state.

Figure 2: Spectral-Progressive Thought Flow (SpecFlow) alternates autoregressive text thoughts with text-conditioned, flow-based updates of a continuous visual state. Visual states are overwritten at each hop and represented in the cosine domain with progressively activated frequency bands, enabling efficient multimodal spatial reasoning without accumulating visual tokens or growing the context length.

The key architectural innovation is representing the visual workspace in a blockwise discrete cosine space. Most of the semantic signal in visual layouts is stored in low-frequency coefficients, which encode global structure, while high-frequency components capture only incidental local detail. By using a time-dependent spectral mask $M(t)$, SpecFlow progressively unmasks frequency bands as reasoning unfolds, starting with low frequencies for coarse layout and releasing higher frequencies only as needed for local refinement.
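A minimal NumPy sketch of block cosine projection with a progressive mask follows. It is an illustration under stated assumptions, not the paper's implementation: the orthonormal DCT-II basis is standard, but the linear zigzag-style schedule in `spectral_mask` (keep coefficients with `u + v` below a cutoff that grows with `t`) is a hypothetical choice, since the source does not specify the form of $M(t)$. The block size and the synthetic test image are also assumptions.

```python
import numpy as np


def dct_matrix(b):
    """Orthonormal DCT-II basis as a (b, b) matrix; rows index frequency."""
    n = np.arange(b)
    k = n.reshape(-1, 1)
    C = np.sqrt(2.0 / b) * np.cos(np.pi * (2 * n + 1) * k / (2 * b))
    C[0, :] = np.sqrt(1.0 / b)
    return C


def block_dct(x, b):
    """Blockwise 2D DCT of an (H, W) array; H and W must be divisible by b."""
    H, W = x.shape
    C = dct_matrix(b)
    blocks = x.reshape(H // b, b, W // b, b).transpose(0, 2, 1, 3)
    return C @ blocks @ C.T  # shape (H/b, W/b, b, b)


def block_idct(coeffs, b):
    """Inverse of block_dct: back to an (H, W) array."""
    C = dct_matrix(b)
    blocks = C.T @ coeffs @ C
    nb_h, nb_w = blocks.shape[:2]
    return blocks.transpose(0, 2, 1, 3).reshape(nb_h * b, nb_w * b)


def spectral_mask(t, b):
    """Hypothetical M(t): keep coefficients with u + v <= cutoff, t in [0, 1]."""
    u, v = np.meshgrid(np.arange(b), np.arange(b), indexing="ij")
    cutoff = t * 2 * (b - 1)
    return (u + v <= cutoff).astype(float)


# Synthetic smooth "layout" plus a little noise (illustrative data only).
rng = np.random.default_rng(0)
b = 8
yy, xx = np.mgrid[0:32, 0:32] / 32.0
image = np.sin(2 * np.pi * xx) + yy + 0.05 * rng.standard_normal((32, 32))

coeffs = block_dct(image, b)
rel_errors = []
for t in (0.0, 0.25, 0.5, 1.0):
    M = spectral_mask(t, b)
    recon = block_idct(coeffs * M, b)  # mask broadcasts over all blocks
    err = np.linalg.norm(recon - image) / np.linalg.norm(image)
    rel_errors.append(err)
    print(f"t={t:.2f} kept={int(M.sum())}/{b * b} rel_err={err:.4f}")
```

Because the basis is orthonormal and the masks are nested, the reconstruction error can only shrink as `t` grows, which is the coarse-to-fine behaviour described above.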

Figure 3: Spectral-progressive frequency allocation with block cosine projection. (a) An intermediate visual state. (b) Partition into $b\times b$ blocks and employ block cosine projection. (c) Average coefficient energy concentrates in low frequencies. (d) Reconstruction using only the low-frequency bands preserves global layout. (e) Retaining a small subset of coefficients yields significant visual-token reduction per hop.

This coarse-to-fine spectral schedule guarantees that (i) the workspace remains spatially interpretable, and (ii) memory use is bounded and stable regardless of the number of reasoning hops.

Cosine-Space Flow Matching with Classifier-Free Guidance

The update to the spectral workspace is formulated as a deterministic ODE in coefficient space, where the learned velocity field is conditioned both on the current workspace and the hop-specific text context. Classifier-free guidance (CFG) is used to steer the dynamics, isolating the direction in velocity space attributable to the text prompt and amplifying it via a guidance scale $w$. This ensures alignment between textual intent and visual update, which is critical for faithful spatial reasoning.
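The guided update can be sketched as an Euler integration of a masked, CFG-combined velocity. This is a hedged illustration: the combination `v_uncond + w * (v_cond - v_uncond)` is one common CFG convention and may differ from the paper's exact parameterization, and `toy_velocity`, `low_freq_only`, and `integrate_masked_flow` are hypothetical stand-ins for the learned velocity network, the mask schedule, and the solver.

```python
import numpy as np


def cfg_velocity(v_cond, v_uncond, w):
    """One common CFG form: amplify the text-attributable direction by w."""
    return v_uncond + w * (v_cond - v_uncond)


def integrate_masked_flow(z0, velocity_fn, text_emb, mask_fn, w=4.0, num_steps=8):
    """Euler-integrate dz/dt = M(t) * v_cfg(z, t) from t=0 to t=1."""
    z = z0.copy()
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        v_c = velocity_fn(z, t, text_emb)  # text-conditioned velocity
        v_u = velocity_fn(z, t, None)      # unconditional velocity
        z = z + dt * mask_fn(t) * cfg_velocity(v_c, v_u, w)
    return z


def toy_velocity(z, t, text_emb):
    """Hypothetical stand-in for a learned field: pull z toward a target."""
    target = text_emb if text_emb is not None else np.zeros_like(z)
    return target - z


def low_freq_only(t):
    """Hypothetical fixed mask: only the first 4 of 16 coefficients may move."""
    return np.concatenate([np.ones(4), np.zeros(12)])


rng = np.random.default_rng(0)
z0 = rng.standard_normal(16)   # flattened spectral workspace (toy size)
text_emb = np.ones(16)         # toy "text" target

z1 = integrate_masked_flow(z0, toy_velocity, text_emb, low_freq_only, w=4.0, num_steps=8)
print("masked-out coefficients unchanged:", np.allclose(z1[4:], z0[4:]))
```

Setting `w = 1` recovers the purely conditional velocity and `w = 0` the unconditional one; coefficients whose mask entry is zero are never updated, so the workspace size stays fixed across hops.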

Experimental Evaluation

Benchmarks and Main Results

SpecFlow is evaluated on multimodal spatial reasoning and sequential planning benchmarks, including VSR (claim verification), V-Star (visual search), EmbSpatial (relational grounding), Winoground (compositional alignment), Maze, MiniBehavior, and FrozenLake, and is compared against state-of-the-art baselines. On both spatial reasoning and spatial decision-making, SpecFlow achieves accuracy competitive with or superior to these baselines, with sharply reduced latency and KV-cache consumption.

The computation and memory advantages are most pronounced as reasoning depth increases. On compositional benchmarks, SpecFlow achieves $1.6\times$ to $1.8\times$ lower KV-cache usage than token-accumulating approaches, while on dynamic, long-horizon environments the reduction reaches up to $2.1\times$.

Figure 4: Effect of CFG guidance scale on reasoning accuracy. Performance improves with increasing guidance strength and peaks at a moderate scale ($w=4$), while excessively large guidance yields diminishing returns due to over-deterministic conditioning.

Figure 5: Effect of the number of ODE inference steps $T$ on reasoning accuracy. Accuracy improves with increasing $T$ and saturates at moderate step counts, beyond which additional steps yield diminishing returns and incur higher computational cost.
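The saturation with step count has a simple numerical analogue. The sketch below uses a toy ODE, $dz/dt = -z$ with $z(0)=1$, which is an assumption chosen because its exact solution is known; it measures integration error rather than reasoning accuracy, so it illustrates only the diminishing-returns pattern of adding Euler steps, not the paper's results.

```python
import math


def euler_final_state(num_steps):
    """Explicit Euler for dz/dt = -z on [0, 1], starting from z(0) = 1."""
    z = 1.0
    dt = 1.0 / num_steps
    for _ in range(num_steps):
        z += dt * (-z)
    return z


exact = math.exp(-1.0)
step_counts = (1, 2, 4, 8, 16, 32)
errors = {T: abs(euler_final_state(T) - exact) for T in step_counts}

for T, err in errors.items():
    print(f"T={T:2d} abs_error={err:.5f}")
```

Each doubling of `T` roughly halves the error while doubling the number of velocity evaluations, so the benefit per extra step shrinks as `T` grows.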

Ablations confirm the benefit of progressive spectral unmasking: fixed low-frequency masking offers efficiency at the expense of accuracy, while adaptive progression achieves both high accuracy and low overhead.

Visual Reasoning Trajectory Analysis

Qualitative trajectories give further insight into how SpecFlow operates. In sequential planning (e.g., Maze), the planned route is extended hop by hop while the global maze geometry is preserved (Figure 6).

Figure 6: Success case of multi-hop Maze planning with SpecFlow. Each panel shows the intermediate visual workspace at one hop, overlaid with the current planned trajectory in red. The route is progressively extended while preserving global Maze geometry, and the final hop yields a coherent collision-free path that reaches the goal.

In MiniBehavior, the compact workspace supports tracking of subgoals and phase transitions. In navigation environments, SpecFlow iteratively constructs a valid, safe trajectory without accumulating dense intermediate visual tokens.

Figure 7: Qualitative MiniBehavior example: fetch then place. We visualize multi-hop visual thoughts for a two-stage manipulation task. The red rectangle denotes the current subgoal region encoded in the workspace, which first targets the printer for pickup and then switches to the table for placement. The sequence illustrates how SpecFlow overwrites a compact workspace to track subgoals and progress, enabling multi-step execution without accumulating dense intermediate visual tokens in the autoregressive context.

Theoretical and Practical Implications

SpecFlow makes several strong claims and empirically validates them:

KV-cache and memory cost become independent of reasoning depth: The fixed-size spectral workspace discards all previous visual states at each hop, introducing an $\mathcal{O}(1)$ cost per hop and removing memory as a bottleneck in long-horizon reasoning.
Visual token accumulation is eliminated: Unlike autoregressive or token-pruning schemes that only mitigate growth, SpecFlow prevents context bloat altogether by architectural design.
Visual thought fidelity is maintained by frequency-adaptive updates: Progressive unmasking of higher frequencies occurs only when textual or spatial cues demand it, ensuring that fine-grained detail is only synthesized as needed.

On the theoretical side, representing updates in low-frequency subspaces regularizes the ODE dynamics, lowering the effective Lipschitz constant and supporting larger integration steps for efficient deterministic inference. The combination of blockwise spectral projection and a VAE latent bottleneck compounds the advantage, ensuring that the model's compute resources are focused on semantically meaningful updates.
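The Lipschitz argument can be made slightly more explicit. The derivation below is a sketch under assumptions not stated in the source: the velocity field $v(\cdot,t)$ is taken to be $L$-Lipschitz in the $\ell_2$ norm, the mask $M(t)$ is binary, the bound is applied on a time interval where the mask is fixed, and $K$ is an assumed bound on the second time derivative of the trajectory. The error bound is the classical one for explicit Euler with step size $h$.

```latex
% Masking cannot increase the Lipschitz constant (M(t) has entries in {0,1}):
\[
\bigl\| M(t)\odot v(x,t) - M(t)\odot v(y,t) \bigr\|_2
\;\le\; \bigl\| v(x,t) - v(y,t) \bigr\|_2
\;\le\; L \,\| x - y \|_2 ,
\qquad\text{so } L_M \le L .
\]
% Classical global error bound for explicit Euler with step size h,
% with \|\ddot z(t)\| \le K on the interval and L_M > 0:
\[
\| e_n \| \;\le\; \frac{h\,K}{2\,L_M}\,\bigl( e^{L_M (t_n - t_0)} - 1 \bigr).
\]
```

For a fixed error tolerance, a smaller effective constant $L_M$ permits a larger step size $h$, which is consistent with the claim that low-frequency updates support coarser integration.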

Future Directions

SpecFlow's spectral coarse-to-fine principle is general and suggests extensions to broader generative modeling problems. For intermediate representations in visual reasoning and planning, it provides a template for representing and controlling the tradeoff between structural fidelity and compute. Integrating spectral scheduling or cosine-space evolution into high-fidelity synthesis (e.g., image generation tasks) is a promising future avenue.

Further, SpecFlow's design is compatible with parameter-efficient adaptation techniques (e.g., LoRA or adapter projection), works plug-and-play with different autoregressive (AR) language backbones, and can be deployed for efficient, interpretable, long-horizon reasoning under tight resource budgets.

Conclusion

SpecFlow demonstrates that explicit spectral-domain evolution of visual states, combined with autoregressive textual reasoning and classifier-free guidance, enables lightweight, memory-stable, and interpretable multimodal reasoning without sacrificing accuracy. By obviating the need for intermediate pixel-space token accumulation or complex token pruning heuristics, SpecFlow offers a robust and efficient framework with substantial practical and theoretical implications for the design of next-generation multimodal reasoners.

Figure 8: FrozenLake qualitative example with multi-hop visual thoughts. The sequence progressively refines a collision-free route that avoids holes and reaches the goal, demonstrating bounded-workspace long-horizon reasoning.
