Efficient Reasoning via Thought Compression for Language Segmentation

Published 2 Apr 2026 in cs.CV | (2604.02040v1)

Abstract: Chain-of-thought (CoT) reasoning has significantly improved the performance of large multimodal models in language-guided segmentation, yet its prohibitive computational cost, stemming from generating verbose rationales, limits real-world applicability. We introduce WISE (Wisdom from Internal Self-Exploration), a novel paradigm for efficient reasoning guided by the principle of thinking twice: once for learning, once for speed. WISE trains a model to generate a structured sequence: a concise rationale, the final answer, and then a detailed explanation. By placing the concise rationale first, our method leverages autoregressive conditioning to enforce that the concise rationale acts as a sufficient summary for generating the detailed explanation. This structure is reinforced by a self-distillation objective that jointly rewards semantic fidelity and conciseness, compelling the model to internalize its detailed reasoning into a compact form. At inference, the detailed explanation is omitted. To address the resulting conditional distribution shift, our inference strategy, WISE-S, employs a simple prompting technique that injects a brevity-focused instruction into the user's query. This final adjustment facilitates the robust activation of the learned concise policy, unlocking the full benefits of our framework. Extensive experiments show that WISE-S achieves state-of-the-art zero-shot performance on the ReasonSeg benchmark with 58.3 cIoU, while reducing the average reasoning length by nearly 5$\times$, from 112 to just 23 tokens. Code is available at https://github.com/mrazhou/WISE.


Authors (7)

Summary

The paper introduces WISE, which decouples detailed training reasoning from concise inference to achieve up to 10.6× token compression and improved accuracy.
It employs an autoregressive formulation with a self-distillation objective that optimizes semantic fidelity and penalizes verbosity.
Experimental results demonstrate state-of-the-art IoU scores, lower latency, and robust performance across multiple benchmarks.


Introduction: Motivations and Challenges

Language-guided segmentation has advanced from grounding simple referring expressions to executing complex, compositional visual reasoning. In this setting, recent methods incorporating Chain-of-Thought (CoT) prompting within Large Multimodal Models (LMMs) have demonstrated superior reasoning and zero-shot generalization capacity. However, such improvements are coupled with prohibitively high inference costs due to verbose rationalization, creating a bottleneck for real-world deployment in latency-sensitive applications.

This paper introduces WISE (Wisdom from Internal Self-Exploration) as a systematic approach to efficient reasoning for language segmentation. WISE explicitly decouples the learning phase—where detailed explanations may drive policy improvement—from the inference phase, where concise reasoning chains are critical for scalability and rapid response.

Figure 1: WISE achieves a superior cost-performance trade-off by decoupling reasoning for learning and inference. The base model reduces reasoning cost by 4.7× yet outperforms the baseline; WISE-S achieves up to 10.6× compression and state-of-the-art accuracy.

Methodology: Structured Generation and Thought Compression

The key innovation lies in restructuring the reasoning policy to generate its output as the sequence $(\tau_c, A, \tau_d)$, where $\tau_c$ is a concise rationale, $A$ is the answer (e.g., a geometric prompt), and $\tau_d$ is a detailed explanation. The ordering is deliberately chosen to exploit autoregressive dependence: because the detailed chain is conditioned on the concise rationale, $\tau_c$ is pushed to act as a sufficient, information-rich summary of $\tau_d$.
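To make the $(\tau_c, A, \tau_d)$ ordering concrete, the following is a minimal sketch of how such an output could be parsed and how $\tau_d$ could be dropped at inference. The tag names `<concise>`, `<answer>`, and `<detailed>` and both helper functions are hypothetical and chosen only for illustration; the summary does not specify the actual output template used by WISE.

```python
import re
from typing import NamedTuple, Optional

# Hypothetical tag names; the real WISE output template may differ.
PATTERN = re.compile(
    r"<concise>(?P<concise>.*?)</concise>\s*"
    r"<answer>(?P<answer>.*?)</answer>"
    r"(?:\s*<detailed>(?P<detailed>.*?)</detailed>)?",
    re.DOTALL,
)


class StructuredOutput(NamedTuple):
    concise: str              # tau_c, generated first
    answer: str               # A, e.g. a geometric prompt
    detailed: Optional[str]   # tau_d, None when omitted


def parse_output(text: str) -> Optional[StructuredOutput]:
    """Parse a (tau_c, A, tau_d) sequence; return None if the format is violated."""
    match = PATTERN.search(text)
    if match is None:
        return None
    detailed = match.group("detailed")
    return StructuredOutput(
        match.group("concise").strip(),
        match.group("answer").strip(),
        detailed.strip() if detailed is not None else None,
    )


def truncate_for_inference(text: str) -> str:
    """Keep everything up to the end of the answer, discarding tau_d."""
    end = text.find("</answer>")
    return text if end == -1 else text[: end + len("</answer>")]
```

Because $\tau_c$ and $A$ come before $\tau_d$, truncating after the answer leaves a complete, usable prediction. With the reverse order, the answer would only become available after the long rationale had been generated.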

A pivotal aspect of WISE is the self-distillation objective, which combines (1) a semantic fidelity term (cosine similarity between embeddings of $\tau_c$ and $\tau_d$) and (2) a conciseness term (normalized token reduction). This reward shaping is activated only when the policy achieves correct grounding (IoU $> 0.5$), which prevents the distillation of erroneous chains.
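The gated reward described above can be sketched as follows. This is an illustration under stated assumptions, not the paper's exact formula: the normalized token reduction is assumed to be $1 - |\tau_c| / |\tau_d|$ clipped at zero, the two terms are assumed to be combined by a weighted sum, and the weights `w_sim` and `w_len` are hypothetical. The embeddings are passed in as plain vectors, since the summary does not name the embedding model.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def self_distillation_reward(
    emb_concise: Sequence[float],   # embedding of tau_c
    emb_detailed: Sequence[float],  # embedding of tau_d
    len_concise: int,               # token count of tau_c
    len_detailed: int,              # token count of tau_d
    iou: float,                     # IoU of the predicted mask
    iou_threshold: float = 0.5,     # gate stated in the text
    w_sim: float = 1.0,             # assumed weight, not from the paper
    w_len: float = 1.0,             # assumed weight, not from the paper
) -> float:
    # Gate: only reward compression when grounding is correct (IoU > 0.5).
    if iou <= iou_threshold:
        return 0.0
    fidelity = cosine_similarity(emb_concise, emb_detailed)
    # Assumed form of "normalized token reduction": 1 - |tau_c| / |tau_d|.
    if len_detailed > 0:
        conciseness = max(0.0, 1.0 - len_concise / len_detailed)
    else:
        conciseness = 0.0
    return w_sim * fidelity + w_len * conciseness
```

The gate matters because, without it, the policy could be rewarded for faithfully compressing a chain that led to a wrong mask.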

During inference, the model omits $\tau_d$. Because the model was trained to produce $\tau_d$ after the answer, dropping it shifts the conditional distribution the model operates under. To activate the concise policy robustly despite this shift, WISE-S adds a brevity-focused instruction to the user's query that explicitly requests a short rationale. This simple modification mitigates the mismatch between training and inference.
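The WISE-S step amounts to appending an instruction to the query before it is sent to the model. The sketch below shows only that string manipulation. The instruction wording and the function name are hypothetical, since the summary does not quote the actual prompt.

```python
# Hypothetical wording; the actual WISE-S instruction is not given in this summary.
BREVITY_INSTRUCTION = "Keep the reasoning brief."


def build_wise_s_query(user_query: str, instruction: str = BREVITY_INSTRUCTION) -> str:
    """Append a brevity-focused instruction to the user's query (idempotent)."""
    query = user_query.strip()
    if query.endswith(instruction):
        return query  # avoid injecting the instruction twice
    return f"{query} {instruction}"
```

For example, `build_wise_s_query("Segment the object used to stir coffee.")` yields the original query followed by the brevity instruction.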

Figure 2: Overview of WISE. The concise rationale is generated first, followed by the answer; the detailed rationale is generated during training but omitted at inference time.

Figure 3: The prompting strategy specifies how different combinations of rationale instructions control the output form across training and inference.

Experimental Results: Efficiency, Accuracy, and Robustness

WISE and WISE-S report state-of-the-art performance on the ReasonSeg test set, obtaining 60.3 gIoU / 58.5 cIoU and 60.3 gIoU / 58.3 cIoU, respectively, and surpassing all prior methods, while reducing the average reasoning length by nearly 5$\times$ (23 tokens vs. 112 for Seg-Zero). The gains persist across multiple benchmarks (RefCOCO, RefCOCO+, RefCOCOg), confirming robustness to task shift and annotation sparsity.
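For readers unfamiliar with the two metrics, the sketch below shows how gIoU and cIoU are commonly computed in the reasoning-segmentation literature: gIoU averages the per-image IoU, while cIoU divides the cumulative intersection by the cumulative union over the whole set. Masks are represented as nested lists of 0/1 for self-containment, and scoring an empty union as 1.0 is an assumed convention. The helper `compression_ratio` is hypothetical and simply reproduces the arithmetic behind the "nearly 5$\times$" figure.

```python
from typing import List, Sequence, Tuple

Mask = Sequence[Sequence[int]]  # binary mask given as rows of 0/1


def intersection_and_union(pred: Mask, gt: Mask) -> Tuple[int, int]:
    """Count intersecting and union pixels of two same-shaped binary masks."""
    inter = union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    return inter, union


def giou_and_ciou(preds: Sequence[Mask], gts: Sequence[Mask]) -> Tuple[float, float]:
    """gIoU: mean of per-image IoUs. cIoU: cumulative intersection / cumulative union."""
    per_image: List[float] = []
    total_inter = total_union = 0
    for pred, gt in zip(preds, gts):
        inter, union = intersection_and_union(pred, gt)
        # Assumed convention: an empty union counts as a perfect match.
        per_image.append(inter / union if union else 1.0)
        total_inter += inter
        total_union += union
    giou = sum(per_image) / len(per_image)
    ciou = total_inter / total_union if total_union else 1.0
    return giou, ciou


def compression_ratio(baseline_tokens: float, compressed_tokens: float) -> float:
    """Ratio of baseline to compressed reasoning length."""
    return baseline_tokens / compressed_tokens
```

Because cIoU pools pixels across images, it is weighted toward large objects, which is one reason the two numbers can differ for the same model. Applying `compression_ratio(112, 23)` to the reported token counts gives about 4.87.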

WISE-S reduces wall-clock latency by 5$\times$ compared to RL-based baselines. Notably, control experiments show that simplistic brevity prompts applied to non-distilled baselines yield only marginal improvements, underscoring that the improvement derives from the distilled concise policy rather than mere prompt engineering.

Figure 4: The reasoning token length distribution; WISE-S (green) is tightly concentrated and concise, while Seg-Zero (purple) is long-tailed and variable.

Figure 5: Qualitative comparison on a complex instruction. WISE-S reasons concisely and predicts the correct target, whereas the baseline produces a verbose and ultimately incorrect chain.

Analysis: Ablation, Generalization, and Limitation

Ablation studies verify that the concise-first generation order, with $\tau_c$ preceding $\tau_d$, is essential for effective compression; reversing this order fails to induce abstraction and does not yield token savings. Both the similarity and compression components of the self-distillation reward are indispensable for robust, high-fidelity summarization.

In generalization experiments (VisionReasoner suite: detection, segmentation, counting), WISE-S consistently compresses rationales and, in most cases, increases accuracy, demonstrating cross-task scalability. Compression does not induce additional reasoning failures; rather, when errors occur, concise rationales faithfully summarize the chain's core logic, with limitations attributed to backbone grounding, not the distillation mechanism per se.

Figure 6: WISE-S acts as a sufficient summary for successful identification (left: affordance-based; right: discriminative attribute), whereas verbose baselines incur reasoning drift or redundancy.

Theoretical and Practical Implications

The results establish that deep chain-of-thought reasoning and inference efficiency are not inherently opposed. By enforcing a “concise-first” generation order with autoregressive conditioning and hierarchical self-distillation, LMM-based segmentation policies can internalize effective yet substantially compressed reasoning. The findings indicate that explicit reasoning remains beneficial for generalization, yet its computational burden is not a fixed cost—sufficient and lossless compression is achievable through structured learning objectives.

Practical implications span sustainable AI (reduced energy per inference), deployment scalability (lower latency for robotics and interactive agents), and improved alignment with reinforcement learning-based optimization, where verbose rationales are a debugging nuisance rather than a benefit.

Conclusion

WISE addresses the high inference cost of CoT-based segmentation by reordering the generation sequence and introducing a self-distillation objective that compresses the reasoning process. The concise rationales learned under this paradigm achieve superior accuracy with drastically reduced token budgets, challenging the perceived trade-off between reasoning depth and computational efficiency. The paradigm offers a promising template for broader adoption in structured multi-step reasoning tasks across vision-language and other multimodal domains.
