SegCompass: Exploring Interpretable Alignment with Sparse Autoencoders for Enhanced Reasoning Segmentation

Published 21 May 2026 in cs.CV, cs.LG, cs.MM, and eess.IV | (2605.22658v1)

Abstract: While LLMs provide strong compositional reasoning, existing reasoning segmentation pipelines fail to transparently connect this reasoning to visual perception. Current methods, such as latent query alignment, are end-to-end yet opaque "black boxes". Conversely, textual localization readout is merely readable, not truly interpretable, often functioning as an unconstrained post-hoc step. To bridge this interpretability gap, we propose SegCompass, an end-to-end model that leverages a Sparse Autoencoder (SAE) to forge an explicit, interpretable, and differentiable alignment pathway. Given an image-instruction pair, SegCompass first generates a chain-of-thought (CoT) trace. The core of our method is an SAE that maps both the CoT and visual tokens into a shared, high-dimensional sparse concept space. A query codebook selects salient concepts from this space, which are then spatially grounded by a slot mapper into a multi-slot heatmap that guides the final mask decoder. The entire model is trained jointly, unifying reinforcement learning for the reasoning path with standard segmentation supervision. This SAE-driven interface provides a "white-box" connection that is significantly more traceable than latent queries and more coherent than textual readouts. Extensive experiments on five challenging benchmarks demonstrate that SegCompass matches or surpasses state-of-the-art performance. Crucially, our visual and quantitative analyses show a strong correlation between the quality of the learned sparse concepts and final mask accuracy, confirming that SegCompass achieves superior results through its enhanced and inspectable alignment. Code is available at https://github.com/ZhenyuLU-Heliodore/SegCompass.

Authors (7)

Summary

The paper introduces SegCompass, which connects language reasoning to visual segmentation through a Sparse Autoencoder (SAE) that provides explicit alignment.
It uses a high-dimensional sparse concept space and a slot mapping mechanism to convert reasoning traces into localized visual masks.
The approach combines reinforcement learning with segmentation supervision, reporting IoU that matches or surpasses prior methods along with improved interpretability.

SegCompass: Interpretable Alignment for Reasoning Segmentation via Sparse Autoencoders

Motivation and Background

Reasoning segmentation, the task of localizing objects in images from complex, compositional language, is a fundamental capability for vision-language models (VLMs) and multimodal LLMs (MLLMs), particularly in applications such as robotics and multi-step visual analysis. Current segmentation frameworks often fall short on explicit interpretability. They rely either on latent query alignment (opaque end-to-end interactions between language and perception modules) or on post-hoc textual readouts (chain-of-thought (CoT) traces mapped to unconstrained location tokens). Both designs obscure the provenance of spatial segmentation decisions and hinder transparent alignment between linguistic reasoning and visual mask generation.

SegCompass addresses this interpretability gap by integrating a Sparse Autoencoder (SAE)-driven alignment pathway. This mechanism explicitly maps text-driven reasoning traces and visual features into a shared, high-dimensional sparse concept space, from which salient concepts are selected and spatially grounded via slot mapping, providing an inspectable “white-box” interface from language to mask prediction.

Methodology

Architecture Overview

SegCompass operates in an end-to-end manner, given an image-instruction pair. The backbone is a multi-modal LLM policy (such as LLaVA-1.5 or Qwen2.5-VL), which generates a CoT trace and $K$ concentration tokens for object referents. The SAE encodes both CoT and visual tokens into a $d_{sae}$-dimensional sparse feature space (e.g., $d_{sae} = 65{,}536$), designed for semantic disentanglement. A query codebook then selects and aggregates the activations to form $K$ concept representations, which, along with concentration token embeddings, are fused to generate queries for a slot mapping attention module.
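The encode-then-pool step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the ReLU encoder, the softmax pooling over tokens, and every name (`sae_encode`, `codebook_pool`, `W_enc`, `codebook`) are hypothetical, and random weights stand in for a pretrained SAE, so the codes here are non-negative but not genuinely sparse.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes for illustration; the summary cites d_sae = 65,536.
d_model, d_sae, K = 16, 64, 3
n_tokens = 10  # CoT tokens and visual tokens, treated as one sequence

# Hypothetical encoder parameters (random stand-ins for pretrained weights).
W_enc = rng.normal(scale=0.3, size=(d_model, d_sae))
b_enc = np.zeros(d_sae)


def sae_encode(h):
    """Map hidden states (n, d_model) to non-negative codes (n, d_sae)."""
    return np.maximum(h @ W_enc + b_enc, 0.0)


def codebook_pool(z, codebook):
    """Assumed pooling: each of the K codebook rows attends over tokens."""
    scores = codebook @ z.T                      # (K, n_tokens)
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)      # softmax over tokens
    return attn @ z                              # (K, d_sae)


h = rng.normal(size=(n_tokens, d_model))          # stand-in hidden states
z = sae_encode(h)
codebook = rng.normal(scale=0.1, size=(K, d_sae))
concepts = codebook_pool(z, codebook)
print(z.shape, concepts.shape)
```

Because each concept row is a convex combination of token codes, every entry of `concepts` can be traced back to the tokens and dictionary atoms that contributed to it, which is the property the interpretable pathway relies on.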

The slot mapper computes multi-head attention scores between queries and image keys extracted via a vision transformer backbone (e.g., SAM ViT-H), producing a multi-slot spatial heatmap and per-slot confidences. The decoder resamples this heatmap and employs a Two-Way Transformer block for mask prediction. All modules are jointly optimized under a loss coupling GRPO-based reinforcement learning (for language reasoning) and standard segmentation supervision for masks.
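As a rough sketch of the slot mapper, the snippet below computes multi-head attention scores between $K$ queries and a grid of image keys, then reshapes them into one heatmap per slot. Several details are assumptions made for illustration only: heads are averaged before the softmax, the confidence head is a single linear layer followed by a sigmoid, and the names `slot_heatmap` and `w_conf` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

K, d, n_heads = 3, 32, 4   # slots, feature width, attention heads (toy sizes)
H, W = 8, 8                # spatial grid of image keys
d_head = d // n_heads

queries = rng.normal(size=(K, d))            # fused concept queries (stand-in)
keys = rng.normal(size=(H * W, d))           # stand-in for ViT image features
w_conf = rng.normal(size=d) / np.sqrt(d)     # hypothetical confidence head


def slot_heatmap(queries, keys):
    """Return a (K, H, W) heatmap; each slot's map sums to 1 over space."""
    q = queries.reshape(K, n_heads, d_head).transpose(1, 0, 2)    # (heads, K, d_head)
    k = keys.reshape(H * W, n_heads, d_head).transpose(1, 0, 2)   # (heads, HW, d_head)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)           # (heads, K, HW)
    scores = scores.mean(axis=0)                                  # average heads
    scores -= scores.max(axis=1, keepdims=True)                   # stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)                       # softmax over space
    return attn.reshape(K, H, W)


heat = slot_heatmap(queries, keys)
conf = 1.0 / (1.0 + np.exp(-(queries @ w_conf)))  # per-slot confidence in (0, 1)
print(heat.shape, conf.shape)
```

The heatmap is the observable intermediate: a peak in `heat[k]` shows where slot `k` is looking before the mask decoder runs.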

Interpretable Sparse Concept Pathway

The SAE surfaces active dictionary atoms as explicit index-activation pairs, revealing which sparse features are associated with each token (text or vision). Query codebook and transformer encoders further aggregate sparse concepts while preserving attribution provenance, enabling direct inspection of which semantic concepts drive slot attention and mask prediction. The slot mapper translates fused queries into observable spatial heatmaps, with slot confidences indicating reliability—making intermediate representations traceable.
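To make "index-activation pairs" concrete, here is a small sketch that lists the strongest active atoms for each token. The sparse codes and token labels below are made-up toy values, and `top_atoms` is a hypothetical helper, not a function from the released code.

```python
def top_atoms(codes, k=3):
    """Return up to k (atom_index, activation) pairs, strongest first.

    Atoms with zero activation are dropped, so a very sparse row may
    yield fewer than k pairs.
    """
    ranked = sorted(enumerate(codes), key=lambda pair: pair[1], reverse=True)
    return [(i, a) for i, a in ranked[:k] if a > 0]


# Toy sparse codes for one text token and one visual token.
sparse_codes = {
    "dog": [0.0, 2.5, 0.0, 0.0, 0.7, 0.0],
    "<patch_17>": [1.2, 0.0, 0.0, 3.1, 0.0, 0.0],
}

for token, codes in sparse_codes.items():
    print(token, top_atoms(codes))
# dog [(1, 2.5), (4, 0.7)]
# <patch_17> [(3, 3.1), (0, 1.2)]
```

A listing like this is what makes the pathway inspectable: atoms shared between a text token and an image patch indicate which concept links the two.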

Training and Optimization

SegCompass is trained across multiple standard datasets: RefCOCO, RefCOCO+, RefCOCOg, gRefCOCO, and ReasonSeg, with both single- and multi-object instructions. The SAE is pretrained to reconstruct token hidden states under a sparsity constraint and is then frozen during segmentation training. The overall loss couples GRPO reinforcement learning (for reasoning trajectory optimization) with segmentation objectives (BCE and Dice losses for masks, BCE for slot confidences).
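The ingredients of this objective can be sketched in plain Python. The BCE and Dice formulas and the group-relative advantage (reward minus group mean, divided by group standard deviation) are the standard definitions; the smoothing constants, the unweighted sum of the two mask terms, and the toy inputs are assumptions, since the summary does not give loss weights. The sketch also omits GRPO's clipping and KL terms.

```python
import math


def bce(probs, targets, eps=1e-7):
    """Mean binary cross-entropy over flattened pixels (or slot confidences)."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(probs)


def dice_loss(probs, targets, smooth=1.0):
    """1 - Dice coefficient, with additive smoothing (assumed value)."""
    inter = sum(p * t for p, t in zip(probs, targets))
    denom = sum(probs) + sum(targets)
    return 1.0 - (2.0 * inter + smooth) / (denom + smooth)


def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages: standardize rewards within one sampled group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    return [(r - mean) / (math.sqrt(var) + eps) for r in rewards]


# Toy mask prediction and ground truth (flattened pixels).
pred = [0.9, 0.2, 0.8, 0.1]
gt = [1, 0, 1, 0]
mask_loss = bce(pred, gt) + dice_loss(pred, gt)  # unweighted sum: an assumption

# Toy rewards for four CoT samples drawn for the same image-instruction pair.
advantages = grpo_advantages([1.0, 0.0, 1.0, 0.0])
print(round(mask_loss, 4), [round(a, 3) for a in advantages])
```

In a GRPO-style update, each sampled reasoning trace's log-probability is weighted by its advantage, so traces that score above their group's average are reinforced and the rest are suppressed.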

Ablations demonstrate that unifying reinforcement learning (for reasoning structure) and mask supervision yields superior accuracy relative to either alone; larger vision backbones improve mask quality; segmentation-oriented rewards dominate format rewards in contributing to downstream performance.

Empirical Results

SegCompass achieves state-of-the-art performance across five challenging benchmarks, consistently matching or surpassing prior methods in cumulative and mean IoU metrics. On RefCOCO(+/g), gRefCOCO, and ReasonSeg, SegCompass outperforms latent query alignment and textual localization readout approaches, especially when scaled up to larger backbones (13B+). Notably, reinforcement learning (GRPO) markedly enhances generalization in zero-shot settings on ReasonSeg.
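The two IoU variants can diverge, so it may help to see how each is computed. In the usual convention, mean IoU averages the per-image IoU, while cumulative IoU divides the total intersection by the total union across the whole dataset. The sketch below follows that convention; the function names, the toy masks, and the choice to score an empty prediction against an empty target as 1.0 are assumptions.

```python
def iou_counts(pred, gt):
    """Return (intersection, union) pixel counts for two binary masks."""
    inter = sum(1 for p, g in zip(pred, gt) if p and g)
    union = sum(1 for p, g in zip(pred, gt) if p or g)
    return inter, union


def mean_and_cumulative_iou(pairs):
    """pairs: iterable of (pred_mask, gt_mask) as flat 0/1 sequences."""
    per_image = []
    total_inter = total_union = 0
    for pred, gt in pairs:
        inter, union = iou_counts(pred, gt)
        per_image.append(inter / union if union else 1.0)  # assumed convention
        total_inter += inter
        total_union += union
    mean_iou = sum(per_image) / len(per_image)
    cum_iou = total_inter / total_union if total_union else 1.0
    return mean_iou, cum_iou


# A small object segmented poorly and a large object segmented perfectly.
small = ([1, 0, 0, 0], [1, 1, 0, 0])    # IoU = 1/2
large = ([1] * 8 + [0] * 2, [1] * 8 + [0] * 2)  # IoU = 8/8
mean_iou, cum_iou = mean_and_cumulative_iou([small, large])
print(mean_iou, cum_iou)  # 0.75 0.9
```

Cumulative IoU is dominated by large objects, whereas mean IoU weights every image equally, which is why benchmarks often report both.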

Strong numerical results corroborate the claim that sparse concept quality directly correlates with mask accuracy: quantitative analyses show that instance coverage by top-K SAE activations substantially exceeds random baselines, and correlation studies between SAE reconstruction and mask Dice loss reveal robust association (Pearson $R > 0.69$).
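For readers who want to run the same kind of correlation check on their own measurements, a Pearson coefficient can be computed as below. The function is the textbook formula; the input values are invented placeholders chosen only to exercise the code and are not data from the paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Placeholder per-sample measurements (hypothetical, not from the paper):
# an SAE reconstruction error and the matching mask Dice loss.
recon_error = [0.10, 0.20, 0.30, 0.40]
dice = [0.20, 0.40, 0.60, 0.80]
print(pearson(recon_error, dice))
```

A coefficient near 1 means the two quantities rise together across samples; it does not by itself establish that better sparse concepts cause better masks.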

Qualitative visualizations confirm that activated SAE tokens and multi-slot heatmap peaks tightly localize semantically-relevant regions as specified by the CoT instruction, validating the pathway’s interpretable mapping from language to spatial perception.

Implications and Future Directions

The explicit, interpretable alignment enabled by SegCompass opens avenues for reliable reasoning segmentation: intermediate concepts can be audited, manipulated, or constrained, informing robust deployment in high-stakes environments (e.g., facilitated debugging or targeted interventions in robotics). The architecture offers a scalable paradigm for integrating sparse concept encoding with slot-based spatial grounding, enhancing controllability and transparency.

Future developments could extend SAE-driven interpretable alignment to broader multimodal reasoning tasks, couple with advanced slot attention mechanisms, and explore hierarchical segmentation driven by compositional reasoning traces. Investigations into sequential concept tracing, online concept revision, or modularity in concept space could further advance interpretable AI in vision-language systems.

Conclusion

SegCompass establishes a new paradigm for reasoning segmentation by fusing SAE-derived sparse concept encoding with interpretable spatial grounding, jointly optimized via reinforcement and mask supervision. Empirical evidence confirms that this explicit alignment mechanism not only yields high performance but also provides genuine interpretability, tightly linking reasoning provenance to dense visual perception (2605.22658). The approach sets the foundation for transparent and controllable multimodal systems, with broad practical and theoretical ramifications for the future of interpretable AI.
