- The paper introduces SegCompass, which fuses language reasoning and visual segmentation using a Sparse Autoencoder for explicit alignment.
- It employs a high-dimensional sparse concept space and a slot mapping mechanism to convert reasoning traces into localized visual masks.
- The approach combines reinforcement learning with segmentation supervision and reports higher IoU than prior methods along with improved interpretability.
SegCompass: Interpretable Alignment for Reasoning Segmentation via Sparse Autoencoders
Motivation and Background
Reasoning segmentation, i.e., localizing objects in images based on complex, compositional language, is a fundamental capability for VLMs and MLLMs, particularly in applications such as robotics and multi-step visual analysis. Current segmentation frameworks often fall short in explicit interpretability, relying either on latent query alignments (opaque end-to-end interactions between language and perception modules) or post-hoc textual readouts (chain-of-thought, CoT, traces mapped to unconstrained location tokens). These architectures obscure the provenance of spatial segmentation decisions and hinder transparent alignment between linguistic reasoning and visual mask generation.
SegCompass addresses this interpretability gap by integrating a Sparse Autoencoder (SAE)-driven alignment pathway. This mechanism explicitly maps text-driven reasoning traces and visual features into a shared, high-dimensional sparse concept space, from which salient concepts are selected and spatially grounded via slot mapping, providing an inspectable "white-box" interface from language to mask prediction.
Methodology
Architecture Overview
Given an image-instruction pair, SegCompass operates end to end. The backbone is a multimodal LLM policy (such as LLaVA-1.5 or Qwen2.5-VL) that generates a CoT trace and K concentration tokens for object referents. The SAE encodes both CoT and visual tokens into a d_sae-dimensional sparse feature space (e.g., d_sae = 65,536) designed for semantic disentanglement. A query codebook then selects and aggregates the activations to form K concept representations, which are fused with the concentration token embeddings to generate queries for a slot mapping attention module.
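The sparse encoding step can be illustrated with a minimal NumPy sketch. It assumes a single-layer SAE with a ReLU followed by top-k selection; the summary does not state which sparsity mechanism SegCompass uses, so the top-k rule, the random weights, and the toy sizes (d_sae = 256 instead of 65,536) are illustrative assumptions only.

```python
import numpy as np

def sae_encode(hidden, W_enc, b_enc, k):
    """Map token hidden states (tokens, d_model) to sparse codes (tokens, d_sae).

    Assumption: ReLU followed by keeping the k largest activations per token.
    """
    act = np.maximum(hidden @ W_enc + b_enc, 0.0)          # ReLU pre-activations
    top_idx = np.argpartition(act, -k, axis=1)[:, -k:]     # k largest per row
    sparse = np.zeros_like(act)
    rows = np.arange(act.shape[0])[:, None]
    sparse[rows, top_idx] = act[rows, top_idx]             # zero out everything else
    return sparse

def sae_decode(sparse, W_dec, b_dec):
    """Reconstruct hidden states from sparse codes."""
    return sparse @ W_dec + b_dec

rng = np.random.default_rng(0)
d_model, d_sae, k = 16, 256, 8                             # toy sizes (hypothetical)
W_enc = rng.normal(size=(d_model, d_sae)) / np.sqrt(d_model)
b_enc = np.zeros(d_sae)
W_dec = rng.normal(size=(d_sae, d_model)) / np.sqrt(d_sae)
b_dec = np.zeros(d_model)

# CoT (text) tokens and visual tokens share the same concept space.
text_tokens = rng.normal(size=(5, d_model))
vision_tokens = rng.normal(size=(9, d_model))
tokens = np.concatenate([text_tokens, vision_tokens], axis=0)

codes = sae_encode(tokens, W_enc, b_enc, k)
recon = sae_decode(codes, W_dec, b_dec)
print(codes.shape, int((codes != 0).sum(axis=1).max()))
```

Because both modalities pass through the same encoder, a text token and an image patch that activate the same dictionary index can be compared directly, which is the property the alignment pathway relies on.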
The slot mapper computes multi-head attention scores between queries and image keys extracted via a vision transformer backbone (e.g., SAM ViT-H), producing a multi-slot spatial heatmap and per-slot confidences. The decoder resamples this heatmap and employs a Two-Way Transformer block for mask prediction. All modules are jointly optimized under a loss coupling GRPO-based reinforcement learning (for language reasoning) and standard segmentation supervision for masks.
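To make the slot mapping step concrete, the following NumPy sketch computes multi-head attention scores between K queries and a grid of image keys, then reshapes them into per-slot heatmaps. The softmax over spatial positions, the averaging over heads, and the use of peak attention as a confidence proxy are assumptions made for illustration; the summary does not specify how SegCompass derives its confidences.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def slot_heatmaps(queries, keys, num_heads, grid_hw):
    """queries: (K, d); keys: (H*W, d). Returns heatmaps (K, H, W) and confidences (K,)."""
    K, d = queries.shape
    N = keys.shape[0]
    d_head = d // num_heads
    q = queries.reshape(K, num_heads, d_head).transpose(1, 0, 2)   # (heads, K, d_head)
    k = keys.reshape(N, num_heads, d_head).transpose(1, 0, 2)      # (heads, N, d_head)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)            # (heads, K, N)
    attn = softmax(scores, axis=-1).mean(axis=0)                   # average heads -> (K, N)
    heat = attn.reshape(K, *grid_hw)
    conf = attn.max(axis=-1)   # hypothetical proxy: peak attention per slot
    return heat, conf

rng = np.random.default_rng(0)
queries = rng.normal(size=(3, 8))      # K = 3 fused queries (toy size)
keys = rng.normal(size=(16, 8))        # 4 x 4 grid of image keys
heat, conf = slot_heatmaps(queries, keys, num_heads=2, grid_hw=(4, 4))
print(heat.shape, conf.round(3))
```

Each heatmap sums to one over the grid, so it can be read as a distribution over image locations for the corresponding slot.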
Interpretable Sparse Concept Pathway
The SAE surfaces active dictionary atoms as explicit index-activation pairs, revealing which sparse features are associated with each token (text or vision). The query codebook and transformer encoders further aggregate sparse concepts while preserving attribution provenance, enabling direct inspection of which semantic concepts drive slot attention and mask prediction. The slot mapper translates fused queries into observable spatial heatmaps, with slot confidences indicating reliability, which makes intermediate representations traceable.
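As a small illustration of what an index-activation readout might look like, the sketch below lists the nonzero entries of one token's sparse code, strongest first. The helper name `active_atoms` and the toy 12-dimensional code are hypothetical and are not taken from the paper.

```python
import numpy as np

def active_atoms(code, top_n=None):
    """Return (index, activation) pairs for nonzero entries, strongest first."""
    idx = np.flatnonzero(code)                       # indices of active atoms
    order = np.argsort(-code[idx], kind="stable")    # sort by activation, descending
    pairs = [(int(i), float(code[i])) for i in idx[order]]
    return pairs if top_n is None else pairs[:top_n]

# Toy sparse code for a single token (hypothetical values).
code = np.zeros(12)
code[[2, 7, 9]] = [0.5, 2.0, 1.25]

pairs = active_atoms(code)
print(pairs)  # [(7, 2.0), (9, 1.25), (2, 0.5)]
```

A readout of this form is what allows a reviewer to ask which concepts a given word or image patch activated before any mask is produced.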
Training and Optimization
SegCompass is trained across multiple standard datasets: RefCOCO, RefCOCO+, RefCOCOg, gRefCOCO, and ReasonSeg, with both single- and multi-object instructions. The SAE is pretrained to reconstruct token hidden states under a sparsity constraint and is frozen during segmentation training. The overall loss couples GRPO reinforcement (for reasoning trajectory optimization) with segmentation objectives (BCE and Dice losses for masks, BCE for slot confidences).
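The segmentation part of the objective can be sketched as follows. This is a minimal NumPy version under stated assumptions: the loss weights (`w_bce`, `w_dice`, `w_conf`) and the Dice smoothing constant are hypothetical, since the summary gives no values, and the GRPO term is omitted.

```python
import numpy as np

def bce(prob, target, eps=1e-7):
    """Binary cross-entropy on probabilities, averaged over all elements."""
    p = np.clip(prob, eps, 1 - eps)
    return float(-(target * np.log(p) + (1 - target) * np.log(1 - p)).mean())

def dice_loss(prob, target, smooth=1.0):
    """Soft Dice loss: 1 - 2|P * G| / (|P| + |G|), with additive smoothing."""
    inter = (prob * target).sum()
    return float(1.0 - (2 * inter + smooth) / (prob.sum() + target.sum() + smooth))

def segmentation_loss(mask_prob, mask_gt, conf_prob, conf_gt,
                      w_bce=1.0, w_dice=1.0, w_conf=1.0):
    """Weighted sum of mask BCE, mask Dice, and confidence BCE (weights are assumptions)."""
    return (w_bce * bce(mask_prob, mask_gt)
            + w_dice * dice_loss(mask_prob, mask_gt)
            + w_conf * bce(conf_prob, conf_gt))

mask_gt = np.zeros((4, 4))
mask_gt[1:3, 1:3] = 1.0
mask_prob = np.full((4, 4), 0.1)
mask_prob[1:3, 1:3] = 0.8                     # a reasonably good soft prediction
loss = segmentation_loss(mask_prob, mask_gt,
                         conf_prob=np.array([0.9]), conf_gt=np.array([1.0]))
print(round(loss, 4))
```

Pairing BCE with Dice is a common choice because BCE penalizes per-pixel errors while Dice is less sensitive to the foreground/background imbalance typical of small objects.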
Ablations indicate that combining reinforcement learning (for reasoning structure) with mask supervision yields higher accuracy than either alone, that larger vision backbones improve mask quality, and that segmentation-oriented rewards contribute more to downstream performance than format rewards.
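For readers unfamiliar with GRPO, the sketch below shows its core step: rewards for a group of responses sampled for the same prompt are normalized against the group's own mean and standard deviation to obtain advantages. The `total_reward` helper, its weights, and the example reward values are hypothetical; the summary only states that segmentation-oriented and format rewards are both used.

```python
import numpy as np

def total_reward(seg_reward, format_reward, w_seg=1.0, w_format=0.5):
    """Combine a segmentation reward with a format reward (weights are assumptions)."""
    return w_seg * seg_reward + w_format * format_reward

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize rewards within one sampled group."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Four sampled responses for one image-instruction pair (toy values):
# (segmentation reward such as mask IoU, format reward in {0, 1}).
samples = [(0.20, 1.0), (0.80, 1.0), (0.50, 0.0), (0.50, 1.0)]
rewards = [total_reward(seg, fmt) for seg, fmt in samples]
adv = group_relative_advantages(rewards)
print(np.round(adv, 3))
```

Responses that score above the group average receive positive advantages and are reinforced, which removes the need for a separately learned value function.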
Empirical Results
SegCompass achieves state-of-the-art performance across five challenging benchmarks, consistently matching or surpassing prior methods in cumulative and mean IoU metrics. On RefCOCO(+/g), gRefCOCO, and ReasonSeg, SegCompass outperforms latent query alignment and textual localization readout approaches, especially when scaled up to larger backbones (13B+). Notably, reinforcement learning (GRPO) markedly enhances generalization in zero-shot settings on ReasonSeg.
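The two IoU variants mentioned above weight samples differently, which the following sketch makes explicit. It follows the definitions commonly used in referring-segmentation work (cumulative IoU as total intersection over total union, mean IoU as the average of per-image IoUs); treating an empty prediction on an empty target as IoU 1.0 is an assumed convention, and the masks are toy data.

```python
import numpy as np

def iou_stats(preds, gts):
    """Return (cumulative IoU, mean per-image IoU) for lists of boolean masks."""
    inters = [int(np.logical_and(p, g).sum()) for p, g in zip(preds, gts)]
    unions = [int(np.logical_or(p, g).sum()) for p, g in zip(preds, gts)]
    ciou = sum(inters) / sum(unions)          # pooled over the whole dataset
    # Assumed convention: an empty prediction on an empty target counts as 1.0.
    miou = float(np.mean([i / u if u > 0 else 1.0 for i, u in zip(inters, unions)]))
    return ciou, miou

# Sample A: small object, perfect prediction (4 pixels).
gt_a = np.zeros((10, 10), dtype=bool)
gt_a[0:2, 0:2] = True
pred_a = gt_a.copy()

# Sample B: large object (100 pixels), prediction covers half of it.
gt_b = np.ones((10, 10), dtype=bool)
pred_b = np.zeros((10, 10), dtype=bool)
pred_b[0:5, :] = True

ciou, miou = iou_stats([pred_a, pred_b], [gt_a, gt_b])
print(round(ciou, 3), miou)  # 0.519 0.75
```

Cumulative IoU is dominated by large objects (54 / 104 here), whereas mean IoU treats every image equally, so the two can rank methods differently.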
Quantitative analyses support the claim that sparse concept quality correlates directly with mask accuracy: instance coverage by top-K SAE activations substantially exceeds random baselines, and correlation studies between SAE reconstruction and mask Dice loss show a robust association (Pearson R > 0.69).
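For reference, the Pearson coefficient used in such a correlation study can be computed as in the sketch below. The data are synthetic and generated only to exercise the function; they are not the paper's measurements, and the variable names `recon_metric` and `dice` are hypothetical stand-ins for per-sample SAE reconstruction and mask Dice loss values.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation: covariance divided by the product of standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))

# Synthetic per-sample values (NOT results from the paper).
rng = np.random.default_rng(0)
recon_metric = rng.uniform(size=50)
dice = 0.7 * recon_metric + 0.3 * rng.uniform(size=50)   # correlated by construction

r = pearson_r(recon_metric, dice)
print(round(r, 3))
```

The value lies in [-1, 1]; a coefficient above 0.69, as reported, indicates that the two quantities move together across samples, although it does not by itself establish a causal link.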
Qualitative visualizations confirm that activated SAE tokens and multi-slot heatmap peaks tightly localize semantically relevant regions as specified by the CoT instruction, validating the pathway's interpretable mapping from language to spatial perception.
Implications and Future Directions
The explicit, interpretable alignment enabled by SegCompass opens avenues for reliable reasoning segmentation: intermediate concepts can be audited, manipulated, or constrained, which supports robust deployment in high-stakes environments (e.g., easier debugging or targeted interventions in robotics). The architecture offers a scalable paradigm for integrating sparse concept encoding with slot-based spatial grounding, enhancing controllability and transparency.
Future developments could extend SAE-driven interpretable alignment to broader multimodal reasoning tasks, couple with advanced slot attention mechanisms, and explore hierarchical segmentation driven by compositional reasoning traces. Investigations into sequential concept tracing, online concept revision, or modularity in concept space could further advance interpretable AI in vision-language systems.
Conclusion
SegCompass establishes a new paradigm for reasoning segmentation by fusing SAE-derived sparse concept encoding with interpretable spatial grounding, jointly optimized via reinforcement and mask supervision. Empirical evidence confirms that this explicit alignment mechanism not only yields high performance but also provides genuine interpretability, tightly linking reasoning provenance to dense visual perception (2605.22658). The approach sets the foundation for transparent and controllable multimodal systems, with broad practical and theoretical ramifications for the future of interpretable AI.