This paper, "Pixel-SAIL: Single Transformer For Pixel-Grounded Understanding" (Zhang et al., 14 Apr 2025), introduces Pixel-SAIL, a novel approach for fine-grained pixel-level understanding tasks like referring segmentation and visual prompt-based question answering. The core idea is to significantly simplify the architecture of Multimodal LLMs (MLLMs) for these tasks by using only a single transformer, moving away from the complex pipelines that combine LLMs with separate vision encoders (like CLIP) and segmentation experts (like SAM). This aligns with the recent trend of "encoder-free" SAIL (Single trAnsformer as a unified vIsion-LLM) designs.
Existing pixel-wise MLLMs, such as LISA, PixelLM, and GLaMM, achieve strong performance but rely on integrating multiple specialized components. Pixel-SAIL demonstrates that a single transformer can achieve comparable or superior results with a much simpler design.
The paper first describes a "Plain Baseline" built on an encoder-free MLLM. This baseline attempts pixel tasks by reshaping vision tokens into spatial feature maps, decoding masks from segmentation tokens, and pooling patch embeddings over visual-prompt regions. However, this naive approach suffers from poor mask quality due to the low resolution of the vision tokens, and it struggles with visual prompt understanding because the pooled patch embeddings lack sufficient semantic information.
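As a rough illustration of the baseline's prompt handling (an assumption about the exact mechanics, since this summary only describes it at a high level), masked average pooling over low-resolution patch embeddings would look like the sketch below; averaging a handful of coarse patch features is precisely where fine-grained semantics get lost.

```python
import torch

def masked_pool_prompt(vision_tokens: torch.Tensor, prompt_mask: torch.Tensor) -> torch.Tensor:
    """Illustrative baseline: pool low-res patch embeddings inside a visual-prompt region."""
    # vision_tokens: (N, dim) patch embeddings; prompt_mask: (N,) binary mask over the patch grid.
    m = prompt_mask.float().unsqueeze(-1)                         # (N, 1)
    return (vision_tokens * m).sum(0) / m.sum().clamp(min=1.0)    # (dim,) pooled prompt embedding
```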
To address these limitations while maintaining architectural simplicity, Pixel-SAIL introduces three key technical improvements:
- Learnable Up-sampling Module: This module refines low-resolution visual tokens by upscaling them to a higher resolution (e.g., from $H/S \times W/S$ to $H/4 \times W/4$) using simple transposed 2D convolutions and depth-wise convolutions. This provides the necessary high-resolution features for accurate pixel-level grounding, directly impacting segmentation mask quality (see the up-sampling sketch after this list).
- Visual Prompt Injection: Instead of pooling features, visual prompts (like masks) are mapped to special learnable "visual prompt" (VP) tokens added to the LLM's vocabulary. These VP tokens' embeddings are added directly to the vision tokens before they are processed by the main transformer. This early fusion allows the model to understand visual prompts by leveraging the corresponding special tokens within text instructions, improving performance on prompt-based tasks without needing an extra visual prompt encoder (see the prompt-injection sketch after this list).
- Dense Feature Distillation: To enhance fine-grained feature extraction and improve mask quality, especially at object boundaries, Pixel-SAIL distills knowledge from pre-trained segmentation experts (specifically, features from Mask2Former's pixel decoder and SAM2's encoder). This distillation is applied to the model's upsampled mask features and low-resolution image features using MSE loss, improving segmentation quality with only a minor increase in training cost.
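A minimal PyTorch sketch of what such an up-sampling module could look like, assuming vision tokens arrive as a flattened sequence with hidden size `dim` and a patch stride of, e.g., $S = 16$ (so two 2x stages reach the $H/4 \times W/4$ resolution); the class and argument names are illustrative, not taken from the paper's code.

```python
import torch
import torch.nn as nn

class LearnableUpsampler(nn.Module):
    """Illustrative up-sampler: lifts H/S x W/S vision tokens to H/4 x W/4 mask features."""
    def __init__(self, dim: int, out_dim: int, scale: int = 4):
        super().__init__()
        stages, ch = [], dim
        for _ in range(int(scale).bit_length() - 1):   # log2(scale) stages of 2x upsampling
            stages += [
                nn.ConvTranspose2d(ch, ch // 2, kernel_size=2, stride=2),              # 2x spatial upsample
                nn.Conv2d(ch // 2, ch // 2, kernel_size=3, padding=1, groups=ch // 2),  # depth-wise refinement
                nn.GELU(),
            ]
            ch //= 2
        stages.append(nn.Conv2d(ch, out_dim, kernel_size=1))  # project to the mask-feature dimension
        self.net = nn.Sequential(*stages)

    def forward(self, vision_tokens: torch.Tensor, h: int, w: int) -> torch.Tensor:
        # vision_tokens: (B, h*w, dim) -> reshape to a (B, dim, h, w) grid, then upsample.
        b, n, d = vision_tokens.shape
        x = vision_tokens.transpose(1, 2).reshape(b, d, h, w)
        return self.net(x)  # (B, out_dim, h*scale, w*scale)
```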
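A companion sketch of the visual prompt injection, under the assumption that each visual prompt arrives as a binary mask over the patch grid and that the p-th prompt in a sample reuses the p-th learnable VP-token embedding; names are again hypothetical.

```python
import torch
import torch.nn as nn

class VisualPromptInjector(nn.Module):
    """Adds learnable VP-token embeddings onto the vision tokens covered by each prompt mask."""
    def __init__(self, num_vp_tokens: int, dim: int):
        super().__init__()
        # One embedding per special VP token (e.g. <vp1>, <vp2>, ...) added to the LLM vocabulary.
        self.vp_embed = nn.Embedding(num_vp_tokens, dim)

    def forward(self, vision_tokens: torch.Tensor, prompt_masks: torch.Tensor) -> torch.Tensor:
        # vision_tokens: (B, N, dim); prompt_masks: (B, P, N) binary masks over the patch grid,
        # with P <= num_vp_tokens prompts per sample.
        b, p, n = prompt_masks.shape
        vp = self.vp_embed.weight[:p]                          # (P, dim)
        injected = prompt_masks.float().transpose(1, 2) @ vp   # (B, N, dim): add vp where the mask is 1
        return vision_tokens + injected                        # early fusion before the transformer
```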
For practical implementation, Pixel-SAIL can be built upon existing encoder-free MLLMs like SOLO or EVEv2. The paper demonstrates successful implementations with 0.5B, 3B, and 7B parameter sizes using Qwen2.5 or EVEv2 as the base LLM. Training is performed end-to-end on a diverse "Dataset Engine" combining various sources:
- Referring segmentation datasets (RefCOCO/+/g, COCO semantic segmentation, GranD-f, MUSE, Pixel2Cap reformulated as referring segmentation, and COCO panoptic segmentation).
- Visual prompt understanding datasets (Osprey, Pixel2Cap).
- Automatically generated detailed object captions (using SOTA models like InternVL2.5 and Qwen2.5-VL, manually checked).
- Standard VQA data (LLaVA-1.5 665k) to maintain general conversation ability.
The training loss combines a next-token prediction loss, a segmentation loss, and the distillation loss (a sketch of the combined objective follows below). Implementation details include using DeepSpeed for training efficiency, handling high-resolution images (resizing the long side to 1024 while preserving aspect ratio for SOLO-based models, and resizing to 800×800 for EVEv2-based models to reduce cost), and specific loss-weighting hyperparameters.
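A hedged sketch of how the combined objective could be assembled: the segmentation term is written as a common dice + BCE combination and the loss weights are placeholders, since the paper's exact formulation and hyperparameters are not reproduced here.

```python
import torch
import torch.nn.functional as F

def dice_loss(pred_logits: torch.Tensor, target: torch.Tensor, eps: float = 1.0) -> torch.Tensor:
    """Soft dice loss over flattened masks; a common choice for segmentation heads."""
    pred = pred_logits.sigmoid().flatten(1)
    target = target.flatten(1)
    inter = (pred * target).sum(-1)
    union = pred.sum(-1) + target.sum(-1)
    return (1 - (2 * inter + eps) / (union + eps)).mean()

def pixel_sail_style_loss(lm_logits, lm_labels, mask_logits, gt_masks,
                          student_feats, teacher_feats,
                          w_seg: float = 1.0, w_distill: float = 1.0):
    # 1) Next-token prediction over the text stream (ignore_index masks out non-label positions).
    loss_text = F.cross_entropy(lm_logits.flatten(0, 1), lm_labels.flatten(), ignore_index=-100)
    # 2) Segmentation loss on masks decoded from the segmentation tokens (dice + BCE here).
    loss_seg = dice_loss(mask_logits, gt_masks) + F.binary_cross_entropy_with_logits(mask_logits, gt_masks)
    # 3) Dense feature distillation: MSE against frozen expert features
    #    (e.g. Mask2Former pixel-decoder / SAM2 encoder features, resized to match the student's).
    loss_distill = F.mse_loss(student_feats, teacher_feats)
    return loss_text + w_seg * loss_seg + w_distill * loss_distill
```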
To benchmark performance comprehensively, the authors introduce PerBench, a new pixel-grounded understanding benchmark. PerBench addresses limitations of existing benchmarks by including:
- Detailed object caption: Requires generating fine-grained descriptions, evaluated by METEOR.
- Visual prompt-based Multiple Choice Question (MCQ): Assesses understanding of referenced objects' attributes and relationships in a quantitative, multi-choice format, evaluated by Accuracy.
- Visual-Text Referring Segmentation (V-T RES): Requires segmenting objects specified by both visual prompts and text, evaluating complex joint understanding via cIoU and gIoU (a metric sketch follows this list).
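For reference, cIoU and gIoU are typically computed as below in referring-segmentation evaluation: cIoU accumulates intersection and union over the whole dataset before dividing, while gIoU averages the per-sample IoUs. This is the standard formulation, assumed here rather than copied from PerBench's evaluation code.

```python
import numpy as np

def ciou_giou(pred_masks: list[np.ndarray], gt_masks: list[np.ndarray]) -> tuple[float, float]:
    """cIoU: dataset-level intersection / union.  gIoU: mean of per-sample IoUs."""
    total_inter, total_union, per_sample = 0.0, 0.0, []
    for pred, gt in zip(pred_masks, gt_masks):
        pred, gt = pred.astype(bool), gt.astype(bool)
        inter = np.logical_and(pred, gt).sum()
        union = np.logical_or(pred, gt).sum()
        total_inter += inter
        total_union += union
        per_sample.append(inter / union if union > 0 else 1.0)  # empty pred and gt counts as a perfect match
    ciou = total_inter / total_union if total_union > 0 else 1.0
    giou = float(np.mean(per_sample)) if per_sample else 0.0
    return ciou, giou
```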
Experimental results show Pixel-SAIL's effectiveness. Even the smaller 0.5B Pixel-SAIL model outperforms larger 7B MLLMs with experts (like LISA-7B) on standard referring segmentation benchmarks (RefCOCO/+/g) and visual prompt understanding (region caption METEOR on RefCOCOg). The 3B Pixel-SAIL model achieves state-of-the-art results, surpassing even larger models like Sa2VA-4B, GLaMM-7B, and OMG-LLaVA-7B on these tasks and on the challenging PerBench, demonstrating stronger detailed captioning, MCQ accuracy, and V-T RES capability. Ablation studies confirm the critical contribution of each proposed component and the importance of data scaling. Visualizations show that Pixel-SAIL learns denser, more discriminative visual features compared to the base encoder-free models.
The paper highlights that Pixel-SAIL's main limitation is the scale of the current training data (less than 2M samples), suggesting that performance could be further improved by training on larger datasets, potentially with billions of masks and visual prompts, similar to those used to train large vision models or segmentation specialists. The simple single-transformer architecture provides a promising direction for building efficient and capable pixel-grounded MLLMs for real-world applications requiring fine-grained visual understanding and interaction.