Papers

Topics

Authors

Recent

View all

Gemini 2.5 Flash

Gemini 2.5 Flash 92 tok/s

Gemini 2.5 Pro 47 tok/s Pro

GPT-5 Medium 32 tok/s

GPT-5 High 36 tok/s Pro

GPT-4o 88 tok/s

GPT OSS 120B 471 tok/s Pro

Kimi K2 220 tok/s Pro

2000 character limit reached

LlamaSeg: Image Segmentation via Autoregressive Mask Generation (2505.19422v1)

Published 26 May 2025 in cs.CV

Abstract: We present LlamaSeg, a visual autoregressive framework that unifies multiple image segmentation tasks via natural language instructions. We reformulate image segmentation as a visual generation problem, representing masks as "visual" tokens and employing a LLaMA-style Transformer to predict them directly from image inputs. By adhering to the next-token prediction paradigm, our approach naturally integrates segmentation tasks into autoregressive architectures. To support large-scale training, we introduce a data annotation pipeline and construct the SA-OVRS dataset, which contains 2M segmentation masks annotated with over 5,800 open-vocabulary labels or diverse textual descriptions, covering a wide spectrum of real-world scenarios. This enables our model to localize objects in images based on text prompts and to generate fine-grained masks. To more accurately evaluate the quality of masks produced by visual generative models, we further propose a composite metric that combines Intersection over Union (IoU) with Average Hausdorff Distance (AHD), offering a more precise assessment of contour fidelity. Experimental results demonstrate that our method surpasses existing generative models across multiple datasets and yields more detailed segmentation masks.

Collections

Summary

The paper introduces an innovative autoregressive model that generates segmentation masks as visual tokens by integrating textual cues and visual features.
It presents a novel data annotation pipeline and the SA-OVRS dataset with over 2M annotated masks and 5,800+ unique open-vocabulary labels.
Experimental results show that LlamaSeg outperforms existing models on benchmarks, delivering enhanced mask quality and boundary accuracy.

LlamaSeg: Autoregressive Mask Generation for Image Segmentation

The paper "LlamaSeg: Image Segmentation via Autoregressive Mask Generation" (2505.19422) introduces LlamaSeg, a novel visual autoregressive framework designed to unify various image segmentation tasks through natural language instructions. This approach reframes image segmentation as a visual generation problem, representing masks as discrete "visual tokens." A LLaMA-style Transformer predicts these tokens directly from image inputs, integrating segmentation tasks into autoregressive architectures via next-token prediction. To facilitate large-scale training, the authors introduce a data annotation pipeline and the SA-OVRS dataset, which contains 2M segmentation masks annotated with over 5,800 open-vocabulary labels or textual descriptions. Additionally, the paper proposes a composite metric combining Intersection over Union (IoU) with Average Hausdorff Distance (AHD) to evaluate mask quality more accurately.

Methodology

LlamaSeg's framework consists of a mask tokenizer and an autoregressive model (Figure 1). The mask tokenizer, based on VQGAN [esser2021taming], converts segmentation masks into discrete tokens. The autoregressive model, built on the LLaMA architecture [touvron2023llama, touvron2023llama2], takes encoded image sequences as input and generates mask tokens. Language instructions provide object cues, unifying general semantic segmentation and language-guided segmentation. The model concatenates textual and visual features into a unified sequence for simultaneous processing.

Figure 1: Overall framework of LlamaSeg, comprising a mask tokenizer and an autoregressive model, with GT labels used for training supervision.

For training, the ground truth mask is encoded into code indices, serving as targets for the LLaMA model. During inference, the generated mask tokens are decoded to produce the segmentation mask. The image and text features are extracted using frozen SigLIP2 [tschannen2025siglip] and projected to align with the LLaMA embedding dimension via trainable adaptors. Learnable separators distinguish between text, image, and mask tokens.

Data Annotation Pipeline

The authors address the limitations of existing segmentation datasets by introducing a new data annotation pipeline and the SA-OVRS dataset (Figure 2). This dataset includes 2M segmentation instances annotated with open-vocabulary labels or diverse textual descriptions, covering over 5,800 unique labels. The pipeline uses SA-1B [kirillov2023segment] dataset masks and employs Qwen2-VL-72B [wang2024qwen2] as the captioning model to generate open-vocabulary labels. GroundingDINO [liu2024grounding] then matches these labels with corresponding masks, using a rigorous matching and filtering algorithm to ensure data validity. The pipeline also generates unique textual descriptions for each mask, enhancing data diversity through referring and reasoning segmentation data.

Figure 2: The two-stage data generation pipeline involves generating labels, matching them with masks, identifying discrete entities, and creating textual data based on the labels.

Experimental Results and Analysis

The model was evaluated on closed-set and open-vocabulary semantic segmentation tasks using mean IoU (mIoU), and on referring segmentation tasks using cumulative intersection over cumulative union (cIoU). To evaluate contour accuracy, the authors proposed average Hausdorff Distance ( $d_{AHD}$ ) under multi-level IoU thresholds. Experimental results demonstrate that LlamaSeg outperforms existing visual generative models across multiple benchmarks and yields more detailed segmentation masks (Figure 3).

Figure 3: A visual comparison demonstrates LlamaSeg's superior mask quality and segmentation results compared to existing visual generative models.

Specifically, with 0.77B parameters, LlamaSeg outperforms Unified-IO by 10.6/3.8/11.8 on the ADE20K, COCOStuff, and PAS-20 datasets, respectively. Furthermore, LlamaSeg demonstrates superior alignment with ground truth contours, indicating accurate segmentation of object boundaries, as evidenced by lower mAHD values compared to Unified-IO models. Ablation studies on the mask tokenizer and decoding strategies further validate the effectiveness of the proposed approach. Analysis of attention maps reveals that the model can learn 2D spatial relationships from 1D tokens (Figure 4).

Figure 4: Different implementations of image segmentation in autoregressive frameworks, with visual tokens as masks in LlamaSeg.

Implications and Future Directions

The research demonstrates the potential of autoregressive models for fine-grained visual tasks with deterministic ground truth. The introduction of the SA-OVRS dataset addresses a critical need for large-scale, diverse segmentation data suitable for training autoregressive models. The composite evaluation metric, combining IoU and AHD, provides a more nuanced assessment of mask quality. Future research directions could explore the application of LlamaSeg to other vision tasks, further scaling the model and dataset, and investigating alternative mask tokenization strategies.

LlamaSeg: Image Segmentation via Autoregressive Mask Generation (2505.19422v1)

Collections

Summary

LlamaSeg: Autoregressive Mask Generation for Image Segmentation

Methodology

Data Annotation Pipeline

Experimental Results and Analysis

Implications and Future Directions

Paper Prompts

Follow-up Questions

Authors (6)

YouTube

LlamaSeg: Image Segmentation via Autoregressive Mask Generation (2505.19422v1)

Collections

Summary

LlamaSeg: Autoregressive Mask Generation for Image Segmentation

Methodology

Data Annotation Pipeline

Experimental Results and Analysis

Implications and Future Directions

Paper Prompts

Follow-up Questions

Related Papers

Authors (6)

YouTube