Beyond-Labels: Advancing Open-Vocabulary Segmentation With Vision-Language Models
Abstract: Open-vocabulary semantic segmentation attempts to classify and outline objects in an image using arbitrary text labels, including those unseen during training. Self-supervised learning resolves numerous visual and linguistic processing problems when effectively trained. This study investigates simple yet efficient methods for adapting previously learned foundation models for open-vocabulary semantic segmentation tasks. Our research proposes "Beyond-Labels", a lightweight transformer-based fusion module that uses a small amount of image segmentation data to fuse frozen visual representations with language concepts. This strategy allows the model to leverage the extensive knowledge of pre-trained models without requiring significant retraining, making the approach data-efficient and scalable. Furthermore, we capture positional information in images using Fourier embeddings, improving generalization and enabling smooth and consistent spatial encoding. We perform thorough ablation studies to examine the main components of our proposed method. On the standard benchmark PASCAL-5i, the method performs better despite being trained on frozen vision and language representations. Index Terms: Beyond-Labels, open-vocabulary semantic segmentation, Fourier embeddings, PASCAL-5i
Paper Prompts
Sign up for free to create and run prompts on this paper using GPT-5.
Top Community Prompts
Collections
Sign up for free to add this paper to one or more collections.