Exploring Open-Vocabulary Object Recognition in Images using CLIP

Published 6 Mar 2026 in cs.CV | (2603.05962v1)

Abstract: To address the limitations of existing open-vocabulary object recognition methods, specifically high system complexity, substantial training costs, and limited generalization, this paper proposes a novel Open-Vocabulary Object Recognition (OVOR) framework based on a streamlined two-stage strategy: object segmentation followed by recognition. The framework eliminates the need for complex retraining and labor-intensive annotation. After cropping object regions, we generate object-level image embeddings alongside category-level text embeddings using CLIP, which facilitates arbitrary vocabularies. To reduce reliance on CLIP and enhance encoding flexibility, we further introduce a CNN/MLP-based method that extracts convolutional neural network (CNN) feature maps and utilizes a multilayer perceptron (MLP) to align visual features with text embeddings. These embeddings are concatenated and processed via Singular Value Decomposition (SVD) to construct a shared representation space. Finally, recognition is performed through embedding similarity matching. Experiments on COCO, Pascal VOC, and ADE20K demonstrate that training-free, CLIP-based encoding without SVD achieves the highest average AP, outperforming current state-of-the-art methods. Simultaneously, the results highlight the potential of CNN/MLP-based image encoding for OVOR.