YOLOE: Real-Time Seeing Anything

Published 10 Mar 2025 in cs.CV (arXiv:2503.07465v1)

Abstract: Object detection and segmentation are widely employed in computer vision applications, yet conventional models like the YOLO series, while efficient and accurate, are limited by predefined categories, hindering adaptability in open scenarios. Recent open-set methods leverage text prompts, visual cues, or prompt-free paradigms to overcome this, but often compromise between performance and efficiency due to high computational demands or deployment complexity. In this work, we introduce YOLOE, which integrates detection and segmentation across diverse open prompt mechanisms within a single highly efficient model, achieving real-time seeing anything. For text prompts, we propose the Re-parameterizable Region-Text Alignment (RepRTA) strategy. It refines pretrained textual embeddings via a re-parameterizable lightweight auxiliary network and enhances visual-textual alignment with zero inference and transfer overhead. For visual prompts, we present the Semantic-Activated Visual Prompt Encoder (SAVPE). It employs decoupled semantic and activation branches to bring improved visual embedding and accuracy with minimal complexity. For the prompt-free scenario, we introduce the Lazy Region-Prompt Contrast (LRPC) strategy. It utilizes a built-in large vocabulary and specialized embedding to identify all objects, avoiding costly LLM dependency. Extensive experiments show YOLOE's exceptional zero-shot performance and transferability with high inference efficiency and low training cost. Notably, on LVIS, with 3$\times$ less training cost and 1.4$\times$ inference speedup, YOLOE-v8-S surpasses YOLO-Worldv2-S by 3.5 AP. When transferring to COCO, YOLOE-v8-L achieves 0.6 AP$^b$ and 0.4 AP$^m$ gains over closed-set YOLOv8-L with nearly 4$\times$ less training time. Code and models are available at https://github.com/THU-MIG/yoloe.

Summary

YOLOE: A Unified Model for Real-Time Object Detection and Segmentation in Open Scenarios

The paper "YOLOE: Real-Time Seeing Anything" introduces YOLOE, a pivotal advancement in the field of computer vision, particularly focusing on object detection and segmentation across open-set scenarios. While traditional models like the YOLO series demonstrate robust performance in predefined closed category settings, YOLOE seeks to extend these capabilities to open environments by integrating diverse open prompt mechanisms into a single efficient model.

Methodological Innovations

YOLOE is designed to handle three types of inputs: textual, visual, and prompt-free, each addressed through novel strategies:

  1. Re-parameterizable Region-Text Alignment (RepRTA): To enable textual prompts, RepRTA enhances pretrained textual embeddings during training using a lightweight auxiliary network. This improvement ensures better visual-text alignment without adding overhead during inference or deployment.
  2. Semantic-Activated Visual Prompt Encoder (SAVPE): For processing visual prompts, YOLOE employs SAVPE with decoupled semantic and activation branches. This design encodes visual cues with minimal added complexity while improving the accuracy of the resulting visual embeddings.
  3. Lazy Region-Prompt Contrast (LRPC): Addressing scenarios without explicit prompts, LRPC retrieves categories from a built-in large vocabulary, removing the dependency on costly large language models. This strategy efficiently identifies and labels all objects present in an image.
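The re-parameterization idea behind RepRTA can be illustrated with a minimal sketch. Assuming (for illustration only) that the auxiliary network is a single linear layer, its effect on the frozen text embeddings can be folded into the embeddings once after training, so the deployment path is a plain dot product with no extra cost. All names and dimensions below are hypothetical, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions for illustration.
d_text, d_embed, n_classes, n_regions = 64, 64, 5, 3

# Frozen pretrained text embeddings, one per prompted category.
text_emb = rng.normal(size=(n_classes, d_text))

# Lightweight auxiliary network, assumed linear here for the sketch.
A = rng.normal(size=(d_text, d_embed))

def logits_with_aux(region_feats, text_emb, A):
    """Training-time path: refine text embeddings on the fly."""
    refined = text_emb @ A
    return region_feats @ refined.T

# Re-parameterize: fold the auxiliary network into the embeddings once,
# offline, after training.
refined_static = text_emb @ A

def logits_reparam(region_feats, refined_static):
    """Deployment path: plain region-text dot product, no added overhead."""
    return region_feats @ refined_static.T

# Both paths produce identical region-text alignment scores.
region_feats = rng.normal(size=(n_regions, d_embed))
assert np.allclose(logits_with_aux(region_feats, text_emb, A),
                   logits_reparam(region_feats, refined_static))
```

The point of the sketch is only the folding step: whatever refinement the auxiliary network applies during training is absorbed into the static embeddings, which is why the paper can claim zero overhead at inference and transfer time.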

Strong Numerical Results

YOLOE exhibits notable performance enhancements over existing models. On the LVIS dataset, YOLOE-v8-S achieves a 3.5 AP increase over YOLO-Worldv2-S with 3× lower training cost and a 1.4× inference speedup. Furthermore, when transferred to COCO, YOLOE-v8-L records gains of 0.6 AP$^b$ and 0.4 AP$^m$ over the closed-set YOLOv8-L, with nearly 4× less training time.

Implications and Future Directions

The research lays groundwork for future developments in real-time object detection and segmentation frameworks, stressing the importance of prompt-based mechanisms that can adapt efficiently to open environments. The ability to integrate multiple types of prompts into a single architecture not only enhances the model’s applicability but also bolsters its deployment in practical scenarios, such as autonomous navigation or interactive robotics.

YOLOE’s open prompt-driven design anticipates wider implications in AI development by supporting real-time adaptation to diverse environmental contexts without exhausting computational resources. Future directions may involve refining the alignment models, expanding the vocabulary sets for better category naming precision, and exploring deeper integration with generative models for enhanced descriptive capabilities.

Conclusion

YOLOE represents a significant step in overcoming existing barriers in object detection and segmentation within open domains. By fusing prompt-driven mechanisms into a unified model with exceptional efficiency, YOLOE sets a new benchmark for subsequent research endeavors in the field, inviting exploration of its widespread applicability and potential expansion to other vision-related tasks.
