DART: Real-Time Object Detection
- The paper introduces DART, a framework that combines automated open-vocabulary annotation, sensor fusion, and promptable detection for real-time object recognition.
- DART employs a shared visual backbone and batched text or visual queries to achieve O(1) multi-class inference, significantly reducing end-to-end latency.
- Practical implementations demonstrate substantial accuracy and speed enhancements in industrial and surveillance applications, using modular designs that support scalability and rapid adaptation.
Detect Anything in Real Time (DART) refers to a set of frameworks, pipelines, and practical systems for open-world, promptable, or rule-based detection of arbitrary entities or events in real-time vision or sensor streams. DART encompasses both generic object detection systems that operate at low latency across diverse categories using natural language or visual queries, and specialized real-time alerting architectures (such as smart intruder systems) that fuse multiple sensor modalities for rapid notification. The field spans training-free wrappers for promptable detection, automated open-vocabulary annotation and pseudo-labeling pipelines, and hardware-efficient, unified architectures for detection and segmentation.
1. Conceptual Foundations and System Typologies
Early DART systems, such as iDART, were primarily focused on modular sensor fusion for home or perimeter security using threshold-activated event detection and low-latency notification (Kumar et al., 2015). In contrast, modern DART pipelines typically refer to promptable, open-vocabulary computer vision architectures that can detect objects of arbitrary, user-specified classes via natural language or image-based prompts at real-time speeds. The unifying theme is minimizing end-to-end latency for actionable outputs, while supporting broad extensibility to unseen classes, environments, or sensor types.
The spectrum of DART approaches includes:
- Pure rule-based, multi-sensor fusion with real-time alerting (e.g., ultrasonic sensors, beam-break detectors, and cameras)
- Automated training pipelines leveraging generative, open-vocabulary, and large multimodal models for object detection without manual annotation (Xin et al., 2024)
- Promptable, transformer-based architectures that support single-pass, multi-class detection via shared image features and batched text queries (Turkcan, 12 Mar 2026)
- Unified networks achieving arbitrary-prompt object detection and segmentation in a single, hardware-optimized model (Wang et al., 10 Mar 2025)
2. Automated End-to-End Detection Pipelines
A canonical modern DART pipeline automates every stage from data collection to real-time deployment, eliminating the need for manual labeling or curation. The "DART: An Automated End-to-End Object Detection Pipeline" framework (Xin et al., 2024) defines a modular architecture with the following four stages:
- Data Diversification: Instance-level fine-tuning of a subject-driven image generator (DreamBooth with SDXL) using ∼20–50 real images per category, then generating hundreds of synthetic, photorealistic variants for each object with diverse scene, lighting, and pose prompts.
- Open-Vocabulary Annotation: Pseudo-labels are generated by applying an open-vocabulary detector (e.g., Grounding DINO) to both synthetic and real images. Multiple textual prompts per class (original, synonyms, co-occurring objects) are used, with boxes filtered by confidence and class-agnostic NMS.
- Pseudo-Label Review: Large multimodal models (LMMs) such as InternVL-1.5 (for image realism) and GPT-4o (for bounding box quality) review the outputs, rejecting any image or box not meeting strict precision, recall, and fit criteria.
- Model Training: A composite real (1 part) and synthetic (3 parts) dataset is used to fine-tune real-time detectors such as YOLOv8 and YOLOv10, after which all auxiliary models are discarded.
This architecture enables fully automated, scalable pipeline construction for industrial object detection across arbitrary classes with reported average precision (AP₅₀–₉₅) gains from 0.064 to 0.832 on a 23-class, 15K-image construction machinery dataset (see Table 1) (Xin et al., 2024).
| Stage | Main Tools | Output/Effect |
|---|---|---|
| Data Diversification | DreamBooth, SDXL, prompt sampling | Synthetic images per object class |
| Annotation | Grounding DINO, open-vocab prompts | Multi-class, prompt-aligned bounding boxes |
| Pseudo-Label Review | InternVL-1.5, GPT-4o | Verified, high-quality boxes and images |
| Training | YOLOv8, YOLOv10 | Fast, accurate open-vocab object detector |
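The annotation stage's filtering step (confidence thresholding followed by class-agnostic NMS) can be sketched as follows. This is a minimal illustration rather than the pipeline's actual code: the `Box` type, the `filter_pseudo_labels` helper, and the threshold values `min_score=0.35` and `iou_thresh=0.5` are assumptions chosen for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Box:
    x1: float
    y1: float
    x2: float
    y2: float
    score: float
    label: str


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a.x2, b.x2) - max(a.x1, b.x1))
    ih = max(0.0, min(a.y2, b.y2) - max(a.y1, b.y1))
    inter = iw * ih
    union = (a.x2 - a.x1) * (a.y2 - a.y1) + (b.x2 - b.x1) * (b.y2 - b.y1) - inter
    return inter / union if union > 0 else 0.0


def filter_pseudo_labels(boxes: List[Box], min_score: float = 0.35,
                         iou_thresh: float = 0.5) -> List[Box]:
    """Drop low-confidence boxes, then apply class-agnostic NMS.

    Class-agnostic means overlap is checked across all labels, so two
    prompts (e.g. a class name and its synonym) that fire on the same
    object yield a single box. Thresholds are hypothetical defaults.
    """
    candidates = sorted((b for b in boxes if b.score >= min_score),
                        key=lambda b: b.score, reverse=True)
    kept: List[Box] = []
    for box in candidates:
        if all(iou(box, k) <= iou_thresh for k in kept):
            kept.append(box)
    return kept


raw = [
    Box(10, 10, 110, 110, 0.90, "excavator"),
    Box(12, 12, 112, 112, 0.60, "digger"),   # synonym prompt, same object
    Box(200, 200, 260, 260, 0.20, "crane"),  # below the confidence threshold
    Box(300, 50, 380, 150, 0.70, "crane"),
]
kept = filter_pseudo_labels(raw)
print([(b.label, b.score) for b in kept])
```

In this toy input the synonym box is suppressed by the higher-scoring box on the same object, and the low-confidence box never reaches NMS.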
3. Real-Time Promptable Detection Architectures
Promptable DART frameworks transform prompt-in-the-loop models (e.g., SAM3) into genuinely real-time, multi-class detection systems. The core insight, as in (Turkcan, 12 Mar 2026), is that the visual backbone is class-agnostic: image features can be computed once and then reused for batched, multi-class cross-modal inference. This reduces the number of backbone passes per image from O(N), one per class prompt, to O(1), enabling sub-100 ms multi-class detection on commodity GPUs.
Pipeline for multi-class DART with SAM3:
- Shared Backbone: Compute feature pyramid network (FPN) features once from the ViT-H/14 backbone.
- Batching: Stack N class text embeddings; tile image features into N batches.
- Joint Inference: Run the cross-modal encoder-decoder for all classes in a single pass, discarding segmentation heads for detection-only.
- Postprocessing: Filter by presence score threshold and apply non-maximum suppression (NMS).
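The shared-backbone and batching steps above can be illustrated with a toy NumPy sketch. It does not use SAM3 or any real model API: `ToyBackbone`, `decode`, `naive_detect`, and `batched_detect` are hypothetical stand-ins, and the "decoder" is a plain dot product. The point is only that the backbone runs once regardless of the number of class prompts while producing the same scores.

```python
import numpy as np


class ToyBackbone:
    """Stand-in for a class-agnostic visual backbone; counts its own calls."""

    def __init__(self, dim: int = 8, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.proj = rng.standard_normal((3, dim))
        self.calls = 0

    def __call__(self, image: np.ndarray) -> np.ndarray:
        # image: (H, W, 3) -> features: (H*W, dim)
        self.calls += 1
        return image.reshape(-1, 3) @ self.proj


def decode(features: np.ndarray, text_embs: np.ndarray) -> np.ndarray:
    """Toy cross-modal step: (B, P, D) features x (B, D) prompts -> (B, P) scores."""
    return np.einsum("bpd,bd->bp", features, text_embs)


def naive_detect(backbone, image, text_embs):
    """Prompt-in-the-loop: one full backbone pass per class prompt."""
    scores = []
    for t in text_embs:
        feats = backbone(image)
        scores.append(decode(feats[None], t[None])[0])
    return np.stack(scores)


def batched_detect(backbone, image, text_embs):
    """Shared backbone: compute features once, tile them across N prompts."""
    feats = backbone(image)
    n = text_embs.shape[0]
    tiled = np.broadcast_to(feats, (n, *feats.shape))  # a view, not a copy
    return decode(tiled, text_embs)


rng = np.random.default_rng(1)
image = rng.random((4, 4, 3))
text_embs = rng.standard_normal((4, 8))  # 4 class prompts

naive_bb, batched_bb = ToyBackbone(), ToyBackbone()
naive_scores = naive_detect(naive_bb, image, text_embs)
batched_scores = batched_detect(batched_bb, image, text_embs)
print(naive_bb.calls, batched_bb.calls)
```

The decoder still processes N queries in the batched version; what becomes constant is the number of backbone passes.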
Empirical results show that DART reaches 55.8 AP at 15.8 FPS for 4 classes at 1008×1008 resolution on an RTX 4080, a 5.6× speedup over naive prompt-in-the-loop inference at equivalent accuracy (Turkcan, 12 Mar 2026).
| Model Variant | Classes | AP | FPS | Hardware |
|---|---|---|---|---|
| DART (SAM3/ViT-H, O(1)) | 4 | 55.8 | 15.8 | RTX 4080 |
| DART (Student, Adapter) | 4 | 38.7 | >50 | RTX 4080* |
*Adapter/Student backbone, ~2.5× faster, lower AP.
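As a quick sanity check, the reported throughput can be converted to per-frame latency. The sketch below assumes the 5.6× speedup is a ratio of frame rates in the same 4-class configuration. The naive-baseline values it computes are derived from that assumption and are not figures reported in the source.

```python
def latency_ms(fps: float) -> float:
    """Per-frame latency in milliseconds for a given throughput."""
    return 1000.0 / fps


fps_batched = 15.8  # reported: 4 classes, 1008x1008, RTX 4080
speedup = 5.6       # reported speedup over prompt-in-the-loop inference

latency_batched = latency_ms(fps_batched)
fps_naive = fps_batched / speedup  # implied, assuming an FPS ratio
latency_naive = latency_ms(fps_naive)

print(f"batched: {latency_batched:.1f} ms/frame")
print(f"naive (implied): {fps_naive:.2f} FPS, {latency_naive:.1f} ms/frame")
```

Under this reading, the batched configuration falls within the sub-100 ms per-frame range.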
4. Unified Models for Arbitrary-Prompt Detection and Segmentation
The YOLOE framework (Wang et al., 10 Mar 2025) provides a single model architecture capable of “real-time seeing anything”—accepting text, visual, or prompt-free queries—delivering open-world detection and segmentation at high frame rates:
- Backbone/Neck: CSP-Darknet or equivalent with PAN fuses multi-scale features.
- Heads: Parallel heads predict bounding boxes, instance masks (YOLACT-style), and dense object embeddings.
- Prompt Mechanisms:
  - Text (RepRTA): CLIP-based regions aligned with reparameterized text features; zero overhead at inference.
  - Visual (SAVPE): Spatial masks encoded via dual semantic and activation branches; merged into global prompt vectors.
  - Prompt-free (LRPC): Specialized objectness embedding; only anchors above a threshold are matched to a built-in 4,585-category vocabulary.
The result is a model achieving ∼305 FPS (YOLOE-v8-S, T4) at 27.9 AP on LVIS zero-shot detection in text mode, an increase of 3.5 AP and a 1.4× speedup over the previous SOTA, while also supporting efficient prompt-free detection (Wang et al., 10 Mar 2025).
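The prompt-free idea described above (filter anchors by objectness first, then match only the survivors against a fixed vocabulary) can be sketched in NumPy. This is an illustration of the general technique, not YOLOE's implementation: the function name, the `obj_thresh=0.5` value, the cosine-similarity matching, and the three-word vocabulary are all assumptions for the example.

```python
import numpy as np


def lazy_prompt_free_labels(anchor_embs, objectness, vocab_embs, vocab_names,
                            obj_thresh=0.5):
    """Match only high-objectness anchors to a built-in vocabulary.

    Skipping low-objectness anchors avoids computing similarities between
    every anchor and every vocabulary entry. The threshold is hypothetical.
    """
    keep = np.flatnonzero(objectness > obj_thresh)
    if keep.size == 0:
        return []
    a = anchor_embs[keep]
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    v = vocab_embs / np.linalg.norm(vocab_embs, axis=1, keepdims=True)
    sims = a @ v.T                      # (kept anchors, vocabulary size)
    best = sims.argmax(axis=1)
    return [(int(idx), vocab_names[j], float(sims[row, j]))
            for row, (idx, j) in enumerate(zip(keep, best))]


vocab_names = ["dog", "cat", "bicycle"]  # toy stand-in for the full vocabulary
vocab_embs = np.eye(3)
anchor_embs = np.array([
    [1.0, 0.1, 0.0],   # closest to "dog"
    [0.0, 1.0, 0.0],   # closest to "cat", but low objectness
    [0.0, 0.2, 1.0],   # closest to "bicycle"
    [1.0, 1.0, 1.0],   # ambiguous and low objectness
])
objectness = np.array([0.9, 0.1, 0.7, 0.3])

result = lazy_prompt_free_labels(anchor_embs, objectness, vocab_embs, vocab_names)
print([(idx, name) for idx, name, _ in result])
```

Only anchors 0 and 2 pass the objectness filter here, so the similarity matrix has two rows instead of four.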
5. Real-Time Multi-Modal Alerting Systems
Earlier instantiations of DART, exemplified by iDART, integrate multimodal sensor processing on microcontrollers and single-board computers for physical event detection and alerting (Kumar et al., 2015):
- Sensor Fusion: Ultrasonic rangefinders, beam-break (laser+LDR), and optional cameras.
- Rule-Based Detection: Thresholding and event interrupts on sensor signals (e.g., presence if the ultrasonic distance falls below a set threshold; intrusion flag if the beam is broken).
- Real-Time Communication: MCU triggers ZigBee or Wi-Fi messaging to a central PC; rapid notification via SMTP email, with future extension to SMS/API push.
- Performance: Typical lab results report TPR ≈ 98%, FPR ≈ 5%, mean alert latency ~1.3 s.
Limitations include restricted video coverage and susceptibility to notification loss if reliant solely on email, though modular design enables future enhancement with ML-based vision and distributed, multi-camera networks (Kumar et al., 2015).
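The rule-based detection logic can be sketched in a few lines. This is a hedged illustration, not iDART's firmware: the `SensorReading` type, the 100 cm threshold, and the choice to alert when either rule fires are assumptions, since the source does not specify the threshold value or how the flags are combined.

```python
from dataclasses import dataclass
from typing import List

PRESENCE_THRESHOLD_CM = 100.0  # hypothetical value; not given in the source


@dataclass
class SensorReading:
    ultrasonic_cm: float  # distance reported by the ultrasonic rangefinder
    beam_broken: bool     # True when the laser+LDR beam is interrupted


def detect_events(reading: SensorReading,
                  threshold_cm: float = PRESENCE_THRESHOLD_CM) -> List[str]:
    """Apply the two threshold rules and return the flags that fired."""
    events = []
    if reading.ultrasonic_cm < threshold_cm:
        events.append("presence")
    if reading.beam_broken:
        events.append("intrusion")
    return events


def should_alert(events: List[str]) -> bool:
    """Assumed fusion policy: alert if any rule fired."""
    return bool(events)


readings = [
    SensorReading(ultrasonic_cm=250.0, beam_broken=False),  # empty scene
    SensorReading(ultrasonic_cm=60.0, beam_broken=False),   # object nearby
    SensorReading(ultrasonic_cm=40.0, beam_broken=True),    # both rules fire
]
for r in readings:
    events = detect_events(r)
    print(r, events, should_alert(events))
```

A stricter policy, such as requiring both flags before notifying, would trade missed detections for fewer false alarms.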
6. Comparative Performance and Extensibility
The cited DART systems report strong accuracy and deployment efficiency for both open-vocabulary and class-agnostic detection:
- On industrial datasets (e.g., 23-class construction machinery), the end-to-end DART pipeline raises AP₅₀–₉₅ from 0.064 (manual labeling) to 0.832 and reaches real-time inference rates >400 FPS with YOLOv10n (Xin et al., 2024).
- Promptable DART on COCO val2017 attains 55.8 AP at 15.8 FPS using a 439M-parameter backbone (Turkcan, 12 Mar 2026).
- YOLOE achieves “Detect Anything in Real Time” with substantial speed and accuracy improvements over existing prompt-driven architectures, both in zero-shot and transfer settings (Wang et al., 10 Mar 2025).
Modular designs across all modern DART variants allow:
- Seamless upgrading of backbone, prompt, and decoder modules
- Easy introduction of new object categories via simple image collection/tagging
- Integration of alternative data generators, large language/multimodal reviewers, and downstream detectors
- Fully automated workflows after the initial prompt and category design
7. Directions, Challenges, and Future Trends
DART represents a convergence of generative modeling, vision-language interface design, and hardware-efficient execution. Current limitations include reliance on underlying large models (a potential bottleneck for edge deployment), label noise in pseudo-labels when high-quality LMM review is absent, and dataset or domain drift for physical sensor-based alerting. Research suggests increasing robustness via advanced background modeling, distributed sensor fusion, rapid adaptation to novel categories and environments, and automated feedback loops guided by confidence or human-in-the-loop review (Turkcan, 12 Mar 2026; Xin et al., 2024).
Within practical domains such as industrial QA, surveillance, and home automation, the trajectory is toward intelligent, fully autonomous systems that require minimal manual intervention while maintaining high accuracy and low latency for arbitrary detection and alerting scenarios.